| Day | Code | Notes | Progress |
|---|---|---|---|
| 081 | cuda mode: lecture 1 part 3 | profiling | pytorch cuda profiling (see the torch.profiler sketch below) |
| 080 | cuda mode: lecture 1 part 3 | profiling | inline cuda calls in pytorch (see the load_inline sketch below) |
| 079 | cuda mode: lecture 1 part 3 | profiling | inline cuda calls in pytorch |
| 078 | cuda mode: lecture 1 part 3 | profiling | inline cuda calls in pytorch |
| 077 | cuda mode: lecture 1 part 2 | profiling | profiling kernels in pytorch |
| 076 | cuda mode: lecture 1 | profiling | profiling kernels in pytorch |
| 075 | triton: dropout | triton | dropout bug fixing |
| 074 | triton: dropout | triton | dropout testing |
| 073 | triton: dropout | triton | dropout wrapper |
| 072 | triton: dropout | triton | dropout kernel (see sketch below) |
| 071 | triton: matrix multiplication | triton | matmul kernel testing |
| 070 | triton: matrix multiplication | triton | matmul kernel configs and benchmarking |
| 069 | triton: matrix multiplication | triton | matmul kernel tester |
| 068 | triton: matrix multiplication | triton | matmul kernel wrapper function |
| 067 | triton: matrix multiplication | triton | matmul kernel function |
| 066 | triton: matrix multiplication | triton | matmul kernel function |
| 065 | triton: matrix multiplication | triton | matmul kernel on paper |
| 064 | triton: matrix multiplication | triton | autotuning |
| 063 | triton: fused softmax | triton | added benchmarks |
| 062 | triton: fused softmax | triton | working fused softmax (see sketch below) |
| 061 | triton: fused softmax | triton | benchmarking |
| 060 | triton: fused softmax | triton | kernel call test function |
| 059 | triton: fused softmax | triton | kernel call wrapper testing |
| 058 | triton: fused softmax | triton | kernel call wrapper implementation |
| 057 | triton: fused softmax | triton | gpu specs and kernel call wrapper |
| 056 | triton: fused softmax | triton | understanding fused implementation |
| 055 | triton: fused softmax | triton | fused implementation |
| 054 | triton: fused softmax | triton | ideating fused implementation |
| 053 | triton: fused softmax | triton | naive implementation |
| 052 | triton: vector subtract benchmarking | triton | added benchmarks for vector subtract |
| 051 | triton: naive matmul for turing | triton | fixing naive matrix multiplication for turing gpu |
| 050 | triton: naive matmul | triton | naive matrix multiplication |
| 049 | triton: vector subtraction bug fix | triton | fixing bugs in vec sub |
| 048 | triton: vector subtraction | triton | touching base with basics |
| 047 | triton puzzles: flash attention | triton | started flash attention |
| 046 | triton puzzles: long softmax v2 | triton | softmax on logits, v2 |
| 045 | triton puzzles: long softmax | triton | softmax on logits |
| 044 | triton puzzles: long sum | triton | sum of batch of numbers |
| 043 | triton puzzles: matmul + relu | triton | matrix multiplication and relu |
| 042 | triton puzzles: fused matmul + relu | triton | fused matrix multiplication and relu |
| 041 | triton puzzles: vector addition, row to col | triton | vector addition row and column vectors |
| 040 | triton puzzles: vector addition | triton | vector addition |
| 039 | triton puzzles: constant addition with varying block sizes | triton | constant addition to vector |
| 038 | triton puzzles: constant addition | triton | constant addition |
| 037 | triton puzzles: blocks and loading | triton | 2d tensor loading as blocks |
| 036 | triton puzzles: loading 2d tensors | triton | 2d tensor and tl.store |
| 035 | triton puzzles | triton | tritonviz + debugging meson + puzzles environment setup |
| 034 | benchmarking in triton | triton | triton benchmarking + plots |
| 033 | vector addition in triton | triton | triton setup + vector addition (see sketch below) |
| 032 | knn with vectorized distance computation | float4 | knn + vectorized distance computation + float4 operations |
| 031 | knn with batch distance computation | knn | knn + batch distance computation |
| 030 | knn with thrust for sorting | knn | knn + thrust sorting |
| 029 | knn with tiled distance computation | knn | knn + tiling |
| 028 | knn with shared memory distance calculation | knn | knn + shared memory |
| 027 | baseline gpu knn | knn | knn |
| 026 | bitonic sort with shared memory | sorting | bitonic sort with shared memory |
| 025 | bitonic sort | sorting | bitonic sort |
| 024 | histogram with shared memory and atomic add | using shared memory and atomic adds | atomic operations, race conditions |
| 023 | histogram with atomic adds | using atomic adds | atomic operations, race conditions |
| 022 | register pressure and spilling | reducing register pressure | spilling, high and low register pressure |
| 021 | optimizing warp divergence | optimized warp divergence | warp divergence and optimizing for it |
| 020 | hillis steele prefix sum (optimized) | optimized parallel prefix sum | hillis steele with shared memory |
| 019 | prefix sum (naive) | parallel prefix sum | prefix sum, parallel scanning |
| 018 | parallel reduction (optimized) | optimized parallel reduction | shuffle sync with mask and warps |
| 017 | parallel reduction (naive) | naive parallel reduction | parallel reduction with shared memory |
| 016 | l1 and l2 cache | read about l1, l2 cache and how to write cache friendly code | l1, l2 cache |
| 015 | matrix multiplication with block tiling | optimizing mat mul using block tiling | block tiling |
| 014 | matrix multiplication sgemm shared memory | optimizing mat mul using memory blocking | shared memory, memory blocking |
| 013 | optimizing matrix multiplication | optimizing mat mul using coalescing | coalescing memory and warp scheduling |
| 012 | shared memory | matrix multiplication and shared memory | read about shared memory, registers and warps, bank conflicts, reading matrix multiplication blog by siboehm |
| 011 | optimizing matrix multiplication | matrix multiplication and profiling | using nsys and nvprof, reading matrix multiplication blog by siboehm |
| 010 | face blur | read matrix multiplication blog | reading matrix multiplication blog by siboehm, using a compiled kernel in python |
| 009 | matrix transpose | matrix transpose and matrix multiplication blog | started reading matrix multiplication blog by siboehm, started chapter 4 of PMPP |
| 008 | matrix multiply and helpers | matrix multiplication, pinned memory and BLAS | read about pinned memory, pageable memory and cudaHostAlloc(). finished chapter 3 of PMPP |
| 007 | vector multiply and helpers | internal structure of blocks | set up gpu env on new server. studied hierarchy of execution within the streaming multiprocessor. created helpers file. |
| 006 | gaussianBlurSharedMemory with event times | event times and performance measurement | added perf measurement code to gaussian blur with shared memory kernel |
| 005 | gaussianBlurSharedMemory | PMPP Chapter 3 & exploration | built on top of gaussian blur; learnt about shared memory and implemented it |
| 004 | gaussianBlur | PMPP Chapter 3 | built on top of image blur; struggling to understand multidimensionality |
| 003 | imageBlur | PMPP Chapter 3 | read parts of image blur and about better ways to handle errors, image blurring logic |
| 002 | colorToGrayScaleConversion | PMPP Chapter 3 | read half of chapter 2 of pmpp, implemented color to grayscale conversion |
| 001 | vecAbsDiff | PMPP Chapter 2 | read chapter 2 of pmpp, implemented vector absolute difference kernel |
| 000 | - | PMPP | setup environment, lecture 1 of ECE 408, chapter 1 of PMPP |
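
Days 033 and 040 cover vector addition in Triton. The sketch below is a minimal, illustrative version of such a kernel and its launch wrapper (the names `add_kernel`/`add` and `BLOCK_SIZE = 1024` are assumptions, not the exact code from those days):

```python
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide chunk of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the final, partially filled block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = out.numel()
    # 1D launch grid: enough programs to cover every element.
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out


if __name__ == "__main__":
    a = torch.rand(98432, device="cuda")
    b = torch.rand(98432, device="cuda")
    print(torch.allclose(add(a, b), a + b))
```

The same pattern (program id → offsets → mask → load/compute/store) carries through most of the later kernels in the log.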
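
Days 053–063 build up a fused row-wise softmax. A minimal sketch, assuming one program per row and a `BLOCK_SIZE` equal to the next power of two at or above the row length (names are illustrative):

```python
import torch
import triton
import triton.language as tl


@triton.jit
def softmax_kernel(out_ptr, in_ptr, in_row_stride, out_row_stride, n_cols,
                   BLOCK_SIZE: tl.constexpr):
    # One program per row; the whole row stays on-chip, so the max, exp,
    # sum, and division are fused into a single kernel launch.
    row_idx = tl.program_id(0)
    col_offsets = tl.arange(0, BLOCK_SIZE)
    mask = col_offsets < n_cols
    row = tl.load(in_ptr + row_idx * in_row_stride + col_offsets,
                  mask=mask, other=-float("inf"))
    row = row - tl.max(row, axis=0)  # subtract row max for numerical stability
    numerator = tl.exp(row)
    denominator = tl.sum(numerator, axis=0)
    tl.store(out_ptr + row_idx * out_row_stride + col_offsets,
             numerator / denominator, mask=mask)


def softmax(x: torch.Tensor) -> torch.Tensor:
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    softmax_kernel[(n_rows,)](out, x, x.stride(0), out.stride(0), n_cols,
                              BLOCK_SIZE=BLOCK_SIZE)
    return out


if __name__ == "__main__":
    x = torch.randn(1823, 781, device="cuda")
    print(torch.allclose(softmax(x), torch.softmax(x, dim=1), atol=1e-6))
```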
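
Days 072–075 implement dropout in Triton. A minimal sketch of a seeded, inverted-dropout kernel (the names and the `seed` handling are assumptions):

```python
import torch
import triton
import triton.language as tl


@triton.jit
def dropout_kernel(x_ptr, out_ptr, n_elements, p, seed, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    # Deterministic per-element randomness from (seed, offset); no dropout
    # mask tensor has to be materialized in global memory.
    random = tl.rand(seed, offsets)
    keep = random > p
    out = tl.where(keep, x / (1 - p), 0.0)  # inverted dropout: rescale kept values
    tl.store(out_ptr + offsets, out, mask=mask)


def dropout(x: torch.Tensor, p: float, seed: int = 123) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = x.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    dropout_kernel[grid](x, out, n_elements, p, seed, BLOCK_SIZE=1024)
    return out
```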
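
Days 078–080 call inline CUDA from PyTorch. A minimal sketch using `torch.utils.cpp_extension.load_inline`; the elementwise square kernel here is only an illustrative stand-in for whatever kernel those days actually compiled:

```python
import torch
from torch.utils.cpp_extension import load_inline

# CUDA source: a simple elementwise square kernel plus a C++ launcher.
# load_inline compiles this into a .cu file and auto-generates Python
# bindings for the functions listed below.
cuda_source = r"""
__global__ void square_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

torch::Tensor square(torch::Tensor input) {
    auto x = input.contiguous();
    auto out = torch::empty_like(x);
    int n = x.numel();
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    square_kernel<<<blocks, threads>>>(x.data_ptr<float>(), out.data_ptr<float>(), n);
    return out;
}
"""

cpp_source = "torch::Tensor square(torch::Tensor input);"

ext = load_inline(
    name="inline_square",
    cpp_sources=cpp_source,
    cuda_sources=cuda_source,
    functions=["square"],
)

x = torch.randn(1 << 20, device="cuda")
print(torch.allclose(ext.square(x), x * x))
```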
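
Days 076, 077, and 081 profile kernels from PyTorch. A minimal sketch with `torch.profiler` (the profiled op is just a placeholder):

```python
import torch
from torch.profiler import profile, ProfilerActivity


def run(x):
    # Any GPU workload works here; a matmul is an easy stand-in.
    return torch.square(x @ x)


x = torch.randn(1024, 1024, device="cuda")

# Warm up so one-time CUDA context and kernel-launch costs don't dominate.
for _ in range(3):
    run(x)
torch.cuda.synchronize()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    run(x)
    torch.cuda.synchronize()

# Per-kernel timings sorted by total CUDA time; prof.export_chrome_trace()
# would additionally produce a timeline viewable in Perfetto/chrome://tracing.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```
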
- Programming Massively Parallel Processors
- CUDA 120 Days Challenge
- ECE 408
- LLMs
| Objective | Topic | Task/Implementation | Status |
|---|---|---|---|
| Phase 1: Foundations | Goal: Understand CUDA fundamentals, memory hierarchy, and write basic optimized kernels. | ||
| 1 | CUDA Setup & First Kernel | Install CUDA, write a vector addition kernel | ✅ |
| 2 | Thread Hierarchy | Grids, blocks, threads, experimenting with configurations | ✅ |
| 3 | Memory Model Basics | Global, shared, local memory overview | ✅ |
| 4 | Memory Coalescing | Optimize vector addition using shared memory | ✅ |
| 5 | Matrix Multiplication (Naïve) | Implement basic matrix multiplication | ✅ |
| 6 | Matrix Multiplication (Optimized) | Use shared memory to optimize | |
| 7 | Profiling Basics | Use nvprof and nsys to analyze kernels | ✅ |
| 8 | L1/L2 Cache Effects | Study cache behavior and memory bandwidth | ✅ |
| 9 | Tiled Matrix Multiplication | Further optimize matrix multiplication | |
| 10 | Register Pressure | Optimize register usage and reduce spilling | ✅ |
| 11 | Warp Execution Model | Avoiding warp divergence | ✅ |
| 12 | Parallel Reduction (Naïve) | Implement sum/max reductions | ✅ |
| 13 | Parallel Reduction (Optimized) | Optimize with warp shuffle (__shfl_sync) | ✅ |
| 14 | Code Review & Optimization | Refine and benchmark previous work | |
| 15 | Parallel Scan (Prefix Sum) | Implement parallel scan algorithm | ✅ |
| 16 | Histogram (Naïve) | Implement histogram using global memory atomics | ✅ |
| 17 | Histogram (Optimized) | Use shared memory to optimize histogram | ✅ |
| 18 | Parallel Sorting | Implement bitonic or bucket sort | ✅ |
| 19 | k-Nearest Neighbors | Implement kNN search using CUDA | |
| 20 | Code Review & Benchmarking | Optimize and compare previous implementations | |
| Phase 2: ML Operators | Goal: Implement and optimize core ML kernels. | ||
| 21 | Dense Matrix-Vector Multiplication | Implement y = Wx + b in CUDA | |
| 22 | Fully Connected Layer | Implement dense forward pass | |
| 23 | ReLU & Softmax | Implement activation functions | |
| 24 | Backpropagation | Implement BP for a single layer | |
| 25 | 1D Convolution (Naïve) | Implement 1D convolution | |
| 26 | 1D Convolution (Optimized) | Optimize with shared memory | |
| 27 | Profiling DL Kernels | Compare CUDA vs. PyTorch performance | |
| 28 | 2D Convolution (Naïve) | Implement 2D convolution | |
| 29 | 2D Convolution (Optimized) | Use shared memory for optimization | |
| 30 | Im2Col + GEMM Conv | Implement im2col approach | |
| 31 | Depthwise Separable Conv | Optimize CNN inference workloads | |
| 32 | Batch Norm & Activation Fusion | Optimize BN + activation | |
| 33 | Code Review & Optimization | Refine previous work | |
| 34 | Benchmarking ML Kernels | Compare different CNN implementations | |
| 35 | LayerNorm in CUDA | Implement LayerNorm from scratch | |
| 36 | Efficient Dropout | Optimize dropout for training speed | |
| 37 | Fused MLP Block | Implement fused MLP (GEMM + activation + dropout) | |
| 38 | Transformer Attention (Naïve) | Implement self-attention kernel | |
| 39 | Optimized Self-Attention | Optimize self-attention with shared memory | |
| 40 | Benchmark Transformer Layers | Compare against torch.nn.MultiheadAttention | |
| 41 | Tensor Cores & FP16 | Implement FP16 computation | |
| 42 | Gradient Accumulation | Optimize training with gradient accumulation | |
| 43 | Mixed Precision Training (AMP) | Implement AMP optimizations | |
| 44 | Optimized Attention (FlashAttention) | Implement FlashAttention concepts | |
| 45 | Fused LayerNorm + Dropout | Optimize memory and performance | |
| 46 | Large-Scale Training Profiling | Analyze memory bottlenecks | |
| Phase 3: Advanced CUDA & Large-Scale ML | Goal: Optimize LLMs, multi-GPU training, and memory-efficient kernels. | ||
| 47 | Multi-GPU Data Parallelism | Implement data parallel training | |
| 48 | Multi-GPU Model Parallelism | Implement model parallel training | |
| 49 | Efficient Multi-GPU Communication | Study NCCL and all-reduce ops | |
| 50 | Large Model Optimization | Optimize large-scale deep learning models | |
| 51 | Rotary Embeddings | Implement rotary embeddings in CUDA | |
| 52 | Fused Transformer Block | Implement fused transformer kernel | |
| 53 | LLM Batch Processing | Optimize inference for large batch sizes | |
| 54 | FlashAttention-Like Kernels | Implement memory-efficient attention | |
| 55 | Memory Optimization for LLMs | Optimize LLM inference footprint | |
| 56 | GPU Benchmarking | Compare performance across GPUs | |
| 57 | Architecture-Specific Optimizations | Tune for Ampere/Hopper GPUs | |
| 58 | CUDA Graphs | Implement CUDA Graphs for execution efficiency | |
| 59 | Memory Fragmentation Optimization | Optimize dynamic allocations | |
| 60 | Benchmarking | Compare PyTorch/TensorFlow vs. your CUDA implementations (see the sketch after this table) | |
| 61 | Optimize a Real-World Model | Pick a model (BERT/GPT) and optimize | |
| 62 | Custom CUDA Model Acceleration | Implement a custom CUDA-based model optimization | |
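
Several items above (e.g. 27, 34, and 60) call for comparing custom kernels against PyTorch. A minimal sketch of such a comparison with `torch.utils.benchmark`; `my_softmax` is a hypothetical stand-in for the custom kernel under test:

```python
import torch
import torch.utils.benchmark as benchmark


def my_softmax(x):
    # Placeholder: substitute the custom CUDA/Triton softmax being benchmarked.
    return torch.softmax(x, dim=-1)


x = torch.randn(4096, 4096, device="cuda")

# Timer handles warm-up and CUDA synchronization for each measurement.
t_custom = benchmark.Timer(
    stmt="my_softmax(x)",
    globals={"my_softmax": my_softmax, "x": x},
)
t_torch = benchmark.Timer(
    stmt="torch.softmax(x, dim=-1)",
    globals={"torch": torch, "x": x},
)

print(t_custom.timeit(100))
print(t_torch.timeit(100))
```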