01 Sequential vs Parallel: Race conditions, data dependencies, parallel hazards
02 Amdahl's Law: Speedup limits, sequential bottlenecks, efficiency
03 Parallel Patterns: Embarrassingly parallel, reduction, stencil, irregular
04 NumPy Vectorization: Loops vs vectors, broadcasting, GPU thinking

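The speedup limit from lesson 02 can be sketched numerically; the `amdahl_speedup` helper below is an illustrative stand-in, not code from the notebooks:

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    # Amdahl's Law: overall speedup is capped by the sequential fraction,
    # no matter how many workers run the parallel part.
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)

# With 90% of the work parallelizable, 10 workers give ~5.3x, and even
# unbounded workers can never exceed 1 / (1 - 0.9) = 10x.
ten_workers = amdahl_speedup(0.9, 10)
many_workers = amdahl_speedup(0.9, 10**9)
```

The sequential 10% dominates quickly, which is why the later lessons focus on removing serial bottlenecks before adding parallel hardware.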
00 Environment Check: Verify GPU setup and dependencies
01 NumPy Baseline: Establish CPU performance benchmarks
02 CuPy Introduction: Instant GPU speedup with a drop-in NumPy replacement
03 GPU Architecture: Understand SMs, warps, and the execution model
04 First Triton Kernel: Write your first GPU kernel with Triton
05 Memory Hierarchy: Profile and understand memory access patterns
06 Tiling Basics: Implement tiled memory access for better cache utilization
07 Fast Matrix Multiplication: Achieve 500+ GFLOPS with an optimized matmul

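The tiling idea behind lessons 06 and 07 can be previewed on the CPU with plain NumPy: a blocked matmul that works on tile-by-tile sub-blocks, the same decomposition a Triton kernel applies with shared memory. A minimal sketch (tile size and function name are illustrative, not the notebooks' API):

```python
import numpy as np

def tiled_matmul(A: np.ndarray, B: np.ndarray, tile: int = 32) -> np.ndarray:
    # Blocked matrix multiply: each tile x tile sub-block is small enough to
    # stay resident in cache (or, on a GPU, in shared memory) while it is
    # reused across the k-loop.
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=np.result_type(A.dtype, B.dtype))
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                C[i:i + tile, j:j + tile] += (
                    A[i:i + tile, k:k + tile] @ B[k:k + tile, j:j + tile]
                )
    return C
```

Slicing clamps at array edges, so shapes need not be multiples of the tile size; the result matches `A @ B` up to floating-point rounding.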
01 Profiling with Nsight: Learn to use Nsight Compute for kernel analysis
02 Memory Coalescing: Optimize global memory access patterns
03 Bank Conflicts: Eliminate shared memory bottlenecks
04 Software Pipelining: Overlap compute and memory operations
05 TMA (Hopper+): Hardware-accelerated async data movement
07 Optimized GEMM: Put it all together for peak performance

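The coalescing lesson (02) comes down to counting how many fixed-size memory segments one warp's 32 loads touch. The toy model below (32-byte segments, 4-byte elements) is an assumption for illustration; real coalescing rules vary by GPU architecture:

```python
def segments_touched(addresses, segment_bytes: int = 32) -> int:
    # Each distinct segment a warp touches costs roughly one memory
    # transaction, so fewer segments means better coalescing.
    return len({addr // segment_bytes for addr in addresses})

warp = range(32)
coalesced = [4 * t for t in warp]   # thread t reads the float at byte 4*t
strided = [128 * t for t in warp]   # stride-32 access: one segment per thread
```

Under this model the contiguous pattern needs 4 transactions for the warp while the strided one needs 32, an 8x difference in memory traffic for the same amount of useful data.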
01 Dot Product Attention: Implement the basic QK^T computation
02 The Softmax Problem: Why naive softmax fails at scale
03 Stable Softmax: Numerical stability via max subtraction
04 Full Attention: Complete attention implementation
06 Tiled Attention: Block-wise computation for memory efficiency
07 FlashAttention: Production-grade fused attention kernel

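Lessons 02 and 03 hinge on one numerical fact: `exp` overflows for large logits, but subtracting the row max leaves softmax mathematically unchanged. A minimal NumPy sketch of the stable form:

```python
import numpy as np

def stable_softmax(x: np.ndarray) -> np.ndarray:
    # Shifting by the row max doesn't change the ratio e^{x_i} / sum_j e^{x_j},
    # but it keeps every exponent <= 0, so exp can never overflow.
    z = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

logits = np.array([[1000.0, 1001.0, 1002.0]])
# Naive softmax fails here: np.exp(1000.0) overflows to inf, giving inf/inf = nan.
probs = stable_softmax(logits)  # finite and sums to 1
```

FlashAttention applies the same trick block-wise, carrying a running max and running sum so tiles can be processed without materializing the full score matrix.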
01 FP8 Conversion: Convert between floating-point formats
02 Quantization Fundamentals: Symmetric and asymmetric quantization
03 INT8 and INT4: Integer quantization for inference
05 KV Cache Strategy: Efficient key-value cache management
06 Fused Quantized Attention: Combine quantization with FlashAttention
07 Production Integration: Deploy optimized kernels in serving systems
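The symmetric scheme from lessons 02 and 03 can be sketched in NumPy; the function names and max-abs calibration below are illustrative choices, not the notebooks' API:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    # Symmetric quantization: a single scale maps the largest magnitude
    # to 127, so float zero maps exactly to int8 zero (no zero-point).
    scale = max(float(np.max(np.abs(x))) / 127.0, 1e-12)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale
```

Asymmetric quantization adds a zero-point offset so an unbalanced range (e.g. post-ReLU activations) can use the full int8 grid; the round-trip error of the symmetric form is bounded by half a quantization step.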