01 Sequential vs Parallel: Race conditions, data dependencies, parallel hazards
02 Amdahl's Law: Speedup limits, sequential bottlenecks, efficiency
03 Parallel Patterns: Embarrassingly parallel, reduction, stencil, irregular
04 NumPy Vectorization: Loops vs vectors, broadcasting, GPU thinking

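The speedup limit from lesson 02 can be sketched numerically; the `amdahl_speedup` helper below is an illustrative stand-in, not code from the notebooks:

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    # Amdahl's Law: overall speedup is capped by the sequential fraction,
    # no matter how many workers run the parallel part.
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)

# With 90% of the work parallelizable, 10 workers give ~5.3x, and even
# unbounded workers can never exceed 1 / (1 - 0.9) = 10x.
ten_workers = amdahl_speedup(0.9, 10)
many_workers = amdahl_speedup(0.9, 10**9)
```

The sequential 10% dominates quickly, which is why the later lessons focus on removing serial bottlenecks before adding parallel hardware.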
00 Environment Check: Verify GPU setup and dependencies
01 NumPy Baseline: Establish CPU performance benchmarks
02 CuPy Introduction: Instant GPU speedup with a drop-in NumPy replacement
03 GPU Architecture: Understand SMs, warps, and the execution model
04 First Triton Kernel: Write your first GPU kernel with Triton
05 Memory Hierarchy: Profile and understand memory access patterns
06 Tiling Basics: Implement tiled memory access for better cache utilization
07 Fast Matrix Multiplication: Achieve 500+ GFLOPS with an optimized matmul

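The tiling idea behind lessons 06 and 07 can be previewed on the CPU with plain NumPy: a blocked matmul that works on tile-by-tile sub-blocks, the same decomposition a Triton kernel applies with shared memory. A minimal sketch (tile size and function name are illustrative, not the notebooks' API):

```python
import numpy as np

def tiled_matmul(A: np.ndarray, B: np.ndarray, tile: int = 32) -> np.ndarray:
    # Blocked matrix multiply: each tile x tile sub-block is small enough to
    # stay resident in cache (or, on a GPU, in shared memory) while it is
    # reused across the k-loop.
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=np.result_type(A.dtype, B.dtype))
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                C[i:i + tile, j:j + tile] += (
                    A[i:i + tile, k:k + tile] @ B[k:k + tile, j:j + tile]
                )
    return C
```

Slicing clamps at array edges, so shapes need not be multiples of the tile size; the result matches `A @ B` up to floating-point rounding.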
01 Profiling with Nsight: Learn to use Nsight Compute for kernel analysis
02 Memory Coalescing: Optimize global memory access patterns
03 Bank Conflicts: Eliminate shared memory bottlenecks
04 Software Pipelining: Overlap compute and memory operations
05 TMA (Hopper+): Hardware-accelerated async data movement
07 Optimized GEMM: Put it all together for peak performance

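The coalescing lesson (02) comes down to counting how many fixed-size memory segments one warp's 32 loads touch. The toy model below (32-byte segments, 4-byte elements) is an assumption for illustration; real coalescing rules vary by GPU architecture:

```python
def segments_touched(addresses, segment_bytes: int = 32) -> int:
    # Each distinct segment a warp touches costs roughly one memory
    # transaction, so fewer segments means better coalescing.
    return len({addr // segment_bytes for addr in addresses})

warp = range(32)
coalesced = [4 * t for t in warp]   # thread t reads the float at byte 4*t
strided = [128 * t for t in warp]   # stride-32 access: one segment per thread
```

Under this model the contiguous pattern needs 4 transactions for the warp while the strided one needs 32, an 8x difference in memory traffic for the same amount of useful data.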
01 Dot Product Attention: Implement the basic QK^T computation
02 The Softmax Problem: Why naive softmax fails at scale
03 Stable Softmax: Numerical stability via max subtraction
04 Full Attention: Complete attention implementation
06 Tiled Attention: Block-wise computation for memory efficiency
07 FlashAttention: Production-grade fused attention kernel

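Lessons 02 and 03 hinge on one numerical fact: `exp` overflows for large logits, but subtracting the row max leaves softmax mathematically unchanged. A minimal NumPy sketch of the stable form:

```python
import numpy as np

def stable_softmax(x: np.ndarray) -> np.ndarray:
    # Shifting by the row max doesn't change the ratio e^{x_i} / sum_j e^{x_j},
    # but it keeps every exponent <= 0, so exp can never overflow.
    z = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

logits = np.array([[1000.0, 1001.0, 1002.0]])
# Naive softmax fails here: np.exp(1000.0) overflows to inf, giving inf/inf = nan.
probs = stable_softmax(logits)  # finite and sums to 1
```

FlashAttention applies the same trick block-wise, carrying a running max and running sum so tiles can be processed without materializing the full score matrix.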
01 FP8 Conversion: Convert between floating-point formats
02 Quantization Fundamentals: Symmetric and asymmetric quantization
03 INT8 and INT4: Integer quantization for inference
05 KV Cache Strategy: Efficient key-value cache management
06 Fused Quantized Attention: Combine quantization with FlashAttention
07 Production Integration: Deploy optimized kernels in serving systems
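The symmetric scheme from lessons 02 and 03 can be sketched in NumPy; the function names and max-abs calibration below are illustrative choices, not the notebooks' API:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    # Symmetric quantization: a single scale maps the largest magnitude
    # to 127, so float zero maps exactly to int8 zero (no zero-point).
    scale = max(float(np.max(np.abs(x))) / 127.0, 1e-12)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale
```

Asymmetric quantization adds a zero-point offset so an unbalanced range (e.g. post-ReLU activations) can use the full int8 grid; the round-trip error of the symmetric form is bounded by half a quantization step.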