Chapter 1

GPU Fundamentals

Understanding the hardware execution model. From SMs to warps to threads: learn how modern NVIDIA GPUs actually execute your code.

What You'll Learn
  1. Explain why GPUs optimize for throughput over latency
  2. Describe the SM → Warp → Thread hierarchy
  3. Calculate theoretical occupancy given resource constraints
  4. Identify warp divergence in code and explain its performance impact
  5. Interpret basic GPU specifications (SM count, memory bandwidth)
01 - THE BIG PICTURE

Throughput Over Latency

CPUs optimize for latency—making single tasks fast. GPUs optimize for throughput—completing many tasks in parallel, even if each individual task takes longer.

A CPU might have 8-16 cores running at 4+ GHz with massive caches. A GPU has thousands of smaller cores running at ~2 GHz with limited cache per core. The magic is in the parallelism.

  • 192GB HBM3e memory
  • ~8 TB/s memory bandwidth
  • ~2.5 GHz clock speed

192 SMs = 192 mini-processors running thousands of threads each
8 TB/s = every book ever written in under 2 seconds

*B200 specifications from NVIDIA Blackwell Architecture. Verify against official datasheets for production use.

Key Insight

GPU programming is about keeping thousands of threads busy. When one thread waits for memory, others execute. This latency hiding is fundamental to GPU performance.

A GPU is faster than a CPU at parallel workloads because of:
Higher clock speed
More parallel execution units
Larger cache per core

02 - EXECUTION HIERARCHY

GPU → SM → Warp → Thread

NVIDIA GPUs have a strict hierarchy. Understanding each level is essential for writing efficient kernels.

GPU Device

Blackwell B200: up to 192 SMs, 192GB HBM3e, ~8TB/s bandwidth*

The top-level device. What matters most: SM count (parallel execution units) and memory bandwidth (data throughput). More SMs = more parallelism. Higher bandwidth = faster data movement.
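
To read these numbers off whatever GPU you are actually running, the CUDA runtime can report them at startup. A minimal sketch (queries device 0, error checking omitted); note that peak memory bandwidth is not reported directly, so for a specific part that figure still comes from the datasheet:

  #include <cstdio>
  #include <cuda_runtime.h>

  int main() {
      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, 0);   // query device 0

      printf("GPU:              %s\n", prop.name);
      printf("SMs:              %d\n", prop.multiprocessorCount);
      printf("Global memory:    %.1f GB\n", prop.totalGlobalMem / 1e9);
      printf("Warp size:        %d\n", prop.warpSize);
      printf("Max warps per SM: %d\n", prop.maxThreadsPerMultiProcessor / prop.warpSize);
      return 0;
  }
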
Streaming Multiprocessor (SM)

The fundamental execution unit

Each SM contains: 4 warp schedulers, ~256KB registers, 228KB shared memory, L1 cache, CUDA cores, and Tensor Cores. Multiple thread blocks can run on one SM simultaneously (limited by resources).
Warp: 32 Threads in Lockstep

The atomic scheduling unit (SIMT), like a marching band playing in unison

This is the most important concept. All 32 threads in a warp execute the same instruction simultaneously. Divergent branches serialize execution. Always think in warps, not individual threads.
Warpgroup (Hopper+)

4 warps (128 threads) for Tensor Core operations

Modern Tensor Core operations work at warpgroup granularity. This enables larger tiles (64x64+), asynchronous execution, and better hardware utilization. CuTe DSL uses warpgroups for SM90+ architectures.
Thread

Individual execution context with private registers

Each thread has its own registers and local memory. Threads within a warp can communicate via shuffle instructions. Threads within a block can communicate via shared memory and synchronize with __syncthreads().
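
The sketch below (a hypothetical blockSum kernel, assuming blockDim.x is a multiple of 32 and at most 1024) shows both communication paths: lanes within a warp exchange values directly with __shfl_down_sync, and the per-warp results are then combined through shared memory behind a __syncthreads() barrier.

  // Hypothetical kernel: each warp sums its 32 values with shuffle instructions,
  // then the per-warp partial sums are combined through shared memory.
  __global__ void blockSum(const float* in, float* out) {
      __shared__ float warpSums[32];              // one slot per warp in the block

      int tid  = blockIdx.x * blockDim.x + threadIdx.x;
      int lane = threadIdx.x % 32;                // position within the warp
      int warp = threadIdx.x / 32;                // warp index within the block

      float val = in[tid];

      // Warp-level: lanes exchange values directly, no shared memory or barrier needed.
      for (int offset = 16; offset > 0; offset /= 2)
          val += __shfl_down_sync(0xffffffff, val, offset);

      if (lane == 0) warpSums[warp] = val;        // lane 0 now holds the warp's sum
      __syncthreads();                            // block-level: wait for every warp

      // The first warp reduces the per-warp partial sums the same way.
      if (warp == 0) {
          val = (lane < blockDim.x / 32) ? warpSums[lane] : 0.0f;
          for (int offset = 16; offset > 0; offset /= 2)
              val += __shfl_down_sync(0xffffffff, val, offset);
          if (lane == 0) out[blockIdx.x] = val;
      }
  }
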
Why does the warp size matter?
The warp size of 32 is a hardware constant dating back to early NVIDIA GPUs. Memory transactions, shuffle operations, and voting all happen at warp granularity. Block sizes should always be multiples of 32 to avoid partial warps (wasted threads). The 128-thread warpgroup on Hopper+ exists specifically to match Tensor Core tile sizes.
03 - WARPS

The Warp: GPU's Atomic Unit

A warp is 32 threads that execute in SIMT (Single Instruction, Multiple Threads) fashion. Every thread in a warp executes the same instruction, but on different data.

When threads in a warp take different branches (if/else), execution serializes. Both paths run, with inactive threads masked. 16 threads per path = 50% throughput. Minimize divergence within warps for maximum performance.
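
As an illustration (hypothetical kernels, not taken from the labs), the two kernels below do equivalent work. The first branches on the lane index, so every warp diverges and executes both paths; the second branches on the warp index, so each warp follows a single path.

  // Divergent: the branch splits lanes inside every warp, so each warp runs
  // both paths with half its lanes masked off each time (about half throughput).
  __global__ void divergent(float* data) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (threadIdx.x % 2 == 0)
          data[i] *= 2.0f;
      else
          data[i] += 1.0f;
  }

  // Warp-uniform: the condition depends only on the warp index, so all 32 lanes
  // of any given warp take the same path and nothing is serialized.
  __global__ void warpUniform(float* data) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if ((threadIdx.x / 32) % 2 == 0)
          data[i] *= 2.0f;
      else
          data[i] += 1.0f;
  }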

A warp contains how many threads?
16 threads
32 threads
64 threads
128 threads (that's a warpgroup)
If you launch 256 threads per block, how many warps is that?
4 warps
8 warps (256 / 32 = 8)
16 warps

04 - BLOCKS & GRIDS

Organizing Your Parallelism

Block (also called CTA - Cooperative Thread Array): A group of warps that share resources and can synchronize. Blocks run on a single SM.

Grid: Your problem decomposition into blocks. The grid is how you map your problem to the GPU's parallel execution model.

Block Composition
Grid (your problem)
 ├── Block 0 (256 threads = 8 warps)
 │    ├── Warp 0: threads 0-31
 │    ├── Warp 1: threads 32-63
 │    ├── ...
 │    ├── Warp 7: threads 224-255
 │    ├── [Shared Memory: 48KB]
 │    └── [Can __syncthreads()]
 ├── Block 1 (256 threads = 8 warps)
 ├── Block 2 ...
 └── Block N-1

Block Size Selection

Common block sizes and their trade-offs:

  • 128 threads (4 warps): 1 warpgroup, good for Tensor Core ops
  • 256 threads (8 warps): Balanced, most common choice
  • 512 threads (16 warps): More parallelism, higher register pressure

Block size affects: shared memory per thread, register availability, and occupancy.
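
A minimal launch sketch under those guidelines (the scale kernel and its arguments are illustrative): pick a block size that is a multiple of 32, compute the grid size with ceiling division so every element is covered, and bounds-check the partial last block.

  // Illustrative element-wise kernel: one thread per array element.
  __global__ void scale(float* data, float factor, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
      if (i < n)                                       // guard the partial last block
          data[i] *= factor;
  }

  void launchScale(float* d_data, float factor, int n) {
      int blockSize = 256;                              // 8 warps: a common balanced choice
      int gridSize  = (n + blockSize - 1) / blockSize;  // ceiling division covers all n elements
      scale<<<gridSize, blockSize>>>(d_data, factor, n);
  }
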

What can threads within the same block do that threads in different blocks cannot?
Access global memory
Share data via shared memory and synchronize
Run on the same SM
Use Tensor Cores

05 - OCCUPANCY

Keeping the GPU Busy

Occupancy is the ratio of active warps to the maximum warps an SM can support. Higher occupancy generally means better latency hiding, but it's not the only factor in performance.

Occupancy Calculation

Occupancy is determined by three resource inputs: block size, registers per thread, and shared memory per block. Each one caps how many blocks (and therefore warps) can be resident on an SM at once, and the tightest limit wins. For example, a configuration whose tightest limit allows 32 active warps against a hardware maximum of 64 warps per SM gives 32 / 64 = 50% occupancy.

Simplified model based on Hopper SM (64 max warps, 256KB registers, 228KB shared memory).
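
That simplified model can be written down directly. This host-side sketch hard-codes the Hopper-like limits above and ignores real-world details such as the per-SM block limit and register allocation granularity, so treat it as an estimate only:

  #include <algorithm>
  #include <climits>
  #include <cstdio>

  // Theoretical active warps per SM under the simplified model above:
  // 64 max warps, 65,536 32-bit registers (256KB), 228KB shared memory.
  int activeWarpsPerSM(int blockSize, int regsPerThread, int smemPerBlock) {
      const int maxWarps = 64, regsPerSM = 65536, smemPerSM = 228 * 1024;

      int warpsPerBlock = (blockSize + 31) / 32;
      int byRegs  = regsPerSM / std::max(1, regsPerThread * blockSize);
      int bySmem  = smemPerBlock > 0 ? smemPerSM / smemPerBlock : INT_MAX;
      int byWarps = maxWarps / warpsPerBlock;

      int residentBlocks = std::min({byRegs, bySmem, byWarps});  // tightest limit wins
      return residentBlocks * warpsPerBlock;
  }

  int main() {
      int active = activeWarpsPerSM(256, 64, 32 * 1024);  // 256 threads, 64 regs, 32KB smem
      printf("active warps: %d, occupancy: %.0f%%\n", active, 100.0 * active / 64);
      return 0;
  }

With 256 threads per block, 64 registers per thread, and 32KB of shared memory per block, this reports 32 active warps, i.e. the 50% occupancy in the example above. For real kernels, the runtime API cudaOccupancyMaxActiveBlocksPerMultiprocessor applies the actual per-architecture limits.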

When is low occupancy actually better?
Compute-bound kernels with high instruction-level parallelism (ILP) often perform better at 50% occupancy with more registers per thread. More registers = fewer spills to local memory, more values kept in fast storage. Use --maxrregcount in nvcc or occupancy calculators to experiment. Always profile with real workloads.
Occupancy vs Performance

100% occupancy doesn't guarantee best performance. Sometimes using more registers (lower occupancy) enables better instruction-level parallelism. Profile your actual kernel to find the sweet spot.
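
Two knobs for experimenting with that trade-off (the numbers below are illustrative, not recommendations): the --maxrregcount compiler flag mentioned earlier, and the __launch_bounds__ qualifier, which tells the compiler the launch configuration you intend so it can budget registers per thread accordingly.

  // Cap registers for every kernel in a file (illustrative value):
  //   nvcc --maxrregcount=64 kernel.cu
  //
  // Or hint per kernel: at most 256 threads per block, and ask the compiler to
  // budget registers so at least 4 blocks can be resident per SM.
  __global__ void __launch_bounds__(256, 4)
  scaleKernel(float* data) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      data[i] *= 2.0f;
  }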

Which resource does NOT directly limit occupancy?
Registers per thread
Shared memory per block
Global memory size

PRACTICE

Hands-On Labs

REFERENCES

Citations & Further Reading

Video Resources

High-quality explanations of GPU architecture concepts.

How do Graphics Cards Work? (Branch Education)

Excellent visual explanation of GPU architecture fundamentals, parallelism, and memory hierarchy.

Watch on YouTube
CUDA Programming Model (NVIDIA Developer)

Official NVIDIA explanation of threads, blocks, grids, and the CUDA execution model.

Read: CUDA Refresher Series - developer.nvidia.com

Primary Documentation

  1. NVIDIA CUDA C++ Programming Guide
    Chapters 4-5: Thread Hierarchy, SIMT Architecture, Memory Hierarchy
    docs.nvidia.com/cuda/cuda-c-programming-guide
  2. NVIDIA CUDA C++ Best Practices Guide
    Performance optimization, occupancy, memory access patterns
    docs.nvidia.com/cuda/cuda-c-best-practices-guide
  3. NVIDIA Hopper Architecture Whitepaper
    SM specifications, warpgroup operations, Tensor Core details
    resources.nvidia.com/en-us-hopper-architecture
  4. NVIDIA Blackwell Architecture
    Fifth-gen Tensor Cores, NVLink 5, 208B transistors
    nvidia.com/blackwell-architecture
  5. NVIDIA GB200 NVL72 Specifications
    Official Blackwell product specifications and performance data
    nvidia.com/data-center/gb200-nvl72
  6. NVIDIA H100 Tensor Core GPU Datasheet
    Hardware specifications: 256KB registers/SM, 228KB SMEM, 64 max warps
    H100 Datasheet (PDF)

Key Specifications with Sources

Specification | Value | Source
Warp size | 32 threads | CUDA Programming Guide, Ch. 4
Warpgroup (Hopper+) | 128 threads (4 warps) | Hopper Architecture Whitepaper
H100 registers/SM | 256KB (65,536 x 32-bit) | H100 Datasheet
H100 shared memory/SM | Up to 228KB configurable | H100 Datasheet
Max warps per SM | 64 (Hopper) | Hopper Architecture Whitepaper
Blackwell transistors | 208 billion | NVIDIA Blackwell Architecture
B200 HBM bandwidth | ~8 TB/s | GB200 NVL72 Specifications

Note: GPU specifications vary by SKU and configuration. Always verify against official NVIDIA datasheets for production use. Memory latencies are approximate and vary by access pattern and workload.