Chapter 1

GPU Fundamentals

Understanding the hardware execution model. From SMs to warps to threads: learn how modern NVIDIA GPUs actually execute your code.

What You'll Learn
  1. Explain why GPUs optimize for throughput over latency
  2. Describe the SM → Warp → Thread hierarchy
  3. Calculate theoretical occupancy given resource constraints
  4. Identify warp divergence in code and explain its performance impact
  5. Interpret basic GPU specifications (SM count, memory bandwidth)
01 - THE BIG PICTURE

Throughput Over Latency

CPUs optimize for latency—making single tasks fast. GPUs optimize for throughput—completing many tasks in parallel, even if each individual task takes longer.

A CPU might have 8-16 cores running at 4+ GHz with massive caches. A GPU has thousands of smaller cores running at ~2 GHz with limited cache per core. The magic is in the parallelism.

  • 192GB HBM3e memory
  • ~8 TB/s memory bandwidth
  • ~2.5 GHz clock speed

192 SMs = 192 mini-processors running thousands of threads each
8 TB/s = every book ever written in under 2 seconds

*B200 specifications from NVIDIA Blackwell Architecture. Verify against official datasheets for production use.

Key Insight

GPU programming is about keeping thousands of threads busy. When one thread waits for memory, others execute. This latency hiding is fundamental to GPU performance.

A GPU is faster than a CPU at parallel workloads because of:
Higher clock speed
More parallel execution units
Larger cache per core

02 - EXECUTION HIERARCHY

GPU → SM → Warp → Thread

NVIDIA GPUs have a strict hierarchy. Understanding each level is essential for writing efficient kernels.

GPU Device

Blackwell B200: up to 192 SMs, 192GB HBM3e, ~8TB/s bandwidth*

The top-level device. What matters most: SM count (parallel execution units) and memory bandwidth (data throughput). More SMs = more parallelism. Higher bandwidth = faster data movement.
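
To read these numbers off whatever GPU you are actually running, the CUDA runtime can report them at startup. A minimal sketch (queries device 0, error checking omitted); note that peak memory bandwidth is not reported directly, so for a specific part that figure still comes from the datasheet:

  #include <cstdio>
  #include <cuda_runtime.h>

  int main() {
      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, 0);   // query device 0

      printf("GPU:              %s\n", prop.name);
      printf("SMs:              %d\n", prop.multiprocessorCount);
      printf("Global memory:    %.1f GB\n", prop.totalGlobalMem / 1e9);
      printf("Warp size:        %d\n", prop.warpSize);
      printf("Max warps per SM: %d\n", prop.maxThreadsPerMultiProcessor / prop.warpSize);
      return 0;
  }
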
Streaming Multiprocessor (SM)

The fundamental execution unit

Each SM contains: 4 warp schedulers, ~256KB registers, 228KB shared memory, L1 cache, CUDA cores, and Tensor Cores. Multiple thread blocks can run on one SM simultaneously (limited by resources).
Warp: 32 Threads in Lockstep

The atomic scheduling unit (SIMT), like a marching band playing in unison

This is the most important concept. All 32 threads in a warp execute the same instruction simultaneously. Divergent branches serialize execution. Always think in warps, not individual threads.
Warpgroup (Hopper+)

4 warps (128 threads) for Tensor Core operations

Modern Tensor Core operations work at warpgroup granularity. This enables larger tiles (64x64+), asynchronous execution, and better hardware utilization. CuTe DSL uses warpgroups for SM90+ architectures.
Thread

Individual execution context with private registers

Each thread has its own registers and local memory. Threads within a warp can communicate via shuffle instructions. Threads within a block can communicate via shared memory and synchronize with __syncthreads().
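
The sketch below (a hypothetical blockSum kernel, assuming blockDim.x is a multiple of 32 and at most 1024) shows both communication paths: lanes within a warp exchange values directly with __shfl_down_sync, and the per-warp results are then combined through shared memory behind a __syncthreads() barrier.

  // Hypothetical kernel: each warp sums its 32 values with shuffle instructions,
  // then the per-warp partial sums are combined through shared memory.
  __global__ void blockSum(const float* in, float* out) {
      __shared__ float warpSums[32];              // one slot per warp in the block

      int tid  = blockIdx.x * blockDim.x + threadIdx.x;
      int lane = threadIdx.x % 32;                // position within the warp
      int warp = threadIdx.x / 32;                // warp index within the block

      float val = in[tid];

      // Warp-level: lanes exchange values directly, no shared memory or barrier needed.
      for (int offset = 16; offset > 0; offset /= 2)
          val += __shfl_down_sync(0xffffffff, val, offset);

      if (lane == 0) warpSums[warp] = val;        // lane 0 now holds the warp's sum
      __syncthreads();                            // block-level: wait for every warp

      // The first warp reduces the per-warp partial sums the same way.
      if (warp == 0) {
          val = (lane < blockDim.x / 32) ? warpSums[lane] : 0.0f;
          for (int offset = 16; offset > 0; offset /= 2)
              val += __shfl_down_sync(0xffffffff, val, offset);
          if (lane == 0) out[blockIdx.x] = val;
      }
  }
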
Why does the warp size matter?
The warp size of 32 is a hardware constant dating back to early NVIDIA GPUs. Memory transactions, shuffle operations, and voting all happen at warp granularity. Block sizes should always be multiples of 32 to avoid partial warps (wasted threads). The 128-thread warpgroup on Hopper+ exists specifically to match Tensor Core tile sizes.
03 - WARPS

The Warp: GPU's Atomic Unit

A warp is 32 threads that execute in SIMT (Single Instruction, Multiple Threads) fashion. Every thread in a warp executes the same instruction, but on different data.

When threads in a warp take different branches (if/else), execution serializes. Both paths run, with inactive threads masked. 16 threads per path = 50% throughput. Minimize divergence within warps for maximum performance.
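
As an illustration (hypothetical kernels, not taken from the labs), the two kernels below do equivalent work. The first branches on the lane index, so every warp diverges and executes both paths; the second branches on the warp index, so each warp follows a single path.

  // Divergent: the branch splits lanes inside every warp, so each warp runs
  // both paths with half its lanes masked off each time (about half throughput).
  __global__ void divergent(float* data) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (threadIdx.x % 2 == 0)
          data[i] *= 2.0f;
      else
          data[i] += 1.0f;
  }

  // Warp-uniform: the condition depends only on the warp index, so all 32 lanes
  // of any given warp take the same path and nothing is serialized.
  __global__ void warpUniform(float* data) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if ((threadIdx.x / 32) % 2 == 0)
          data[i] *= 2.0f;
      else
          data[i] += 1.0f;
  }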

A warp contains how many threads?
16 threads
32 threads
64 threads
128 threads (that's a warpgroup)
If you launch 256 threads per block, how many warps is that?
4 warps
8 warps (256 / 32 = 8)
16 warps

04 - BLOCKS & GRIDS

Organizing Your Parallelism

Block (also called CTA - Cooperative Thread Array): A group of warps that share resources and can synchronize. Blocks run on a single SM.

Grid: Your problem decomposition into blocks. The grid is how you map your problem to the GPU's parallel execution model.

Block Composition
Grid (your problem)
 ├── Block 0 (256 threads = 8 warps)
 │    ├── Warp 0: threads 0-31
 │    ├── Warp 1: threads 32-63
 │    ├── ...
 │    ├── Warp 7: threads 224-255
 │    ├── [Shared Memory: 48KB]
 │    └── [Can __syncthreads()]
 ├── Block 1 (256 threads = 8 warps)
 ├── Block 2 ...
 └── Block N-1

Block Size Selection

Common block sizes and their trade-offs:

  • 128 threads (4 warps): 1 warpgroup, good for Tensor Core ops
  • 256 threads (8 warps): Balanced, most common choice
  • 512 threads (16 warps): More parallelism, higher register pressure

Block size affects: shared memory per thread, register availability, and occupancy.
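
A minimal launch sketch under those guidelines (the scale kernel and its arguments are illustrative): pick a block size that is a multiple of 32, compute the grid size with ceiling division so every element is covered, and bounds-check the partial last block.

  // Illustrative element-wise kernel: one thread per array element.
  __global__ void scale(float* data, float factor, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
      if (i < n)                                       // guard the partial last block
          data[i] *= factor;
  }

  void launchScale(float* d_data, float factor, int n) {
      int blockSize = 256;                              // 8 warps: a common balanced choice
      int gridSize  = (n + blockSize - 1) / blockSize;  // ceiling division covers all n elements
      scale<<<gridSize, blockSize>>>(d_data, factor, n);
  }
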

What can threads within the same block do that threads in different blocks cannot?
Access global memory
Share data via shared memory and synchronize
Run on the same SM
Use Tensor Cores

05 - OCCUPANCY

Keeping the GPU Busy

Occupancy is the ratio of active warps to the maximum warps an SM can support. Higher occupancy generally means better latency hiding, but it's not the only factor in performance.

Occupancy Calculation

Occupancy is determined by three resource inputs: block size, registers per thread, and shared memory per block. Each one caps how many blocks (and therefore warps) can be resident on an SM at once, and the tightest limit wins. For example, a configuration whose tightest limit allows 32 active warps against a hardware maximum of 64 warps per SM gives 32 / 64 = 50% occupancy.

Simplified model based on Hopper SM (64 max warps, 256KB registers, 228KB shared memory).
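
That simplified model can be written down directly. This host-side sketch hard-codes the Hopper-like limits above and ignores real-world details such as the per-SM block limit and register allocation granularity, so treat it as an estimate only:

  #include <algorithm>
  #include <climits>
  #include <cstdio>

  // Theoretical active warps per SM under the simplified model above:
  // 64 max warps, 65,536 32-bit registers (256KB), 228KB shared memory.
  int activeWarpsPerSM(int blockSize, int regsPerThread, int smemPerBlock) {
      const int maxWarps = 64, regsPerSM = 65536, smemPerSM = 228 * 1024;

      int warpsPerBlock = (blockSize + 31) / 32;
      int byRegs  = regsPerSM / std::max(1, regsPerThread * blockSize);
      int bySmem  = smemPerBlock > 0 ? smemPerSM / smemPerBlock : INT_MAX;
      int byWarps = maxWarps / warpsPerBlock;

      int residentBlocks = std::min({byRegs, bySmem, byWarps});  // tightest limit wins
      return residentBlocks * warpsPerBlock;
  }

  int main() {
      int active = activeWarpsPerSM(256, 64, 32 * 1024);  // 256 threads, 64 regs, 32KB smem
      printf("active warps: %d, occupancy: %.0f%%\n", active, 100.0 * active / 64);
      return 0;
  }

With 256 threads per block, 64 registers per thread, and 32KB of shared memory per block, this reports 32 active warps, i.e. the 50% occupancy in the example above. For real kernels, the runtime API cudaOccupancyMaxActiveBlocksPerMultiprocessor applies the actual per-architecture limits.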

When is low occupancy actually better?
Compute-bound kernels with high instruction-level parallelism (ILP) often perform better at 50% occupancy with more registers per thread. More registers = fewer spills to local memory, more values kept in fast storage. Use --maxrregcount in nvcc or occupancy calculators to experiment. Always profile with real workloads.
Occupancy vs Performance

100% occupancy doesn't guarantee best performance. Sometimes using more registers (lower occupancy) enables better instruction-level parallelism. Profile your actual kernel to find the sweet spot.
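
Two knobs for experimenting with that trade-off (the numbers below are illustrative, not recommendations): the --maxrregcount compiler flag mentioned earlier, and the __launch_bounds__ qualifier, which tells the compiler the launch configuration you intend so it can budget registers per thread accordingly.

  // Cap registers for every kernel in a file (illustrative value):
  //   nvcc --maxrregcount=64 kernel.cu
  //
  // Or hint per kernel: at most 256 threads per block, and ask the compiler to
  // budget registers so at least 4 blocks can be resident per SM.
  __global__ void __launch_bounds__(256, 4)
  scaleKernel(float* data) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      data[i] *= 2.0f;
  }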

Which resource does NOT directly limit occupancy?
Registers per thread
Shared memory per block
Global memory size

PRACTICE

Hands-On Labs

REFERENCES

Citations & Further Reading

Video Resources

High-quality explanations of GPU architecture concepts.

How do Graphics Cards Work? (Branch Education)

Excellent visual explanation of GPU architecture fundamentals, parallelism, and memory hierarchy.

Watch on YouTube
CUDA Programming Model (NVIDIA Developer)

Official NVIDIA explanation of threads, blocks, grids, and the CUDA execution model.

Read: CUDA Refresher Series - developer.nvidia.com

Primary Documentation

  1. NVIDIA CUDA C++ Programming Guide
    Chapters 4-5: Thread Hierarchy, SIMT Architecture, Memory Hierarchy
    docs.nvidia.com/cuda/cuda-c-programming-guide
  2. NVIDIA CUDA C++ Best Practices Guide
    Performance optimization, occupancy, memory access patterns
    docs.nvidia.com/cuda/cuda-c-best-practices-guide
  3. NVIDIA Hopper Architecture Whitepaper
    SM specifications, warpgroup operations, Tensor Core details
    resources.nvidia.com/en-us-hopper-architecture
  4. NVIDIA Blackwell Architecture
    Fifth-gen Tensor Cores, NVLink 5, 208B transistors
    nvidia.com/blackwell-architecture
  5. NVIDIA GB200 NVL72 Specifications
    Official Blackwell product specifications and performance data
    nvidia.com/data-center/gb200-nvl72
  6. NVIDIA H100 Tensor Core GPU Datasheet
    Hardware specifications: 256KB registers/SM, 228KB SMEM, 64 max warps
    H100 Datasheet (PDF)

Key Specifications with Sources

Specification | Value | Source
Warp size | 32 threads | CUDA Programming Guide, Ch. 4
Warpgroup (Hopper+) | 128 threads (4 warps) | Hopper Architecture Whitepaper
H100 registers/SM | 256KB (65,536 x 32-bit) | H100 Datasheet
H100 shared memory/SM | Up to 228KB configurable | H100 Datasheet
Max warps per SM | 64 (Hopper) | Hopper Architecture Whitepaper
Blackwell transistors | 208 billion | NVIDIA Blackwell Architecture
B200 HBM bandwidth | ~8 TB/s | GB200 NVL72 Specifications

Note: GPU specifications vary by SKU and configuration. Always verify against official NVIDIA datasheets for production use. Memory latencies are approximate and vary by access pattern and workload.