Memory Hierarchy
GPU performance lives or dies by memory access patterns. Master the hierarchy from registers to HBM and learn the techniques that separate 10% from 80% hardware utilization.
- List the GPU memory types in order of speed and size
- Explain why memory bandwidth is often the bottleneck
- Demonstrate coalesced vs uncoalesced memory access patterns
- Identify and fix shared memory bank conflicts
- Calculate effective memory bandwidth for a given access pattern
Memory-Bound Reality
Modern GPUs have enormous compute capacity. A B200 delivers roughly 2.5 PFLOPS of dense FP16 tensor compute. But that compute is useless if you can't feed it data fast enough.
Most GPU workloads are memory-bound, not compute-bound. The arithmetic intensity (FLOPS per byte) of your kernel determines which regime you're in.
Arithmetic Intensity
Arithmetic intensity (AI) = FLOPs performed / bytes moved to and from memory
For a B200 with ~8 TB/s HBM bandwidth and ~2500 TFLOPS of FP16 compute:
Ridge point = 2500 TFLOPS / 8 TB/s = 312.5 FLOPs/byte
Below 312.5 FLOPs/byte: memory-bound (bandwidth limited)
Above 312.5 FLOPs/byte: compute-bound (can saturate the ALUs)
GEMM has high AI (hundreds of FLOPs per byte). Element-wise ops have very low AI (well under one FLOP per byte).
Peak bandwidth is theoretical. Achieving even 80% requires perfect access patterns. Random access can drop effective bandwidth by 10-100x.
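To see where a kernel lands relative to the ridge point, here is a minimal host-side sketch of the calculation (not from the original material): the peak figures are the approximate B200 numbers used above, and the element-wise and GEMM byte counts assume ideal data reuse.

```cuda
#include <cstdio>

// Approximate peak figures used in this section (B200, FP16); adjust for your GPU.
constexpr double PEAK_TFLOPS = 2500.0;  // ~2.5 PFLOPS FP16
constexpr double PEAK_TBPS   = 8.0;     // ~8 TB/s HBM3e bandwidth
constexpr double RIDGE_FLOP_PER_BYTE = (PEAK_TFLOPS * 1e12) / (PEAK_TBPS * 1e12);  // 312.5

// Arithmetic intensity = useful FLOPs / bytes moved to and from HBM.
double arithmetic_intensity(double flops, double bytes_moved) {
    return flops / bytes_moved;
}

int main() {
    // FP16 element-wise add: 1 FLOP per element, 6 bytes per element (two loads + one store).
    double ai_elementwise = arithmetic_intensity(1.0, 6.0);   // ~0.17 FLOP/byte

    // Large FP16 GEMM (N=4096): 2*N^3 FLOPs, ~3*N^2*2 bytes with ideal reuse.
    double n = 4096.0;
    double ai_gemm = arithmetic_intensity(2.0 * n * n * n, 3.0 * n * n * 2.0);  // ~1365 FLOP/byte

    printf("ridge point: %.1f FLOP/byte\n", RIDGE_FLOP_PER_BYTE);
    printf("elementwise add: %.2f FLOP/byte -> memory-bound\n", ai_elementwise);
    printf("ideal GEMM (N=4096): %.0f FLOP/byte -> compute-bound\n", ai_gemm);
    return 0;
}
```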
Four Levels of Memory
GPU memory forms a hierarchy that trades capacity for speed. Each level has distinct characteristics and use cases, described tier by tier below.
Registers: per-thread private storage. Instant, like recalling your own name.
Characteristics:
- Each thread has private registers (up to 255 per thread)
- Very low latency (~1 cycle read-after-write)
- Compiler-managed allocation
- Register pressure limits occupancy
Usage: Loop variables, intermediate results, frequently accessed values. The compiler will spill to local memory (slow!) if you use too many.
Shared Memory: block-level scratchpad. A quick glance at a sticky note.
Characteristics:
- Shared among all threads in a block
- On-chip SRAM with very low latency
- Organized into 32 banks (bank conflicts possible)
- Configurable L1/SMEM split on some architectures
Usage: Inter-thread communication, tiled algorithms, reused data. Essential for efficient matrix multiplication.
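As an illustration of the "tiled algorithms, reused data" point above, here is a minimal sketch (the names and TILE size are illustrative, not from the original text) of staging a 1D tile through shared memory for a 3-point stencil: each value is loaded from HBM once and then read three times from SMEM. Launch with blockDim.x == TILE.

```cuda
#define TILE 256

__global__ void stencil3(const float* __restrict__ in, float* __restrict__ out, int n) {
    __shared__ float s[TILE + 2];                  // tile plus one halo element on each side
    int gid = blockIdx.x * TILE + threadIdx.x;     // global element this thread owns

    // Cooperative, coalesced load into shared memory (zero-pad out-of-range elements).
    s[threadIdx.x + 1] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x == 0)
        s[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == TILE - 1)
        s[TILE + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();                               // all loads visible before any reuse

    // Each value fetched once from HBM is now read three times from SMEM.
    if (gid < n)
        out[gid] = (s[threadIdx.x] + s[threadIdx.x + 1] + s[threadIdx.x + 2]) / 3.0f;
}
```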
L2 Cache: chip-wide cache, hardware-managed.
Characteristics:
- Shared across all SMs on the GPU
- Automatic caching of global memory
- Can use persistence hints (cudaAccessPolicyWindow)
- Helps with irregular access patterns
Usage: Automatic caching. Can hint persistence for working sets that fit. Effective for data reused across blocks.
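The persistence hint mentioned above is set per stream through the CUDA runtime. A minimal host-side sketch, assuming CUDA 11.0+ and an existing stream; the function name, freq_ptr, freq_bytes, and the 0.6 hit ratio are placeholder values.

```cuda
#include <cuda_runtime.h>

// Sketch: hint that `freq_bytes` of frequently reused data at `freq_ptr` should persist
// in L2 for kernels launched on `stream`. All values here are illustrative.
void set_l2_persistence_hint(cudaStream_t stream, void* freq_ptr, size_t freq_bytes) {
    // Optionally reserve part of L2 for persisting accesses first:
    // cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, freq_bytes);

    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = freq_ptr;                      // start of the hot working set
    attr.accessPolicyWindow.num_bytes = freq_bytes;                    // should fit the persisting carve-out
    attr.accessPolicyWindow.hitRatio  = 0.6f;                          // fraction of accesses given the hint
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;  // try to keep these lines resident
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;   // everything else streams through
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}
```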
Global Memory (HBM): main GPU memory. A walk to a filing cabinet in another room.
Characteristics:
- HBM3e with ~8 TB/s peak bandwidth (B200)
- Accessed via 128-byte cache lines
- Coalescing critical for performance
- Latency hidden by warp scheduling
Usage: Input/output tensors, large datasets. Access patterns determine whether you get 10% or 90% of peak bandwidth.
Register access: ~1 cycle. HBM access: ~400 cycles. This enormous gap (hundreds of cycles) is why data reuse dominates GPU optimization. Every byte loaded from HBM should be used as many times as possible before eviction.
Cycle counts shown above are approximate and vary by architecture, workload, and memory access patterns. Actual latency depends on factors like cache hits, bank conflicts, and memory coalescing. For precise measurements, refer to Jia et al.'s GPU microbenchmarking or use Nsight Compute profiling.
Why is the 128-byte cache line important? A warp of 32 threads each accessing a 4-byte word covers exactly 128 bytes, so a perfectly coalesced access maps onto a single cache-line transaction.
Memory Access Patterns
When threads in a warp access global memory, the hardware combines (coalesces) their requests into as few transactions as possible. Proper coalescing is the single most important optimization for memory-bound kernels.
Understanding thread indexing is key. Need a refresher? See Prerequisites: Index Arithmetic
Good: Coalesced
// Thread i accesses element i
data[threadIdx.x]
// 1 transaction for 32 threads
Bad: Strided
// Thread i accesses element i*32
data[threadIdx.x * 32]
// 32 transactions! (worst case)
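A pair of toy copy kernels makes the difference concrete (a sketch with illustrative names; the two kernels only demonstrate the access patterns, they do not do equivalent work per launch): the coalesced version lets a warp's 32 4-byte loads collapse into one 128-byte transaction, while the strided version touches up to 32 different cache lines per warp.

```cuda
// Coalesced: consecutive threads read consecutive 4-byte floats, so a 32-thread
// warp's loads collapse into a single 128-byte cache-line transaction.
__global__ void copy_coalesced(const float* __restrict__ in, float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads are 32 elements (128 bytes) apart, so each lane in a
// warp lands in a different cache line: up to 32 transactions for one load instruction.
__global__ void copy_strided(const float* __restrict__ in, float* __restrict__ out, int n) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 32;
    if (i < n) out[i] = in[i];
}
```

Running both under Nsight Compute shows the gap directly in achieved DRAM throughput and in sectors per request.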
Shared Memory Banks
Shared memory is divided into 32 banks. Each bank can serve one address per cycle. When multiple threads in a warp access different addresses in the same bank, accesses serialize—this is a bank conflict.
Consecutive 4-byte words map to consecutive banks. Bank = (address / 4) % 32.
Conflict-Free
// Each thread hits different bank
smem[threadIdx.x]
// All 32 accesses parallel
N-way Conflict
// Stride of 32 = same bank!
smem[threadIdx.x * 32]
// Serialized: 32x slower
If multiple threads read the same address, it's a broadcast—no conflict. Conflicts only occur when threads access different addresses in the same bank.
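The standard fix for the strided case above is to pad the tile's leading dimension so column accesses spread across banks. A minimal single-block 32x32 transpose sketch (illustrative names; launch with a dim3(32, 32) block):

```cuda
#define TILE_DIM 32

__global__ void transpose_tile(const float* __restrict__ in, float* __restrict__ out) {
    // Without the +1, the column-wise read below would stride by 32 floats and hit a
    // single bank 32 times (a 32-way conflict). Padding shifts each row by one bank.
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = threadIdx.x, y = threadIdx.y;
    tile[y][x] = in[y * TILE_DIM + x];     // coalesced global read, conflict-free SMEM write
    __syncthreads();

    out[y * TILE_DIM + x] = tile[x][y];    // conflict-free SMEM read (stride 33), coalesced global write
}
```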
Tensor Memory Accelerator
Hopper introduced TMA—dedicated hardware for bulk data movement between HBM and shared memory. TMA offloads address computation from SMs, enabling asynchronous tile transfers.
Data flow: global tensors in HBM → TMA unit (address generation + transfer) → tiles in shared memory, ready for compute.
TMA vs Manual Loads
| Aspect | Manual (LDG/STG) | TMA |
|---|---|---|
| Address computation | SM cycles | TMA unit (free) |
| Tile dimensions | Manual indexing | Descriptor-based |
| Multicast | Not available | Built in (one load can feed multiple SMs in a cluster) |
| Async | cp.async | Native, with barriers |
// TMA descriptor setup (host side): encode a tiled view of a 2D FP16 tensor.
CUtensorMap tensor_map;
uint64_t global_dims[2]    = {N, K};                // Global tensor shape
uint64_t global_strides[1] = {lda * sizeof(half)};  // Row stride in bytes (innermost stride is implicit)
uint32_t box_dims[2]       = {TILE_N, TILE_K};      // Shape of one tile (the "box")
cuTensorMapEncodeTiled(&tensor_map,
    CU_TENSOR_MAP_DATA_TYPE_FLOAT16,
    2,            // 2D tensor
    global_ptr,   // Base address in global memory
    global_dims,
    global_strides,
    box_dims,
    ...);         // Element strides, interleave, swizzle, L2 promotion, OOB fill
// Kernel: single instruction loads entire tile
cp.async.bulk.tensor.2d.shared::cluster.global.tile
[smem_ptr], [tensor_map, {tile_x, tile_y}];
TMA shines for structured tile access patterns (GEMM, convolution, attention). For irregular access, manual loads may still be necessary. CuTe abstracts TMA through its copy operations.
Putting It Together
Memory optimization follows a hierarchy of impact. Address these in order:
Optimization Priority
1. Coalescing (10-100x impact): ensure warps access contiguous memory; transpose data if needed.
2. Data Reuse via SMEM (2-10x impact): tile algorithms to maximize reuse. Classic example: GEMM tiling.
3. Bank Conflict Elimination (1.1-2x impact): pad shared memory arrays or use conflict-free access patterns.
4. Occupancy Tuning (1.1-1.5x impact): balance register/SMEM usage against parallelism.
Use ncu (Nsight Compute) to measure actual memory throughput, cache hit rates,
and bank conflicts. Theoretical analysis guides optimization; profiling validates it.
What if register spilling occurs? Spills show up as local memory (lmem) traffic in ncu output. Fix by reducing the number of live values, using __launch_bounds__ to guide the compiler's register budget (see the sketch below), or accepting lower occupancy in exchange for more registers per thread.
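For reference, a minimal sketch of __launch_bounds__ (the 256/4 values and the kernel body are placeholders): the first argument caps the block size and the optional second argument requests a minimum number of resident blocks per SM, which together tell the compiler how many registers per thread it may use before it must spill.

```cuda
// Sketch: __launch_bounds__(maxThreadsPerBlock, minBlocksPerSM). Removing or lowering
// the second argument trades occupancy for more registers per thread.
__global__ void __launch_bounds__(256, 4)
scale_kernel(const float* __restrict__ in, float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;   // placeholder body
}
```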
Citations & Further Reading
Video Resources
Deep dives into GPU memory optimization:
- Comprehensive introduction to NVIDIA's GPU parallel programming architecture, including the memory hierarchy (YouTube).
- Practical walkthrough of coalescing, bank conflicts, and memory access patterns (YouTube).
Primary Documentation
- NVIDIA CUDA C++ Programming Guide. Chapter 5: Memory Hierarchy (registers, shared memory, global memory). docs.nvidia.com/cuda/cuda-c-programming-guide
- NVIDIA CUDA C++ Best Practices Guide. Memory Optimizations: coalescing, bank conflicts, caching. docs.nvidia.com/cuda/cuda-c-best-practices-guide
- NVIDIA Hopper Architecture Whitepaper. TMA (Tensor Memory Accelerator), asynchronous copy, distributed shared memory. resources.nvidia.com/en-us-hopper-architecture
- NVIDIA GB200 NVL72 Specifications. HBM3e bandwidth: 576 TB/s total across 72 GPUs = ~8 TB/s per GPU. nvidia.com/data-center/gb200-nvl72
- NVIDIA Nsight Compute Documentation. Memory throughput analysis, profiling, bottleneck identification. docs.nvidia.com/nsight-compute
Key Specifications with Sources
| Specification | Value | Source |
|---|---|---|
| Shared memory banks | 32 | CUDA Programming Guide, Ch. 5 |
| Cache line size | 128 bytes | CUDA Best Practices Guide |
| Max registers per thread | 255 | CUDA Programming Guide |
| B200 HBM3e bandwidth | ~8 TB/s per GPU | GB200 NVL72 Specifications |
| H100 shared memory | Up to 228 KB per SM | H100 Datasheet |
| HBM memory latency | ~400-800 cycles (varies by access pattern) | GPU microbenchmarking (Jia et al.) |
Memory latency figures are approximations. Actual latencies depend on
access patterns, cache behavior, memory contention, and specific workloads.
Always use ncu (Nsight Compute) to profile your actual kernels.