Optimization
Push your kernels to 80%+ of peak performance. Profiling, bank conflicts, Tensor Cores, and TMA—the techniques that separate good code from great code.
- Use Nsight Compute to identify kernel bottlenecks
- Apply tiling to improve cache utilization
- Optimize memory access patterns for coalescing
- Balance occupancy against per-thread resources
- Achieve >50% of theoretical memory bandwidth
This chapter assumes you've written tiled kernels; if not, see Chapter 3: First Kernels and Chapter 2: Memory Hierarchy first.
The hands-on exercises are in the Part 2: Optimization Labs, which run from profiling through an optimized GEMM.
Measure Before You Optimize
The first rule of optimization: don't guess. Use Nsight Compute to understand where time is actually spent.
# Profile a kernel
ncu --set full -o profile python my_kernel.py
# Key metrics to check:
# - Memory throughput (% of peak)
# - Compute throughput (% of peak)
# - Occupancy (warps/SM)
# - L1/L2 hit rates
# - Bank conflicts
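Once you know which kernel matters, you can restrict collection to a handful of metrics instead of the full set. A hedged example below: the metric names are the ones recent Nsight Compute versions use for DRAM throughput, SM throughput, and achieved occupancy, and the kernel-name filter `regex:matmul` is illustrative. Run `ncu --query-metrics` to confirm the names on your install.

```bash
# Speed-of-light view only: DRAM throughput, SM throughput, achieved occupancy
ncu --kernel-name "regex:matmul" \
    --metrics dram__throughput.avg.pct_of_peak_sustained_elapsed,sm__throughput.avg.pct_of_peak_sustained_elapsed,sm__warps_active.avg.pct_of_peak_sustained_active \
    python my_kernel.py
```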
Common Bottlenecks
| Symptom | Likely Cause | Fix |
|---|---|---|
| High memory latency | Uncoalesced access | Fix access patterns |
| Low occupancy | Too many registers | Reduce register pressure |
| Shared memory stalls | Bank conflicts | Pad arrays, change stride |
| Low compute % | Memory bound | Increase data reuse |
If you're at 80% of theoretical peak, you're done. The last 20% requires heroic effort and architecture-specific tricks. Ship it and move on.
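A quick way to tell whether a kernel is memory bound is to compare its arithmetic intensity (FLOPs per byte of HBM traffic) against the GPU's ridge point. A back-of-the-envelope sketch in Python, using approximate H100 SXM numbers (substitute your GPU's datasheet values):

```python
# Roofline back-of-the-envelope: is this kernel memory bound or compute bound?
peak_flops = 989e12   # FP16 Tensor Core peak, dense (approx., from the datasheet)
peak_bw = 3.35e12     # HBM3 bandwidth in bytes/s (approx.)
ridge = peak_flops / peak_bw          # FLOPs per byte needed to saturate compute (~295)

# Example: square FP16 GEMM, each matrix touched once in HBM
M = N = K = 4096
flops = 2 * M * N * K                        # multiply-adds count as 2 FLOPs
hbm_bytes = 2 * (M * K + K * N + M * N)      # 2 bytes per FP16 element
intensity = flops / hbm_bytes                # ~1365 FLOPs/byte -> compute bound

print(f"ridge point: {ridge:.0f} FLOPs/byte, kernel intensity: {intensity:.0f} FLOPs/byte")
```

Kernels whose intensity falls below the ridge point are memory bound; the fix is more data reuse, not more FLOPs.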
Shared Memory Banks
Shared memory is divided into 32 banks. Each bank can serve one address per cycle. When multiple threads in a warp access different addresses in the same bank, accesses serialize—this is a bank conflict.
Bank = (address / 4) % 32. Consecutive 4-byte words map to consecutive banks.
Conflict-Free
// Each thread hits different bank
smem[threadIdx.x]
// All 32 accesses parallel
N-way Conflict
// Stride of 32 = same bank!
smem[threadIdx.x * 32]
// Serialized: 32x slower
If multiple threads read the same address, it's a broadcast—no conflict. Conflicts only occur when threads access different addresses in the same bank.
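You can sanity-check an access pattern against the bank rule above without touching the GPU. A small Python model of bank = (address / 4) % 32 (the `banks` helper is illustrative), showing why a stride of 32 serializes and why padding the stride to 33 removes the conflict:

```python
# Model the 32-bank rule for 4-byte words: bank = word_index % 32
def banks(stride):
    """Bank index touched by each of the 32 threads in a warp for smem[tid * stride]."""
    return [(tid * stride) % 32 for tid in range(32)]

print(len(set(banks(1))))    # 32 distinct banks -> conflict-free
print(len(set(banks(32))))   # 1 bank  -> 32-way conflict, fully serialized
print(len(set(banks(33))))   # 32 distinct banks -> padding the row to 33 words fixes it
```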
Matrix Multiply-Accumulate Units
Tensor Cores are specialized hardware for matrix operations. They perform matrix multiply-accumulate (MMA) operations on small tiles in a single instruction, delivering roughly 8-16x the throughput of CUDA cores for compatible workloads.
Tensor Core Specs (H100 SXM)
| Format | Shape (M×N×K) | Peak (dense) |
|---|---|---|
| FP16 | 16×8×16 | ~990 TFLOPS |
| BF16 | 16×8×16 | ~990 TFLOPS |
| FP8 | 16×8×32 | ~1979 TFLOPS |
| INT8 | 16×8×32 | ~1979 TOPS |
Source: NVIDIA H100 Datasheet. MMA shapes from PTX ISA.
# Triton automatically uses Tensor Cores when:
# 1. Matrix dimensions align to MMA shapes
# 2. Data types are FP16, BF16, FP8, or INT8
# 3. Using tl.dot() operation
@triton.jit
def matmul_kernel(...):
# This uses Tensor Cores automatically
acc = tl.dot(a, b, acc) # MMA operation
Tensor Core performance requires alignment. Tile sizes should be multiples of MMA shapes (16, 32, etc.). Misaligned tiles fall back to slower CUDA core execution.
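Putting the alignment rules together, here is a minimal Triton matmul sketch. It assumes M, N, and K are multiples of the block sizes (so no masking), FP16 inputs, and FP32 accumulation; the kernel name and argument layout are illustrative. Block sizes such as 64×64×32 are multiples of the MMA shapes, so tl.dot can lower to Tensor Core instructions:

```python
import triton
import triton.language as tl

@triton.jit
def matmul_fp16_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                       stride_am, stride_ak, stride_bk, stride_bn,
                       stride_cm, stride_cn,
                       BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)   # FP32 accumulator
    for _ in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs)                 # FP16 tile of A
        b = tl.load(b_ptrs)                 # FP16 tile of B
        acc = tl.dot(a, b, acc)             # MMA on Tensor Cores when shapes/dtypes allow
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc.to(tl.float16))
```

Launch it over a 2-D grid of (M // BLOCK_M, N // BLOCK_N) programs; masking, tile grouping, and autotuning are left out for brevity.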
Tensor Memory Accelerator
Hopper introduced TMA—dedicated hardware for bulk data movement between HBM and shared memory. TMA offloads address computation from SMs, enabling asynchronous tile transfers.
Flow: global tensors → TMA handles address generation and transfer → tiles ready for compute in shared memory.
TMA vs Manual Loads
| Aspect | Manual (LDG/STG) | TMA |
|---|---|---|
| Address computation | Spends SM cycles | Offloaded to the TMA unit |
| Tile dimensions | Manual indexing | Descriptor-based |
| Multicast | Not available | Built-in (one load can feed multiple SMs) |
| Async | cp.async | Native, completion signaled via barriers |
TMA shines for structured tile access patterns (GEMM, convolution, attention). For irregular access, manual loads may still be necessary. CuTe and Triton abstract TMA through their copy operations.
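In Triton, descriptor-style tile access is expressed with block pointers; whether the compiler actually emits TMA depends on the Triton version and the target GPU, so treat this as a sketch of the pattern rather than a guarantee of TMA usage. The kernel below is illustrative (row sums of an M×K FP16 matrix into an FP32 vector, assuming M is a multiple of BLOCK_M):

```python
import triton
import triton.language as tl

@triton.jit
def row_tile_sum_kernel(a_ptr, out_ptr, M, K, stride_am, stride_ak,
                        BLOCK_M: tl.constexpr, BLOCK_K: tl.constexpr):
    pid = tl.program_id(0)
    # The block pointer bundles base, shape, strides, and tile shape: the same
    # metadata a TMA descriptor carries, instead of per-thread index arithmetic.
    a_block = tl.make_block_ptr(base=a_ptr, shape=(M, K),
                                strides=(stride_am, stride_ak),
                                offsets=(pid * BLOCK_M, 0),
                                block_shape=(BLOCK_M, BLOCK_K),
                                order=(1, 0))
    acc = tl.zeros((BLOCK_M,), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        tile = tl.load(a_block, boundary_check=(0, 1), padding_option="zero")  # bulk tile load
        acc += tl.sum(tile.to(tl.float32), axis=1)
        a_block = tl.advance(a_block, (0, BLOCK_K))                            # slide along K
    tl.store(out_ptr + pid * BLOCK_M + tl.arange(0, BLOCK_M), acc)
```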
Optimization Priority Order
Memory optimization follows a hierarchy of impact. Address these in order:
1. Memory Coalescing: Adjacent threads access adjacent addresses. Single biggest impact on bandwidth.
2. Data Reuse (Tiling): Load once from HBM, use many times from SMEM. Transforms memory-bound kernels into compute-bound ones.
3. Occupancy: Enough warps in flight to hide latency. Balance registers/SMEM against parallelism (see the autotuning sketch after this list).
4. Bank Conflicts: Avoid same-bank access in SMEM. Pad arrays or use different strides.
5. Tensor Cores + TMA: Hardware acceleration. Requires proper alignment and data types.
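In Triton, occupancy tuning usually comes down to sweeping block sizes, num_warps, and num_stages and letting the autotuner pick the best trade-off between parallelism and per-program resources. A minimal sketch on a simple copy kernel (kernel name and config values are illustrative, not recommendations):

```python
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK": 1024}, num_warps=4, num_stages=2),
        triton.Config({"BLOCK": 2048}, num_warps=8, num_stages=2),
        triton.Config({"BLOCK": 4096}, num_warps=8, num_stages=4),
    ],
    key=["n_elements"],   # re-tune when the problem size changes
)
@triton.jit
def copy_kernel(src_ptr, dst_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    tl.store(dst_ptr + offs, tl.load(src_ptr + offs, mask=mask), mask=mask)
```

Launch with a grid that reads the tuned block size, e.g. `grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK"]),)`.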
Performance Targets
| Metric | Good | Great |
|---|---|---|
| Memory bandwidth (% of peak) | 60% | 80%+ |
| Compute throughput (% of peak) | 50% | 70%+ |
| Throughput vs cuBLAS | 50% | 80%+ |
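To check a kernel against these targets, measure achieved bandwidth directly and divide by the datasheet number. A sketch using triton.testing.do_bench and a plain device-to-device copy; the peak value is the approximate H100 SXM HBM3 figure and should be replaced with your GPU's:

```python
import torch
import triton

PEAK_BYTES_PER_S = 3.35e12   # approx. H100 SXM HBM3 peak; use your GPU's datasheet value

x = torch.empty(256 * 1024 * 1024, device="cuda", dtype=torch.float16)
y = torch.empty_like(x)

ms = triton.testing.do_bench(lambda: y.copy_(x))    # median time in milliseconds
bytes_moved = 2 * x.numel() * x.element_size()      # one read + one write per element
achieved = bytes_moved / (ms * 1e-3)
print(f"{achieved / 1e9:.0f} GB/s, {100 * achieved / PEAK_BYTES_PER_S:.0f}% of peak")
```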
Hands-On Labs
Part 2 Labs
- Profiling - Learn Nsight Compute
- Coalescing - Optimize memory access patterns
- Bank Conflicts - Eliminate shared memory bottlenecks
- Pipelining - Overlap compute and memory
- TMA (Hopper+) - Hardware-accelerated data movement
- Tensor Cores - MMA operations
- Optimized GEMM - Put it all together
References
- Nsight Compute Documentation - NVIDIA's kernel profiler
- CUDA Programming Guide: Shared Memory - Bank conflict details
- PTX ISA: Warp-Level Matrix Instructions - MMA shapes and semantics
- Hopper Tuning Guide - TMA and H100-specific optimizations
- NVIDIA H100 Datasheet - Official specs and TFLOPS numbers
- NVIDIA Tensor Core Resources - Architecture whitepapers