Chapter 4

Optimization

Push your kernels to 80%+ of peak performance. Profiling, bank conflicts, Tensor Cores, and TMA—the techniques that separate good code from great code.

Building on Chapter 3
You can write working kernels now. But working isn't enough—you need fast. Chapter 2's memory hierarchy becomes your optimization target. Chapter 3's kernel structure becomes your canvas for applying tiling and access pattern fixes.
What You'll Learn
  1. Use Nsight Compute to identify kernel bottlenecks
  2. Apply tiling to improve cache utilization
  3. Optimize memory access patterns for coalescing
  4. Balance occupancy against per-thread resources
  5. Achieve >50% of theoretical memory bandwidth
📚 Prerequisites

This chapter assumes you've written working kernels and understand the memory hierarchy. Chapter 3: First Kernels | Chapter 2: Memory Hierarchy

01 — PROFILING

Measure Before You Optimize

The first rule of optimization: don't guess. Use Nsight Compute to understand where time is actually spent.

```bash
# Profile a kernel
ncu --set full -o profile python my_kernel.py

# Key metrics to check:
# - Memory throughput (% of peak)
# - Compute throughput (% of peak)
# - Occupancy (warps/SM)
# - L1/L2 hit rates
# - Bank conflicts
```
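Nsight Compute reports throughput percentages directly, but it's worth knowing how to sanity-check them. Below is a minimal sketch, assuming PyTorch on a CUDA device; the helper name `achieved_bandwidth_gbs` and the example peak figure are ours, not profiler output.

```python
import torch

def achieved_bandwidth_gbs(fn, bytes_moved, iters=100):
    """Time a GPU callable with CUDA events and return achieved GB/s."""
    for _ in range(10):  # warm up (JIT, caches, clocks)
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    ms = start.elapsed_time(end) / iters  # average ms per call
    return bytes_moved / (ms * 1e-3) / 1e9

# Example: a copy moves 2 * N * 4 bytes (one fp32 read + one fp32 write)
x = torch.randn(1 << 24, device="cuda")
y = torch.empty_like(x)
bw = achieved_bandwidth_gbs(lambda: y.copy_(x), bytes_moved=2 * x.numel() * 4)
print(f"{bw:.0f} GB/s")  # divide by your GPU's peak (~3350 GB/s on H100 SXM) for % of peak
```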

Common Bottlenecks

| Symptom | Likely Cause | Fix |
|---|---|---|
| High memory latency | Uncoalesced access | Fix access patterns |
| Low occupancy | Too many registers | Reduce register pressure |
| SMEM conflicts | Bank conflicts | Pad arrays, change stride |
| Low compute % | Memory bound | Increase data reuse |
The 80% Rule

If you're at 80% of theoretical peak, you're done. The last 20% requires heroic effort and architecture-specific tricks. Ship it and move on.

Quick check: What's the most important metric for memory-bound kernels?
  1. Occupancy
  2. Memory throughput (% of peak) ✓
  3. Compute throughput

02 — BANK CONFLICTS

Shared Memory Banks

Shared memory is divided into 32 banks. Each bank can serve one address per cycle. When multiple threads in a warp access different addresses in the same bank, accesses serialize—this is a bank conflict.

Bank mapping: bank = (address / 4) % 32. Consecutive 4-byte words map to consecutive banks. The two access patterns below illustrate the extremes.

Conflict-free:

```cuda
// Each thread hits a different bank
smem[threadIdx.x]
// All 32 accesses proceed in parallel
```

32-way conflict:

```cuda
// A stride of 32 lands every thread in the same bank
smem[threadIdx.x * 32]
// Serialized: 32x slower
```
Broadcast Exception

If multiple threads read the same address, it's a broadcast—no conflict. Conflicts only occur when threads access different addresses in the same bank.
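The bank formula is easy to check numerically. Here's a small standalone Python sketch (no GPU needed; the helper `conflict_degree` is ours) that computes the worst-case serialization factor for a warp at a given element stride, counting same-address accesses as broadcasts:

```python
from collections import defaultdict

def conflict_degree(stride_elems, elem_bytes=4, banks=32):
    """Worst-case serialization for a warp accessing smem[t * stride_elems]."""
    addrs = [t * stride_elems * elem_bytes for t in range(32)]
    per_bank = defaultdict(set)
    for a in addrs:
        per_bank[(a // 4) % banks].add(a)  # bank = (address / 4) % 32
    # Identical addresses broadcast, so count distinct addresses per bank
    return max(len(s) for s in per_bank.values())

for stride in (1, 2, 16, 32, 33):
    print(f"stride {stride:2d}: {conflict_degree(stride):2d}-way")
# stride  1:  1-way | stride  2:  2-way | stride 16: 16-way
# stride 32: 32-way | stride 33:  1-way
```

Note that stride 33 comes out conflict-free, which is exactly why padding a 32-column shared tile to 33 columns is the standard fix.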

Quick check: What stride causes maximum bank conflicts (32-way)?
  1. Stride of 1 (consecutive access)
  2. Stride of 16
  3. Stride of 32, or any multiple of 32 ✓
  4. Stride of 33
03 — TENSOR CORES

Matrix Multiply-Accumulate Units

Tensor Cores are specialized hardware units for matrix operations. Each performs a matrix multiply-accumulate (MMA) on a small tile in a single instruction, delivering roughly 8-16x more throughput than CUDA cores on compatible workloads.

Tensor Core Specs (H100 SXM)

| Format | MMA Shape (M×N×K) | Peak Throughput (dense) |
|---|---|---|
| FP16 | 16×8×16 | ~990 TFLOPS |
| BF16 | 16×8×16 | ~990 TFLOPS |
| FP8 | 16×8×32 | ~1979 TFLOPS |
| INT8 | 16×8×32 | ~1979 TOPS |

Source: NVIDIA H100 Datasheet. MMA shapes from PTX ISA.

```python
# Triton automatically uses Tensor Cores when:
#   1. Matrix dimensions align to MMA shapes
#   2. Data types are FP16, BF16, FP8, or INT8
#   3. The kernel uses the tl.dot() operation

@triton.jit
def matmul_kernel(...):
    ...
    # This uses Tensor Cores automatically
    acc = tl.dot(a, b, acc)  # MMA operation
```
Alignment Requirements

Tensor Core performance requires alignment. Tile sizes should be multiples of MMA shapes (16, 32, etc.). Misaligned tiles fall back to slower CUDA core execution.
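To make the fragment above concrete, here is a minimal, untuned Triton matmul sketch. The kernel body is ours (the standard Triton pattern, not this book's lab solution); the block sizes are multiples of 16 so tl.dot maps cleanly onto MMA shapes.

```python
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + rm[:, None] * stride_am + rk[None, :] * stride_ak
    b_ptrs = b_ptr + rk[:, None] * stride_bk + rn[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)  # fp32 accumulator
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs, mask=(rm[:, None] < M) & (rk[None, :] + k < K), other=0.0)
        b = tl.load(b_ptrs, mask=(rk[:, None] + k < K) & (rn[None, :] < N), other=0.0)
        acc = tl.dot(a, b, acc)  # MMA on Tensor Cores for fp16/bf16 inputs
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + rm[:, None] * stride_cm + rn[None, :] * stride_cn
    c_mask = (rm[:, None] < M) & (rn[None, :] < N)
    tl.store(c_ptrs, acc.to(c_ptr.dtype.element_ty), mask=c_mask)

# Launch (fp16 inputs): grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
# matmul_kernel[grid](a, b, c, M, N, K, *a.stride(), *b.stride(), *c.stride(),
#                     BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
```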

Quick check: Tiling improves performance by:
  1. Using more threads
  2. Reusing data in fast memory (SMEM) ✓
  3. Reducing computation

04 — TMA

Tensor Memory Accelerator

Hopper introduced TMA—dedicated hardware for bulk data movement between HBM and shared memory. TMA offloads address computation from SMs, enabling asynchronous tile transfers.

HBM (global tensors) → TMA unit (address generation + transfer) → SMEM (ready for compute)

TMA vs Manual Loads

| Aspect | Manual (LDG/STG) | TMA |
|---|---|---|
| Address computation | Consumes SM cycles | Offloaded to the TMA unit |
| Tile dimensions | Manual indexing | Descriptor-based |
| Multicast | Not available | Built-in (one load feeds multiple SMs) |
| Async copy | cp.async | Native, with barriers |
When to Use TMA

TMA shines for structured tile access patterns (GEMM, convolution, attention). For irregular access, manual loads may still be necessary. CuTe and Triton abstract TMA through their copy operations.
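In Triton you don't program TMA directly; you describe structured tile loads and let the compiler decide how to lower them. A hedged sketch, assuming a recent Triton: tl.make_block_ptr supplies the shape/stride metadata a TMA descriptor needs, and on Hopper the compiler may (depending on version and target) emit TMA for such loads. The kernel below simply copies one tile per program.

```python
import triton
import triton.language as tl

@triton.jit
def copy_tile_kernel(src_ptr, dst_ptr, M, N, stride_m, stride_n,
                     BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    # Block pointers carry descriptor-like metadata: base, tensor shape,
    # strides, tile offset, tile shape, and memory order.
    src = tl.make_block_ptr(base=src_ptr, shape=(M, N), strides=(stride_m, stride_n),
                            offsets=(pid_m * BLOCK_M, pid_n * BLOCK_N),
                            block_shape=(BLOCK_M, BLOCK_N), order=(1, 0))
    dst = tl.make_block_ptr(base=dst_ptr, shape=(M, N), strides=(stride_m, stride_n),
                            offsets=(pid_m * BLOCK_M, pid_n * BLOCK_N),
                            block_shape=(BLOCK_M, BLOCK_N), order=(1, 0))
    tile = tl.load(src, boundary_check=(0, 1))   # out-of-bounds rows/cols masked
    tl.store(dst, tile, boundary_check=(0, 1))
```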

05 — CHECKLIST

Optimization Priority Order

Memory optimization follows a hierarchy of impact. Address these in order:

  1. Memory Coalescing - Adjacent threads access adjacent addresses. Single biggest impact on bandwidth (see the sketch after this list).
  2. Data Reuse (Tiling) - Load once from HBM, use many times from SMEM. Transforms memory-bound kernels into compute-bound ones.
  3. Occupancy - Keep enough warps resident to hide latency. Balance registers/SMEM against parallelism.
  4. Bank Conflicts - Avoid same-bank access in SMEM. Pad arrays or change strides.
  5. Tensor Cores + TMA - Hardware acceleration. Requires proper alignment and data types.
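To illustrate item 1, here is a small Triton sketch of a coalesced row-sum; the kernel is ours, written for this example. Each program reads one row, so adjacent lanes touch consecutive fp32 words.

```python
import triton
import triton.language as tl

@triton.jit
def row_sum_kernel(x_ptr, out_ptr, M, N, stride_m, BLOCK_N: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_N)          # BLOCK_N must be a power of two
    mask = cols < N
    # Coalesced: consecutive lanes load consecutive addresses of one row
    x = tl.load(x_ptr + row * stride_m + cols, mask=mask, other=0.0)
    # Uncoalesced anti-pattern (don't): x_ptr + cols * stride_m + row
    tl.store(out_ptr + row, tl.sum(x, axis=0))

# Launch: row_sum_kernel[(M,)](x, out, M, N, x.stride(0), BLOCK_N=1024)
```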

Performance Targets

| Metric | Good | Great |
|---|---|---|
| Memory bandwidth (% of peak) | 60% | 80%+ |
| Compute throughput (% of peak) | 50% | 70%+ |
| Throughput vs cuBLAS | 50% | 80%+ |
06 — PRACTICE

Hands-On Labs

Part 2 Labs

  1. Profiling - Learn Nsight Compute
  2. Coalescing - Optimize memory access patterns
  3. Bank Conflicts - Eliminate shared memory bottlenecks
  4. Pipelining - Overlap compute and memory
  5. TMA (Hopper+) - Hardware-accelerated data movement
  6. Tensor Cores - MMA operations
  7. Optimized GEMM - Put it all together
