Chapter 11

Multi-GPU Programming

Scaling training across multiple GPUs with data parallelism, tensor parallelism, and communication primitives. Understanding the topology and trade-offs that determine whether scaling actually speeds things up.

What You'll Learn
  1. Explain data, tensor, and pipeline parallelism trade-offs
  2. Use NCCL primitives for GPU-to-GPU communication
  3. Understand NVLink/NVSwitch topology and bandwidth implications
  4. Implement basic distributed data parallel training
  5. Describe ZeRO optimization stages and their memory savings
01 - WHY MULTI-GPU

Scaling Beyond One GPU

Large models don't fit on a single GPU. Even when they do, training is too slow. Multi-GPU training addresses both problems, but introduces communication overhead.

The Scale Problem

LLaMA-70B weights: 140 GB (FP16)
H100 memory: 80 GB HBM3
Training memory: ~4-8x the weight memory
GPUs needed: 8+ for LLaMA-70B

The LLaMA paper trained 65B parameters on 2048 A100 GPUs for 21 days. Scaling efficiently across thousands of GPUs is essential for frontier model training.

Scaling Efficiency

Perfect scaling means N GPUs give N× throughput. Reality falls short: communication overhead, synchronization stalls, and load imbalance all eat into the gains.

Scaling Efficiency Formula

Efficiency = actual_throughput / (N × single_GPU_throughput)
Good systems achieve 90%+ scaling efficiency at hundreds of GPUs. Megatron-LM reports 52% of peak hardware FLOPs at 3072 GPUs for GPT-3 scale models.
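As a quick sanity check, the formula can be expressed as a function (the throughput numbers below are made up for illustration):

```python
def scaling_efficiency(actual_throughput, single_gpu_throughput, n_gpus):
    """Measured throughput as a fraction of perfect N-times scaling."""
    return actual_throughput / (n_gpus * single_gpu_throughput)

# Hypothetical cluster: 8 GPUs, 100 samples/s each alone, 720 samples/s together
assert scaling_efficiency(720, 100, 8) == 0.9  # 90% scaling efficiency
```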

A 70B parameter model in FP16 needs ~140GB for weights alone. With Adam optimizer states (2x weights) and gradients, what's the minimum memory for training?
280 GB (weights + gradients)
560 GB (weights + gradients + 2x optimizer states)
140 GB (just weights)

02 - PARALLELISM TYPES

Data, Tensor, and Pipeline

There are three fundamental ways to distribute work across GPUs. Modern large-scale training combines all three in what's called 3D parallelism.

Data Parallelism

Same model on each GPU, different data batches. Gradients synchronized after backward.

GPU 0: Batch 0 | GPU 1: Batch 1 | GPU 2: Batch 2 | GPU 3: Batch 3

Pro: Simple, scales batch size
Con: Model must fit on one GPU

Tensor Parallelism

Split individual layers across GPUs. Each GPU computes part of each layer.

GPU 0: Col 0 | GPU 1: Col 1 | GPU 2: Col 2 | GPU 3: Col 3

Pro: Splits large layers
Con: High communication every layer

Pipeline Parallelism

Different layers on different GPUs. Data flows through GPUs sequentially.

GPU 0: L0-L7 | GPU 1: L8-L15 | GPU 2: L16-L23 | GPU 3: L24-L31

Pro: Low communication
Con: Pipeline bubbles, memory imbalance

When to Use Each

Strategy          | Best For                      | Communication Pattern      | Typical Scale
Data Parallel     | Model fits on 1 GPU           | AllReduce after each step  | Any number of GPUs
Tensor Parallel   | Large layers (big hidden dim) | AllReduce after each layer | 2-8 GPUs (NVLink required)
Pipeline Parallel | Many layers                   | P2P between stages         | 4-64 GPUs per pipeline

3D Parallelism

Large-scale training uses all three: tensor parallelism within a node (NVLink), pipeline parallelism across nodes, and data parallelism across replicas. Megatron-Turing NLG uses (TP=8) × (PP=8) × (DP=48) = 3072 GPUs.
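The mapping from a global rank to its place in the 3D grid can be sketched as follows, assuming the common convention that tensor-parallel ranks vary fastest so each TP group lands on NVLink-connected GPUs (a hypothetical helper, not Megatron's actual API):

```python
def rank_to_3d(global_rank, tp=8, pp=8, dp=48):
    """Map a global rank to (dp, pp, tp) coordinates, with tensor-parallel
    ranks varying fastest so TP groups sit on NVLink-connected GPUs."""
    assert 0 <= global_rank < tp * pp * dp
    return (global_rank // (tp * pp), (global_rank // tp) % pp, global_rank % tp)

assert rank_to_3d(0) == (0, 0, 0)
assert rank_to_3d(7) == (0, 0, 7)      # same node, same pipeline stage
assert rank_to_3d(8) == (0, 1, 0)      # next pipeline stage
assert rank_to_3d(3071) == (47, 7, 7)  # last of the 3072 GPUs
```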

Tensor parallelism requires AllReduce after every layer. Why is NVLink important for tensor parallelism?
NVLink provides ~10x more bandwidth than PCIe for frequent communication
NVLink is required for multi-GPU—PCIe doesn't support it
Tensor parallelism only works with NVLink hardware

03 - NCCL PRIMITIVES

Collective Communication Operations

NCCL (NVIDIA Collective Communications Library) provides optimized primitives for GPU-to-GPU communication. Understanding these operations is essential for distributed training.

AllReduce

The most common operation: sum gradients across all GPUs, result available on all GPUs.

AllReduce (Sum)
Before: GPU0 g0 | GPU1 g1 | GPU2 g2 | GPU3 g3
After:  GPU0 Σg | GPU1 Σg | GPU2 Σg | GPU3 Σg

AllGather

Concatenate data from all GPUs onto all GPUs. Used in tensor parallelism to reconstruct tensors.

AllGather
Before: GPU0 A    | GPU1 B    | GPU2 C    | GPU3 D
After:  GPU0 ABCD | GPU1 ABCD | GPU2 ABCD | GPU3 ABCD
Each GPU has the complete concatenated result

ReduceScatter

Reduce then scatter: each GPU gets a different portion of the reduced result. This is the key operation in ZeRO Stage 2.

ReduceScatter
Before: GPU0 g0   | GPU1 g1   | GPU2 g2   | GPU3 g3
After:  GPU0 Σ[0] | GPU1 Σ[1] | GPU2 Σ[2] | GPU3 Σ[3]
Each GPU has 1/N of the summed result
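These three operations can be modeled with plain Python lists (a semantic sketch of what each collective computes, not how NCCL actually moves data):

```python
def all_reduce(per_gpu):
    """Every GPU ends up with the elementwise sum of all inputs."""
    total = [sum(vals) for vals in zip(*per_gpu)]
    return [total[:] for _ in per_gpu]

def all_gather(per_gpu):
    """Every GPU ends up with the concatenation of all inputs."""
    cat = [x for shard in per_gpu for x in shard]
    return [cat[:] for _ in per_gpu]

def reduce_scatter(per_gpu):
    """GPU i ends up with the i-th shard of the elementwise sum."""
    n = len(per_gpu)
    total = [sum(vals) for vals in zip(*per_gpu)]
    shard = len(total) // n
    return [total[i * shard:(i + 1) * shard] for i in range(n)]

grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400], [0, 0, 0, 0]]
assert all_reduce(grads)[0] == [111, 222, 333, 444]
assert reduce_scatter(grads) == [[111], [222], [333], [444]]
# ReduceScatter followed by AllGather is exactly AllReduce:
assert all_gather(reduce_scatter(grads)) == all_reduce(grads)
```

The last assertion is the decomposition that ring AllReduce (and ZeRO-2) exploits.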

Communication Cost

Operation     | Data Transferred per GPU | Use Case
AllReduce     | 2 × data_size × (N-1)/N  | Gradient sync (DDP)
AllGather     | data_size × (N-1)/N      | Tensor parallel gather
ReduceScatter | data_size × (N-1)/N      | ZeRO gradient reduce
Broadcast     | data_size                | Send from one to all

Ring AllReduce

NCCL implements AllReduce using a ring algorithm that achieves optimal bandwidth utilization. Each GPU sends/receives 2 × data_size total, regardless of the number of GPUs—making it highly scalable.
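A toy simulation of the ring algorithm (illustrative only, not NCCL's implementation): each of N GPUs holds N chunks; a reduce-scatter phase and an all-gather phase each take N-1 steps, so each GPU sends 2(N-1) chunks, i.e. 2 × data_size × (N-1)/N bytes.

```python
def ring_all_reduce(chunks_per_gpu):
    """Simulate ring AllReduce on N GPUs, each holding N scalar chunks.
    Returns (final chunks per GPU, chunks sent per GPU)."""
    n = len(chunks_per_gpu)
    data = [list(g) for g in chunks_per_gpu]
    sent = [0] * n
    # Phase 1: reduce-scatter. At step s, GPU i sends chunk (i - s) mod n
    # to GPU i+1, which accumulates it. After n-1 steps, GPU i holds the
    # fully reduced chunk (i + 1) mod n.
    for step in range(n - 1):
        for src in range(n):
            dst, c = (src + 1) % n, (src - step) % n
            data[dst][c] += data[src][c]
            sent[src] += 1
    # Phase 2: all-gather. At step s, GPU i forwards its complete chunk
    # (i + 1 - s) mod n around the ring until every GPU has every chunk.
    for step in range(n - 1):
        for src in range(n):
            dst, c = (src + 1) % n, (src + 1 - step) % n
            data[dst][c] = data[src][c]
            sent[src] += 1
    return data, sent

grads = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
result, sent = ring_all_reduce(grads)
assert all(r == [28, 32, 36, 40] for r in result)  # every GPU has the sum
assert sent == [6, 6, 6, 6]  # 2*(N-1) chunks = 2 * data_size * (N-1)/N
```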

AllReduce transfers 2 × data_size × (N-1)/N per GPU. For large N, what does this approach?
2 × data_size (constant, independent of N)
N × data_size (linear in N)
data_size (just the gradient size)

04 - HARDWARE TOPOLOGY

NVLink, NVSwitch, and Interconnects

Communication bandwidth determines which parallelism strategies are practical. GPU interconnect technology varies dramatically in bandwidth.

Interconnect Bandwidth

Interconnect       | Bandwidth (bidirectional) | Typical Use
NVLink 4.0 (H100)  | 900 GB/s                  | Intra-node GPU-to-GPU
NVLink 3.0 (A100)  | 600 GB/s                  | Intra-node GPU-to-GPU
PCIe 5.0 x16       | ~64 GB/s                  | GPU to CPU, older systems
InfiniBand HDR     | ~50 GB/s                  | Inter-node
100 GbE            | ~12.5 GB/s                | Ethernet clusters

DGX H100 Topology

The DGX H100 uses NVSwitch to provide full-bisection bandwidth between all 8 GPUs:

GPU 0 | GPU 1 | GPU 2 | GPU 3
          NVSwitch
GPU 4 | GPU 5 | GPU 6 | GPU 7
Any-to-any: 900 GB/s per GPU

Topology Matters

Without NVSwitch, GPU pairs connect via NVLink but some pairs must route through intermediate GPUs. This creates topology-dependent performance—tensor parallel degree should match the number of GPUs with direct NVLink connections.

Multi-Node Considerations

Inter-node bandwidth (InfiniBand/Ethernet) is 10-50× slower than NVLink. This constrains which parallelism works across nodes: bandwidth-hungry tensor parallelism stays within a node, while pipeline and data parallelism, which communicate less often, span nodes.
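A back-of-envelope comparison using the ring AllReduce cost 2 × S × (N-1)/N from the table above, with an assumed per-layer activation size (batch, sequence, and hidden dimensions here are made up for illustration):

```python
def allreduce_seconds(data_bytes, bw_bytes_per_s, n_gpus):
    """Bandwidth term of ring AllReduce: 2 * S * (N-1)/N bytes per GPU."""
    return 2 * data_bytes * (n_gpus - 1) / n_gpus / bw_bytes_per_s

# Hypothetical per-layer tensor-parallel AllReduce:
# batch 8, sequence 4096, hidden 8192, FP16 activations (~0.5 GB)
act_bytes = 8 * 4096 * 8192 * 2
t_nvlink = allreduce_seconds(act_bytes, 900e9, 8)
t_ib = allreduce_seconds(act_bytes, 50e9, 8)
print(f"NVLink: {t_nvlink * 1e3:.1f} ms  InfiniBand: {t_ib * 1e3:.1f} ms")
# The ~18x gap repeats at every layer, which is why TP stays inside a node.
```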

NVLink provides 900 GB/s, InfiniBand provides ~50 GB/s. Why avoid tensor parallelism across nodes?
Tensor parallelism doesn't work over network—only NVLink
~18x lower bandwidth makes per-layer AllReduce too slow
InfiniBand doesn't support collective operations

05 - DATA PARALLEL

Implementing DDP

DistributedDataParallel (DDP) is the workhorse of multi-GPU training. Each GPU has a full model copy, processes different data, and synchronizes gradients.

Basic DDP Setup

PyTorch DDP
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

# Initialize process group (called once per process)
dist.init_process_group(backend="nccl")

# Get local rank (which GPU this process uses)
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Create model and wrap with DDP
model = MyModel().cuda()
model = DDP(model, device_ids=[local_rank])

# Create distributed sampler for data
train_sampler = DistributedSampler(train_dataset)
train_loader = DataLoader(
    train_dataset,
    sampler=train_sampler,
    batch_size=per_gpu_batch_size
)

# Training loop (gradients auto-synced by DDP)
for epoch in range(epochs):
    train_sampler.set_epoch(epoch)  # Shuffle per epoch
    for batch in train_loader:
        loss = model(batch)
        loss.backward()  # DDP syncs gradients here
        optimizer.step()
        optimizer.zero_grad()

How DDP Works

  1. Broadcast initial weights — rank 0 broadcasts parameters to all ranks at construction
  2. Bucket gradients — group gradients into ~25MB buckets for efficient AllReduce
  3. Overlap compute and communication — each bucket's AllReduce launches as soon as its gradients are ready, while backward continues
Launch with torchrun
# Single node, 8 GPUs
torchrun --nproc_per_node=8 train.py

# Multi-node (run on each node)
torchrun --nnodes=2 --nproc_per_node=8 \
    --rdzv_id=job1 --rdzv_backend=c10d \
    --rdzv_endpoint=master_ip:29500 train.py
Gradient Bucketing

DDP groups gradients into buckets for efficient AllReduce. Small AllReduces have high overhead—bucketing amortizes this. Default bucket size is 25MB, tunable via bucket_cap_mb.
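The bucket assignment can be sketched like this (a simplification: real DDP walks parameters in reverse registration order because gradients for the last layers are produced first, and it handles the first bucket specially):

```python
def bucketize(grad_numels, bucket_cap_mb=25, bytes_per_elem=2):
    """Greedily group gradients (in reverse registration order, as DDP does)
    into buckets of at most bucket_cap_mb megabytes."""
    cap = bucket_cap_mb * 1024 * 1024
    buckets, current, size = [], [], 0
    for i, numel in reversed(list(enumerate(grad_numels))):
        nbytes = numel * bytes_per_elem
        if current and size + nbytes > cap:
            buckets.append(current)  # bucket full: its AllReduce can launch
            current, size = [], 0
        current.append(i)
        size += nbytes
    if current:
        buckets.append(current)
    return buckets

# Five 5M-element FP16 gradients (10 MB each) with a 25 MB cap:
assert bucketize([5_000_000] * 5) == [[4, 3], [2, 1], [0]]
```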

Scaling Efficiency

DDP efficiency depends on the ratio of compute to communication:

Model Size | Gradient Size (FP16) | AllReduce Time (8×H100) | Typical Efficiency
1B params  | 2 GB                 | ~4 ms                   | >95%
7B params  | 14 GB                | ~31 ms                  | ~90%
70B params | 140 GB               | ~311 ms                 | Needs ZeRO

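The AllReduce times in the table can be approximated from the ring cost model; this gives a bandwidth-only lower bound, slightly below the table's figures, which also fold in launch latency and imperfect overlap:

```python
def grad_allreduce_ms(params_billions, n_gpus=8, bw_bytes_per_s=900e9):
    """Bandwidth-only ring AllReduce estimate for FP16 gradients, in ms."""
    grad_bytes = params_billions * 1e9 * 2  # 2 bytes per FP16 gradient
    return 2 * grad_bytes * (n_gpus - 1) / n_gpus / bw_bytes_per_s * 1e3

for p in (1, 7, 70):
    print(f"{p}B params: ~{grad_allreduce_ms(p):.0f} ms per AllReduce")
```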
DDP overlaps gradient AllReduce with backward computation. What enables this overlap?
GPU has separate compute and memory units
Gradient bucketing—AllReduce starts on filled buckets while backward continues
NCCL uses a separate CPU thread for communication

06 - ZERO OPTIMIZER

Memory-Efficient Data Parallelism

ZeRO (Zero Redundancy Optimizer) eliminates memory redundancy in data parallelism. Instead of each GPU storing full optimizer states, gradients, and parameters—they're partitioned.

ZeRO Stages

Baseline DDP — each GPU stores everything:

GPU 0: Weights | Gradients | Optimizer
GPU 1: Weights | Gradients | Optimizer
GPU 2: Weights | Gradients | Optimizer
GPU 3: Weights | Gradients | Optimizer

ZeRO Stage 1 — partition optimizer states:

GPU 0: Weights | Gradients | Opt 0
GPU 1: Weights | Gradients | Opt 1
GPU 2: Weights | Gradients | Opt 2
GPU 3: Weights | Gradients | Opt 3

ZeRO Stage 2 — partition gradients (ReduceScatter instead of AllReduce):

GPU 0: Weights | Grad 0 | Opt 0
GPU 1: Weights | Grad 1 | Opt 1
GPU 2: Weights | Grad 2 | Opt 2
GPU 3: Weights | Grad 3 | Opt 3

ZeRO Stage 3 — partition parameters (AllGather before each forward/backward):

GPU 0: W 0 | Grad 0 | Opt 0
GPU 1: W 1 | Grad 1 | Opt 1
GPU 2: W 2 | Grad 2 | Opt 2
GPU 3: W 3 | Grad 3 | Opt 3

Memory Savings

Stage        | Memory per GPU (N GPUs, FP16 + FP32 opt) | Communication
Baseline DDP | 2Ψ + 2Ψ + 12Ψ = 16Ψ                      | AllReduce
ZeRO-1       | 2Ψ + 2Ψ + 12Ψ/N = 4Ψ + 12Ψ/N             | AllReduce
ZeRO-2       | 2Ψ + 2Ψ/N + 12Ψ/N = 2Ψ + 14Ψ/N           | ReduceScatter + AllGather
ZeRO-3       | 16Ψ/N                                    | AllGather (2x per layer)

Ψ = model parameters. Adam FP32 optimizer states = 12 bytes/param (momentum + variance + master weight).
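The table's formulas, expressed as a small calculator (a sketch; Ψ in parameters, result in GB):

```python
def zero_memory_gb(params, n_gpus, stage=0):
    """Per-GPU training memory for FP16 weights/gradients plus FP32 Adam
    states (12 bytes/param). stage 0 = plain DDP, 1-3 = ZeRO stages."""
    weights, grads, opt = 2.0 * params, 2.0 * params, 12.0 * params
    if stage >= 1:
        opt /= n_gpus      # ZeRO-1: partition optimizer states
    if stage >= 2:
        grads /= n_gpus    # ZeRO-2: also partition gradients
    if stage >= 3:
        weights /= n_gpus  # ZeRO-3: also partition parameters
    return (weights + grads + opt) / 1e9

# 7B parameters on 8 GPUs:
assert zero_memory_gb(7e9, 8, stage=0) == 112.0  # 16 psi: over one H100's 80 GB
assert zero_memory_gb(7e9, 8, stage=3) == 14.0   # 16 psi / N: fits easily
```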

DeepSpeed ZeRO Stage 2
import deepspeed

# Config enables ZeRO Stage 2
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},  # Optional CPU offload
        "contiguous_gradients": True,
        "overlap_comm": True
    },
    "fp16": {"enabled": True},
    "train_batch_size": 32
}

model, optimizer, _, _ = deepspeed.initialize(
    model=model,
    config=ds_config
)
ZeRO-Offload

ZeRO-Offload extends ZeRO by offloading optimizer states and computation to CPU memory. This enables training 10B+ models on a single GPU, at the cost of CPU-GPU transfer overhead.

ZeRO Stage 3 reduces memory to 16Ψ/N per GPU. What's the trade-off?
Lower throughput due to CPU offload
AllGather needed before each layer's forward and backward
Only works with specific model architectures

07 - TENSOR PARALLELISM

Splitting Layers Across GPUs

Megatron-LM style tensor parallelism splits individual linear layers across GPUs. This is essential for models where a single layer doesn't fit in GPU memory.

Column-Parallel Linear

Split the weight matrix along the output dimension. Each GPU computes part of the output.

Column Parallel (A = XW, W split by columns)
# Weight W shape: (in_features, out_features)
# Split into W = [W_0, W_1] along columns

# GPU 0: A_0 = X @ W_0   (all of X, part of W)
# GPU 1: A_1 = X @ W_1   (all of X, part of W)

# Result: A = [A_0, A_1] (concatenated)
# Requires AllGather to get full A on all GPUs (if needed)

Row-Parallel Linear

Split the weight matrix along the input dimension. Requires input to be partitioned.

Row Parallel (B = AW, W split by rows, A partitioned)
# Weight W shape: (in_features, out_features)
# Split into W = [W_0; W_1] along rows
# Input A = [A_0, A_1] partitioned across GPUs

# GPU 0: B_0 = A_0 @ W_0  (partial product)
# GPU 1: B_1 = A_1 @ W_1  (partial product)

# Result: B = B_0 + B_1 (requires AllReduce)
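The column-then-row pattern can be checked numerically (a NumPy sketch with made-up shapes; ReLU stands in for the usual GeLU):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))     # activations: (batch, hidden)
W1 = rng.standard_normal((8, 16))   # FFN up-projection
W2 = rng.standard_normal((16, 8))   # FFN down-projection

# Column-parallel first linear: each "GPU" holds half the columns of W1.
A0, A1 = X @ W1[:, :8], X @ W1[:, 8:]   # no communication

# The elementwise activation applies independently to each column shard,
# so it also stays local.
H0, H1 = np.maximum(A0, 0), np.maximum(A1, 0)

# Row-parallel second linear: each GPU holds the matching rows of W2.
B0, B1 = H0 @ W2[:8], H1 @ W2[8:]       # partial products

# A single AllReduce (sum) recovers the full FFN output.
assert np.allclose(B0 + B1, np.maximum(X @ W1, 0) @ W2)
```

Only the final sum requires communication, matching the single AllReduce per FFN in the table above.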

Transformer Tensor Parallelism

In transformers, column-parallel and row-parallel layers are combined to minimize communication:

Layer             | Parallelism Type | Communication
QKV projection    | Column parallel  | None (input broadcast)
Attention output  | Row parallel     | AllReduce after
FFN first linear  | Column parallel  | None
FFN second linear | Row parallel     | AllReduce after

This pattern requires only 2 AllReduces per transformer layer (one after attention, one after FFN).

Tensor Parallelism Degree

Tensor parallel degree is typically 2, 4, or 8—matching GPUs connected by NVLink. Higher degrees have diminishing returns: more communication, smaller per-GPU matrices (worse GPU utilization).

Megatron-style tensor parallelism requires 2 AllReduces per transformer layer. Where do they occur?
After attention output projection and after FFN
Before and after each linear layer
Only during backward pass for gradient sync

08 - SUMMARY

Key Takeaways

Parallelism Strategies
  • Data parallel — simplest, scales batch size, model must fit on 1 GPU
  • Tensor parallel — splits layers, requires NVLink, typically 2-8 GPUs
  • Pipeline parallel — splits by layers, works across nodes, has bubble overhead
  • 3D parallelism — combine all three for maximum scale
NCCL Operations
  • AllReduce — sum to all, used in DDP gradient sync
  • AllGather — concat to all, used in tensor parallel and ZeRO-3
  • ReduceScatter — sum and partition, used in ZeRO-2
ZeRO Memory Efficiency
  • ZeRO-1 — partition optimizer states (~4x memory savings)
  • ZeRO-2 — + partition gradients (~8x savings)
  • ZeRO-3 — + partition parameters (~N× savings, more communication)