Chapter 11

Multi-GPU Programming

Scaling training across multiple GPUs with data parallelism, tensor parallelism, and communication primitives. Understanding the topology and trade-offs that determine whether scaling actually speeds things up.

What You'll Learn
  1. Explain data, tensor, and pipeline parallelism trade-offs
  2. Use NCCL primitives for GPU-to-GPU communication
  3. Understand NVLink/NVSwitch topology and bandwidth implications
  4. Implement basic distributed data parallel training
  5. Describe ZeRO optimization stages and their memory savings
01 - WHY MULTI-GPU

Scaling Beyond One GPU

Large models don't fit on a single GPU. Even when they do, training is too slow. Multi-GPU training addresses both problems, but introduces communication overhead.

The Scale Problem

LLaMA-70B weights: 140 GB (FP16)
H100 memory: 80 GB HBM3
Training memory: ~4-8x the weight memory
GPUs needed: 8+ for LLaMA-70B

The LLaMA paper trained 65B parameters on 2048 A100 GPUs for 21 days. Scaling efficiently across thousands of GPUs is essential for frontier model training.

Scaling Efficiency

Perfect scaling means N GPUs give N× throughput. Reality falls short: communication overhead, synchronization stalls, and load imbalance all eat into the gains.

Scaling Efficiency Formula

Efficiency = actual_throughput / (N × single_GPU_throughput)
Good systems achieve 90%+ scaling efficiency at hundreds of GPUs. Megatron-LM reports 52% of peak hardware FLOPs at 3072 GPUs for GPT-3 scale models.
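As a quick sanity check, the formula can be expressed as a function (the throughput numbers below are made up for illustration):

```python
def scaling_efficiency(actual_throughput, single_gpu_throughput, n_gpus):
    """Measured throughput as a fraction of perfect N-times scaling."""
    return actual_throughput / (n_gpus * single_gpu_throughput)

# Hypothetical cluster: 8 GPUs, 100 samples/s each alone, 720 samples/s together
assert scaling_efficiency(720, 100, 8) == 0.9  # 90% scaling efficiency
```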

A 70B parameter model in FP16 needs ~140GB for weights alone. With Adam optimizer states (2x weights) and gradients, what's the minimum memory for training?
280 GB (weights + gradients)
560 GB (weights + gradients + 2x optimizer states)
140 GB (just weights)

02 - PARALLELISM TYPES

Data, Tensor, and Pipeline

There are three fundamental ways to distribute work across GPUs. Modern large-scale training combines all three in what's called 3D parallelism.

Data Parallelism

Same model on each GPU, different data batches. Gradients synchronized after backward.

GPU 0: Batch 0 | GPU 1: Batch 1 | GPU 2: Batch 2 | GPU 3: Batch 3

Pro: Simple, scales batch size
Con: Model must fit on one GPU

Tensor Parallelism

Split individual layers across GPUs. Each GPU computes part of each layer.

GPU 0: Col 0 | GPU 1: Col 1 | GPU 2: Col 2 | GPU 3: Col 3

Pro: Splits large layers
Con: High communication every layer

Pipeline Parallelism

Different layers on different GPUs. Data flows through GPUs sequentially.

GPU 0: L0-L7 | GPU 1: L8-L15 | GPU 2: L16-L23 | GPU 3: L24-L31

Pro: Low communication
Con: Pipeline bubbles, memory imbalance

When to Use Each

Strategy          | Best For                      | Communication Pattern      | Typical Scale
Data Parallel     | Model fits on 1 GPU           | AllReduce after each step  | Any number of GPUs
Tensor Parallel   | Large layers (big hidden dim) | AllReduce after each layer | 2-8 GPUs (NVLink required)
Pipeline Parallel | Many layers                   | P2P between stages         | 4-64 GPUs per pipeline

3D Parallelism

Large-scale training uses all three: tensor parallelism within a node (NVLink), pipeline parallelism across nodes, and data parallelism across replicas. Megatron-Turing NLG uses (TP=8) × (PP=8) × (DP=48) = 3072 GPUs.
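The mapping from a global rank to its place in the 3D grid can be sketched as follows, assuming the common convention that tensor-parallel ranks vary fastest so each TP group lands on NVLink-connected GPUs (a hypothetical helper, not Megatron's actual API):

```python
def rank_to_3d(global_rank, tp=8, pp=8, dp=48):
    """Map a global rank to (dp, pp, tp) coordinates, with tensor-parallel
    ranks varying fastest so TP groups sit on NVLink-connected GPUs."""
    assert 0 <= global_rank < tp * pp * dp
    return (global_rank // (tp * pp), (global_rank // tp) % pp, global_rank % tp)

assert rank_to_3d(0) == (0, 0, 0)
assert rank_to_3d(7) == (0, 0, 7)      # same node, same pipeline stage
assert rank_to_3d(8) == (0, 1, 0)      # next pipeline stage
assert rank_to_3d(3071) == (47, 7, 7)  # last of the 3072 GPUs
```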

Tensor parallelism requires AllReduce after every layer. Why is NVLink important for tensor parallelism?
NVLink provides ~10x more bandwidth than PCIe for frequent communication
NVLink is required for multi-GPU—PCIe doesn't support it
Tensor parallelism only works with NVLink hardware

03 - NCCL PRIMITIVES

Collective Communication Operations

NCCL (NVIDIA Collective Communications Library) provides optimized primitives for GPU-to-GPU communication. Understanding these operations is essential for distributed training.

AllReduce

The most common operation: sum gradients across all GPUs, result available on all GPUs.

AllReduce (Sum)
Before: GPU0 g0 | GPU1 g1 | GPU2 g2 | GPU3 g3
After:  GPU0 Σg | GPU1 Σg | GPU2 Σg | GPU3 Σg

AllGather

Concatenate data from all GPUs onto all GPUs. Used in tensor parallelism to reconstruct tensors.

AllGather
Before: GPU0 A    | GPU1 B    | GPU2 C    | GPU3 D
After:  GPU0 ABCD | GPU1 ABCD | GPU2 ABCD | GPU3 ABCD
Each GPU has the complete concatenated result

ReduceScatter

Reduce then scatter: each GPU gets a different portion of the reduced result. This is the key operation in ZeRO Stage 2.

ReduceScatter
Before: GPU0 g0   | GPU1 g1   | GPU2 g2   | GPU3 g3
After:  GPU0 Σ[0] | GPU1 Σ[1] | GPU2 Σ[2] | GPU3 Σ[3]
Each GPU has 1/N of the summed result
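These three operations can be modeled with plain Python lists (a semantic sketch of what each collective computes, not how NCCL actually moves data):

```python
def all_reduce(per_gpu):
    """Every GPU ends up with the elementwise sum of all inputs."""
    total = [sum(vals) for vals in zip(*per_gpu)]
    return [total[:] for _ in per_gpu]

def all_gather(per_gpu):
    """Every GPU ends up with the concatenation of all inputs."""
    cat = [x for shard in per_gpu for x in shard]
    return [cat[:] for _ in per_gpu]

def reduce_scatter(per_gpu):
    """GPU i ends up with the i-th shard of the elementwise sum."""
    n = len(per_gpu)
    total = [sum(vals) for vals in zip(*per_gpu)]
    shard = len(total) // n
    return [total[i * shard:(i + 1) * shard] for i in range(n)]

grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400], [0, 0, 0, 0]]
assert all_reduce(grads)[0] == [111, 222, 333, 444]
assert reduce_scatter(grads) == [[111], [222], [333], [444]]
# ReduceScatter followed by AllGather is exactly AllReduce:
assert all_gather(reduce_scatter(grads)) == all_reduce(grads)
```

The last assertion is the decomposition that ring AllReduce (and ZeRO-2) exploits.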

Communication Cost

Operation     | Data Transferred per GPU | Use Case
AllReduce     | 2 × data_size × (N-1)/N  | Gradient sync (DDP)
AllGather     | data_size × (N-1)/N      | Tensor parallel gather
ReduceScatter | data_size × (N-1)/N      | ZeRO gradient reduce
Broadcast     | data_size                | Send from one to all

Ring AllReduce

NCCL implements AllReduce using a ring algorithm that achieves optimal bandwidth utilization. Each GPU sends/receives 2 × data_size total, regardless of the number of GPUs—making it highly scalable.
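A toy simulation of the ring algorithm (illustrative only, not NCCL's implementation): each of N GPUs holds N chunks; a reduce-scatter phase and an all-gather phase each take N-1 steps, so each GPU sends 2(N-1) chunks, i.e. 2 × data_size × (N-1)/N bytes.

```python
def ring_all_reduce(chunks_per_gpu):
    """Simulate ring AllReduce on N GPUs, each holding N scalar chunks.
    Returns (final chunks per GPU, chunks sent per GPU)."""
    n = len(chunks_per_gpu)
    data = [list(g) for g in chunks_per_gpu]
    sent = [0] * n
    # Phase 1: reduce-scatter. At step s, GPU i sends chunk (i - s) mod n
    # to GPU i+1, which accumulates it. After n-1 steps, GPU i holds the
    # fully reduced chunk (i + 1) mod n.
    for step in range(n - 1):
        for src in range(n):
            dst, c = (src + 1) % n, (src - step) % n
            data[dst][c] += data[src][c]
            sent[src] += 1
    # Phase 2: all-gather. At step s, GPU i forwards its complete chunk
    # (i + 1 - s) mod n around the ring until every GPU has every chunk.
    for step in range(n - 1):
        for src in range(n):
            dst, c = (src + 1) % n, (src + 1 - step) % n
            data[dst][c] = data[src][c]
            sent[src] += 1
    return data, sent

grads = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
result, sent = ring_all_reduce(grads)
assert all(r == [28, 32, 36, 40] for r in result)  # every GPU has the sum
assert sent == [6, 6, 6, 6]  # 2*(N-1) chunks = 2 * data_size * (N-1)/N
```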

AllReduce transfers 2 × data_size × (N-1)/N per GPU. For large N, what does this approach?
2 × data_size (constant, independent of N)
N × data_size (linear in N)
data_size (just the gradient size)

04 - HARDWARE TOPOLOGY

NVLink, NVSwitch, and Interconnects

Communication bandwidth determines which parallelism strategies are practical. GPU interconnect technology varies dramatically in bandwidth.

Interconnect Bandwidth

Interconnect       | Bandwidth (bidirectional) | Typical Use
NVLink 4.0 (H100)  | 900 GB/s                  | Intra-node GPU-to-GPU
NVLink 3.0 (A100)  | 600 GB/s                  | Intra-node GPU-to-GPU
PCIe 5.0 x16       | ~64 GB/s                  | GPU to CPU, older systems
InfiniBand HDR     | ~50 GB/s                  | Inter-node
100 GbE            | ~12.5 GB/s                | Ethernet clusters

DGX H100 Topology

The DGX H100 uses NVSwitch to provide full-bisection bandwidth between all 8 GPUs:

GPU 0 | GPU 1 | GPU 2 | GPU 3
          NVSwitch
GPU 4 | GPU 5 | GPU 6 | GPU 7
Any-to-any: 900 GB/s per GPU

Topology Matters

Without NVSwitch, GPU pairs connect via NVLink but some pairs must route through intermediate GPUs. This creates topology-dependent performance—tensor parallel degree should match the number of GPUs with direct NVLink connections.

Multi-Node Considerations

Inter-node bandwidth (InfiniBand/Ethernet) is 10-50× slower than NVLink. This constrains which parallelism works across nodes: bandwidth-hungry tensor parallelism stays within a node, while pipeline and data parallelism, which communicate less often, span nodes.
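A back-of-envelope comparison using the ring AllReduce cost 2 × S × (N-1)/N from the table above, with an assumed per-layer activation size (batch, sequence, and hidden dimensions here are made up for illustration):

```python
def allreduce_seconds(data_bytes, bw_bytes_per_s, n_gpus):
    """Bandwidth term of ring AllReduce: 2 * S * (N-1)/N bytes per GPU."""
    return 2 * data_bytes * (n_gpus - 1) / n_gpus / bw_bytes_per_s

# Hypothetical per-layer tensor-parallel AllReduce:
# batch 8, sequence 4096, hidden 8192, FP16 activations (~0.5 GB)
act_bytes = 8 * 4096 * 8192 * 2
t_nvlink = allreduce_seconds(act_bytes, 900e9, 8)
t_ib = allreduce_seconds(act_bytes, 50e9, 8)
print(f"NVLink: {t_nvlink * 1e3:.1f} ms  InfiniBand: {t_ib * 1e3:.1f} ms")
# The ~18x gap repeats at every layer, which is why TP stays inside a node.
```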

NVLink provides 900 GB/s, InfiniBand provides ~50 GB/s. Why avoid tensor parallelism across nodes?
Tensor parallelism doesn't work over network—only NVLink
~18x lower bandwidth makes per-layer AllReduce too slow
InfiniBand doesn't support collective operations

05 - DATA PARALLEL

Implementing DDP

DistributedDataParallel (DDP) is the workhorse of multi-GPU training. Each GPU has a full model copy, processes different data, and synchronizes gradients.

Basic DDP Setup

PyTorch DDP
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

# Initialize process group (called once per process)
dist.init_process_group(backend="nccl")

# Get local rank (which GPU this process uses)
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Create model and wrap with DDP
model = MyModel().cuda()
model = DDP(model, device_ids=[local_rank])

# Create distributed sampler for data
train_sampler = DistributedSampler(train_dataset)
train_loader = DataLoader(
    train_dataset,
    sampler=train_sampler,
    batch_size=per_gpu_batch_size
)

# Training loop (gradients auto-synced by DDP)
for epoch in range(epochs):
    train_sampler.set_epoch(epoch)  # Shuffle per epoch
    for batch in train_loader:
        loss = model(batch)
        loss.backward()  # DDP syncs gradients here
        optimizer.step()
        optimizer.zero_grad()

How DDP Works

  1. Broadcast initial weights — rank 0 broadcasts parameters to all ranks at construction
  2. Bucket gradients — group gradients into ~25MB buckets for efficient AllReduce
  3. Overlap compute and communication — each bucket's AllReduce launches as soon as its gradients are ready, while backward continues
Launch with torchrun
# Single node, 8 GPUs
torchrun --nproc_per_node=8 train.py

# Multi-node (run on each node)
torchrun --nnodes=2 --nproc_per_node=8 \
    --rdzv_id=job1 --rdzv_backend=c10d \
    --rdzv_endpoint=master_ip:29500 train.py
Gradient Bucketing

DDP groups gradients into buckets for efficient AllReduce. Small AllReduces have high overhead—bucketing amortizes this. Default bucket size is 25MB, tunable via bucket_cap_mb.
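The bucket assignment can be sketched like this (a simplification: real DDP walks parameters in reverse registration order because gradients for the last layers are produced first, and it handles the first bucket specially):

```python
def bucketize(grad_numels, bucket_cap_mb=25, bytes_per_elem=2):
    """Greedily group gradients (in reverse registration order, as DDP does)
    into buckets of at most bucket_cap_mb megabytes."""
    cap = bucket_cap_mb * 1024 * 1024
    buckets, current, size = [], [], 0
    for i, numel in reversed(list(enumerate(grad_numels))):
        nbytes = numel * bytes_per_elem
        if current and size + nbytes > cap:
            buckets.append(current)  # bucket full: its AllReduce can launch
            current, size = [], 0
        current.append(i)
        size += nbytes
    if current:
        buckets.append(current)
    return buckets

# Five 5M-element FP16 gradients (10 MB each) with a 25 MB cap:
assert bucketize([5_000_000] * 5) == [[4, 3], [2, 1], [0]]
```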

Scaling Efficiency

DDP efficiency depends on the ratio of compute to communication:

Model Size | Gradient Size (FP16) | AllReduce Time (8×H100) | Typical Efficiency
1B params  | 2 GB                 | ~4 ms                   | >95%
7B params  | 14 GB                | ~31 ms                  | ~90%
70B params | 140 GB               | ~311 ms                 | Needs ZeRO

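The AllReduce times in the table can be approximated from the ring cost model; this gives a bandwidth-only lower bound, slightly below the table's figures, which also fold in launch latency and imperfect overlap:

```python
def grad_allreduce_ms(params_billions, n_gpus=8, bw_bytes_per_s=900e9):
    """Bandwidth-only ring AllReduce estimate for FP16 gradients, in ms."""
    grad_bytes = params_billions * 1e9 * 2  # 2 bytes per FP16 gradient
    return 2 * grad_bytes * (n_gpus - 1) / n_gpus / bw_bytes_per_s * 1e3

for p in (1, 7, 70):
    print(f"{p}B params: ~{grad_allreduce_ms(p):.0f} ms per AllReduce")
```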
DDP overlaps gradient AllReduce with backward computation. What enables this overlap?
GPU has separate compute and memory units
Gradient bucketing—AllReduce starts on filled buckets while backward continues
NCCL uses a separate CPU thread for communication

06 - ZERO OPTIMIZER

Memory-Efficient Data Parallelism

ZeRO (Zero Redundancy Optimizer) eliminates memory redundancy in data parallelism. Instead of each GPU storing full optimizer states, gradients, and parameters—they're partitioned.

ZeRO Stages

Baseline DDP — each GPU stores everything:

GPU 0: Weights | Gradients | Optimizer
GPU 1: Weights | Gradients | Optimizer
GPU 2: Weights | Gradients | Optimizer
GPU 3: Weights | Gradients | Optimizer

ZeRO Stage 1 — partition optimizer states:

GPU 0: Weights | Gradients | Opt 0
GPU 1: Weights | Gradients | Opt 1
GPU 2: Weights | Gradients | Opt 2
GPU 3: Weights | Gradients | Opt 3

ZeRO Stage 2 — partition gradients (ReduceScatter instead of AllReduce):

GPU 0: Weights | Grad 0 | Opt 0
GPU 1: Weights | Grad 1 | Opt 1
GPU 2: Weights | Grad 2 | Opt 2
GPU 3: Weights | Grad 3 | Opt 3

ZeRO Stage 3 — partition parameters (AllGather before each forward/backward):

GPU 0: W 0 | Grad 0 | Opt 0
GPU 1: W 1 | Grad 1 | Opt 1
GPU 2: W 2 | Grad 2 | Opt 2
GPU 3: W 3 | Grad 3 | Opt 3

Memory Savings

Stage        | Memory per GPU (N GPUs, FP16 + FP32 opt) | Communication
Baseline DDP | 2Ψ + 2Ψ + 12Ψ = 16Ψ                      | AllReduce
ZeRO-1       | 2Ψ + 2Ψ + 12Ψ/N = 4Ψ + 12Ψ/N             | AllReduce
ZeRO-2       | 2Ψ + 2Ψ/N + 12Ψ/N = 2Ψ + 14Ψ/N           | ReduceScatter + AllGather
ZeRO-3       | 16Ψ/N                                    | AllGather (2x per layer)

Ψ = model parameters. Adam FP32 optimizer states = 12 bytes/param (momentum + variance + master weight).
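The table's formulas, expressed as a small calculator (a sketch; Ψ in parameters, result in GB):

```python
def zero_memory_gb(params, n_gpus, stage=0):
    """Per-GPU training memory for FP16 weights/gradients plus FP32 Adam
    states (12 bytes/param). stage 0 = plain DDP, 1-3 = ZeRO stages."""
    weights, grads, opt = 2.0 * params, 2.0 * params, 12.0 * params
    if stage >= 1:
        opt /= n_gpus      # ZeRO-1: partition optimizer states
    if stage >= 2:
        grads /= n_gpus    # ZeRO-2: also partition gradients
    if stage >= 3:
        weights /= n_gpus  # ZeRO-3: also partition parameters
    return (weights + grads + opt) / 1e9

# 7B parameters on 8 GPUs:
assert zero_memory_gb(7e9, 8, stage=0) == 112.0  # 16 psi: over one H100's 80 GB
assert zero_memory_gb(7e9, 8, stage=3) == 14.0   # 16 psi / N: fits easily
```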

DeepSpeed ZeRO Stage 2
import deepspeed

# Config enables ZeRO Stage 2
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},  # Optional CPU offload
        "contiguous_gradients": True,
        "overlap_comm": True
    },
    "fp16": {"enabled": True},
    "train_batch_size": 32
}

model, optimizer, _, _ = deepspeed.initialize(
    model=model,
    config=ds_config
)
ZeRO-Offload

ZeRO-Offload extends ZeRO by offloading optimizer states and computation to CPU memory. This enables training 10B+ models on a single GPU, at the cost of CPU-GPU transfer overhead.

ZeRO Stage 3 reduces memory to 16Ψ/N per GPU. What's the trade-off?
Lower throughput due to CPU offload
AllGather needed before each layer's forward and backward
Only works with specific model architectures

07 - TENSOR PARALLELISM

Splitting Layers Across GPUs

Megatron-LM style tensor parallelism splits individual linear layers across GPUs. This is essential for models where a single layer doesn't fit in GPU memory.

Column-Parallel Linear

Split the weight matrix along the output dimension. Each GPU computes part of the output.

Column Parallel (A = XW, W split by columns)
# Weight W shape: (in_features, out_features)
# Split into W = [W_0, W_1] along columns

# GPU 0: A_0 = X @ W_0   (all of X, part of W)
# GPU 1: A_1 = X @ W_1   (all of X, part of W)

# Result: A = [A_0, A_1] (concatenated)
# Requires AllGather to get full A on all GPUs (if needed)

Row-Parallel Linear

Split the weight matrix along the input dimension. Requires input to be partitioned.

Row Parallel (B = AW, W split by rows, A partitioned)
# Weight W shape: (in_features, out_features)
# Split into W = [W_0; W_1] along rows
# Input A = [A_0, A_1] partitioned across GPUs

# GPU 0: B_0 = A_0 @ W_0  (partial product)
# GPU 1: B_1 = A_1 @ W_1  (partial product)

# Result: B = B_0 + B_1 (requires AllReduce)
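The column-then-row pattern can be checked numerically (a NumPy sketch with made-up shapes; ReLU stands in for the usual GeLU):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))     # activations: (batch, hidden)
W1 = rng.standard_normal((8, 16))   # FFN up-projection
W2 = rng.standard_normal((16, 8))   # FFN down-projection

# Column-parallel first linear: each "GPU" holds half the columns of W1.
A0, A1 = X @ W1[:, :8], X @ W1[:, 8:]   # no communication

# The elementwise activation applies independently to each column shard,
# so it also stays local.
H0, H1 = np.maximum(A0, 0), np.maximum(A1, 0)

# Row-parallel second linear: each GPU holds the matching rows of W2.
B0, B1 = H0 @ W2[:8], H1 @ W2[8:]       # partial products

# A single AllReduce (sum) recovers the full FFN output.
assert np.allclose(B0 + B1, np.maximum(X @ W1, 0) @ W2)
```

Only the final sum requires communication, matching the single AllReduce per FFN in the table above.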

Transformer Tensor Parallelism

In transformers, column-parallel and row-parallel layers are combined to minimize communication:

Layer             | Parallelism Type | Communication
QKV projection    | Column parallel  | None (input broadcast)
Attention output  | Row parallel     | AllReduce after
FFN first linear  | Column parallel  | None
FFN second linear | Row parallel     | AllReduce after

This pattern requires only 2 AllReduces per transformer layer (one after attention, one after FFN).

Tensor Parallelism Degree

Tensor parallel degree is typically 2, 4, or 8—matching GPUs connected by NVLink. Higher degrees have diminishing returns: more communication, smaller per-GPU matrices (worse GPU utilization).

Megatron-style tensor parallelism requires 2 AllReduces per transformer layer. Where do they occur?
After attention output projection and after FFN
Before and after each linear layer
Only during backward pass for gradient sync

08 - SUMMARY

Key Takeaways

Parallelism Strategies
  • Data parallel — simplest, scales batch size, model must fit on 1 GPU
  • Tensor parallel — splits layers, requires NVLink, typically 2-8 GPUs
  • Pipeline parallel — splits by layers, works across nodes, has bubble overhead
  • 3D parallelism — combine all three for maximum scale
NCCL Operations
  • AllReduce — sum to all, used in DDP gradient sync
  • AllGather — concat to all, used in tensor parallel and ZeRO-3
  • ReduceScatter — sum and partition, used in ZeRO-2
ZeRO Memory Efficiency
  • ZeRO-1 — partition optimizer states (~4x memory savings)
  • ZeRO-2 — + partition gradients (~8x savings)
  • ZeRO-3 — + partition parameters (~N× savings, more communication)