Phase V · Week 10 · Day 64 of 70 · 2.5 hours
"A single GPU can train GPT-2 in a week. GPT-4 required 25,000 GPUs for 3 months. Distributed training isn't optional — it's the only way modern AI exists."
| ← Previous | Next → | 📅 Week | 🔷 Phase | 📚 Curriculum |
|---|---|---|---|---|
| Day 63: Stop & Reflect #5 | Day 65: Data Parallel & FSDP | Week 10: Distributed Training & Capstone | Phase V: Training at Scale | ML Compilers |
Training a 70B parameter model with FP16 requires 140 GB just for the weights — already exceeding any single GPU's memory. Add optimizer states (2× for Adam), gradients (1×), and activations, and you need ~1 TB for training. Beyond memory, compute scaling laws (Kaplan et al., Chinchilla) show that optimal training requires compute budgets of $C \approx 6ND$ FLOPs, where $N$ is parameter count and $D$ is token count. For frontier models, this translates to $10^{24}$–$10^{25}$ FLOPs — physically impossible on a single device. Understanding distributed training fundamentals — parallelism strategies, collective communication, and their overhead — is essential for anyone working on training systems, compiler optimizations for training, or ML infrastructure.
The Chinchilla scaling law tells us the optimal compute budget:
$$C_{\text{optimal}} \approx 6 \times N \times D$$
where $N$ = parameters, $D$ = training tokens.
| Model | Params ($N$) | Tokens ($D$) | Compute ($C$) | A100 GPU-hours |
|---|---|---|---|---|
| GPT-2 | 1.5B | 40B | $3.6 \times 10^{20}$ | ~460 |
| LLaMA-2 7B | 7B | 2T | $8.4 \times 10^{22}$ | ~180K |
| LLaMA-2 70B | 70B | 2T | $8.4 \times 10^{23}$ | ~1.7M |
| GPT-4 (est.) | 1.8T | 13T | $1.4 \times 10^{26}$ | ~200M |
For a model with $N$ parameters in mixed precision:
Training Memory Breakdown (Mixed Precision + Adam)
═══════════════════════════════════════════════════
Component Size Example (7B model)
───────────────────────────────────────────────────────────
Weights (FP16) 2N bytes 14 GB
Gradients (FP16) 2N bytes 14 GB
Adam optimizer states:
- FP32 weights copy 4N bytes 28 GB
- First moment (m) 4N bytes 28 GB
- Second moment (v) 4N bytes 28 GB
───────────────────────────────────────────────────────────
Subtotal (no acts) 16N bytes 112 GB
Activations varies 50-200+ GB
───────────────────────────────────────────────────────────
Total 162-312 GB
Single A100 has 80 GB → Need at least 4 GPUs for 7B!
Parallelism Taxonomy
═══════════════════════════════════════════════════════════════
┌─────────────────────────────────────────────────────────┐
│ Full Training Iteration │
│ Data × Model ──► Forward ──► Loss ──► Backward ──► Update│
└─────────────────────────────────────────────────────────┘
DATA PARALLELISM: Split the data, replicate the model
─────────────────────────────────────────────────────
GPU 0: [Batch 0] × [Full Model] ──► grad₀ ─┐
GPU 1: [Batch 1] × [Full Model] ──► grad₁ ─┤──AllReduce──► avg grad
GPU 2: [Batch 2] × [Full Model] ──► grad₂ ─┤ → update all
GPU 3: [Batch 3] × [Full Model] ──► grad₃ ─┘
MODEL PARALLELISM (Tensor): Split layers across devices
─────────────────────────────────────────────────────────
GPU 0: [Full Batch] × [Layer cols 0:H/2] ─┐
├─ combine ──► next layer
GPU 1: [Full Batch] × [Layer cols H/2:H] ─┘
PIPELINE PARALLELISM: Split layers sequentially
─────────────────────────────────────────────────
GPU 0: [Layers 0-7] ──output──► GPU 1: [Layers 8-15]
──output──► GPU 2: [Layers 16-23]
──output──► GPU 3: [Layers 24-31]
| Strategy | Memory per GPU | Communication | Scaling | Best For |
|---|---|---|---|---|
| Data Parallel | Full model | Gradients (AllReduce) | Batch size | Model fits in 1 GPU |
| Tensor Parallel | Partial layer | Activations (AllReduce per layer) | Per-layer | Large layers |
| Pipeline Parallel | Subset of layers | Activations (point-to-point) | Depth | Many layers |
The backbone of distributed training is collective communication — operations that move data between all participating GPUs.
Collective Operations (4 GPUs, each has a vector)
═══════════════════════════════════════════════════════════════
BROADCAST: One-to-all
─────────────────────
GPU 0: [A] GPU 0: [A]
GPU 1: [ ] → GPU 1: [A]
GPU 2: [ ] GPU 2: [A]
GPU 3: [ ] GPU 3: [A]
REDUCE: All-to-one (sum/avg)
────────────────────────────
GPU 0: [A₀] GPU 0: [A₀+A₁+A₂+A₃]
GPU 1: [A₁] → GPU 1: [ ]
GPU 2: [A₂] GPU 2: [ ]
GPU 3: [A₃] GPU 3: [ ]
ALL-REDUCE: Reduce + Broadcast (every GPU gets result)
──────────────────────────────────────────────────────
GPU 0: [A₀] GPU 0: [A₀+A₁+A₂+A₃]
GPU 1: [A₁] → GPU 1: [A₀+A₁+A₂+A₃]
GPU 2: [A₂] GPU 2: [A₀+A₁+A₂+A₃]
GPU 3: [A₃] GPU 3: [A₀+A₁+A₂+A₃]
ALL-GATHER: Each sends its piece, all get everything
────────────────────────────────────────────────────
GPU 0: [A₀] GPU 0: [A₀|A₁|A₂|A₃]
GPU 1: [A₁] → GPU 1: [A₀|A₁|A₂|A₃]
GPU 2: [A₂] GPU 2: [A₀|A₁|A₂|A₃]
GPU 3: [A₃] GPU 3: [A₀|A₁|A₂|A₃]
REDUCE-SCATTER: Reduce + Scatter (each gets a shard of result)
──────────────────────────────────────────────────────────────
GPU 0: [a₀|b₀|c₀|d₀] GPU 0: [Σaᵢ]
GPU 1: [a₁|b₁|c₁|d₁] → GPU 1: [Σbᵢ]
GPU 2: [a₂|b₂|c₂|d₂] GPU 2: [Σcᵢ]
GPU 3: [a₃|b₃|c₃|d₃] GPU 3: [Σdᵢ]
$$\text{AllReduce}(x) = \text{AllGather}(\text{ReduceScatter}(x))$$
This decomposition is why FSDP works — you can perform half the communication during backward (ReduceScatter gradients) and the other half just-in-time during forward (AllGather parameters).
The naïve AllReduce sends all data to one node and broadcasts back — $O(N \cdot P)$ traffic through the root. Ring AllReduce distributes the work evenly.
Ring AllReduce (4 GPUs, data split into 4 chunks)
═══════════════════════════════════════════════════════════════
Phase 1: REDUCE-SCATTER (P-1 = 3 steps)
Each GPU sends one chunk clockwise, accumulates received chunk
Step 1: Step 2: Step 3:
GPU0 ──c0──► GPU1 GPU0 ──c3──► GPU1 GPU0 ──c2──► GPU1
GPU1 ──c1──► GPU2 GPU1 ──c0──► GPU2 GPU1 ──c3──► GPU2
GPU2 ──c2──► GPU3 GPU2 ──c1──► GPU3 GPU2 ──c0──► GPU3
GPU3 ──c3──► GPU0 GPU3 ──c2──► GPU0 GPU3 ──c1──► GPU0
After ReduceScatter: Each GPU holds the fully reduced chunk i
Phase 2: ALL-GATHER (P-1 = 3 steps)
Same ring pattern, but forward received chunks without accumulating
Total data transferred per GPU: 2 × (P-1)/P × N
────────────────────────────────────────────────
For P=4, each GPU sends 1.5N total (75% of full data)
For P=256, each GPU sends ~2N regardless of cluster size!
| Algorithm | Bandwidth Cost per GPU | Latency Cost |
|---|---|---|
| Naïve (tree) | $O(N)$ | $O(P \cdot \alpha)$ |
| Ring AllReduce | $2 \cdot \frac{P-1}{P} \cdot N$ | $2(P-1) \cdot \alpha$ |
| Recursive Halving-Doubling | $\frac{P-1}{P} \cdot N$ | $2 \log_2(P) \cdot \alpha$ |
Where $N$ = message size, $P$ = number of GPUs, $\alpha$ = per-message latency.
Key: Ring AllReduce is bandwidth-optimal — total traffic is independent of $P$ for large messages. But latency grows linearly. For small messages, tree-based algorithms with $O(\log P)$ latency are better.
NCCL (NVIDIA Collective Communications Library) is the de facto standard for GPU-to-GPU communication.
import torch
import torch.distributed as dist
import os
def setup_distributed():
"""Initialize distributed training with NCCL backend."""
# Environment variables set by torchrun / SLURM
rank = int(os.environ["RANK"]) # Global rank (0..world_size-1)
local_rank = int(os.environ["LOCAL_RANK"]) # Rank within this node
world_size = int(os.environ["WORLD_SIZE"])
# Initialize process group — NCCL for GPU communication
dist.init_process_group(
backend="nccl", # NCCL for GPU, gloo for CPU
rank=rank,
world_size=world_size,
)
# Bind this process to its GPU
torch.cuda.set_device(local_rank)
return rank, local_rank, world_size
def demonstrate_collectives(rank, world_size):
"""Show basic collective operations."""
device = torch.device(f"cuda:{rank}")
# Each GPU creates its own tensor
tensor = torch.tensor([rank * 10.0 + 1, rank * 10.0 + 2], device=device)
print(f"[Rank {rank}] Before AllReduce: {tensor}")
# AllReduce — sum across all GPUs (in-place)
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
print(f"[Rank {rank}] After AllReduce(SUM): {tensor}")
# All GPUs now have: [0*10+1 + 1*10+1 + ... , 0*10+2 + 1*10+2 + ...]
# AllGather — collect tensors from all GPUs
gathered = [torch.zeros(2, device=device) for _ in range(world_size)]
my_tensor = torch.tensor([rank * 1.0, rank * 2.0], device=device)
dist.all_gather(gathered, my_tensor)
print(f"[Rank {rank}] AllGather result: {gathered}")
# Broadcast — GPU 0 sends to all
data = torch.tensor([42.0, 43.0], device=device) if rank == 0 \
else torch.zeros(2, device=device)
dist.broadcast(data, src=0)
print(f"[Rank {rank}] After Broadcast from 0: {data}")
if __name__ == "__main__":
rank, local_rank, world_size = setup_distributed()
demonstrate_collectives(rank, world_size)
dist.destroy_process_group()
# Launch with: torchrun --nproc_per_node=4 this_script.py
NCCL automatically detects the GPU interconnect topology and chooses optimal algorithms:
GPU Interconnect Hierarchy (typical 8-GPU node)
═══════════════════════════════════════════════════════════════
Within a node:
┌───────────────────────────────────────────────┐
│ GPU 0 ══NVLink══ GPU 1 GPU 4 ══NVLink══ GPU 5 │
│ ║ ║ ║ ║ │
│ NVLink NVLink NVLink NVLink │
│ ║ ║ ║ ║ │
│ GPU 2 ══NVLink══ GPU 3 GPU 6 ══NVLink══ GPU 7 │
│ └──NVSwitch (900 GB/s bisection)──┘ │
└───────────────────────────────────────────────┘
Across nodes:
Node 0 ══InfiniBand (400 Gb/s)══ Node 1
══InfiniBand══ Node 2
══InfiniBand══ Node 3
Bandwidth hierarchy:
NVLink 4 (H100): 900 GB/s bidirectional
PCIe Gen5: 64 GB/s per direction
InfiniBand NDR: 50 GB/s (400 Gbps)
Ethernet 100GbE: 12.5 GB/s
The key question: how much time does communication add?
Timeline: Data-Parallel Training Step
═══════════════════════════════════════════════════════════════
WITHOUT overlap:
|── Forward ──|── Backward ──|── AllReduce ──|── Update ──|
| T_f | T_b | T_c | T_u |
Total = T_f + T_b + T_c + T_u
WITH overlap (DDP approach):
|── Forward ──|──── Backward ─────────────|── Update ──|
| grad_N ──► AllReduce_N |
| grad_N-1 ► AllReduce_N-1 |
| grad_N-2 ► AllReduce_N-2 |
| ...overlapped... |
Total ≈ T_f + max(T_b, T_c) + T_u ← Communication hidden!
The computation-to-communication ratio determines scaling efficiency:
$$\text{Efficiency} = \frac{T_{\text{compute}}}{T_{\text{compute}} + T_{\text{comm}}}$$
For data parallelism with AllReduce on gradient size $G$:
$$T_{\text{comm}} \approx 2 \cdot \frac{P-1}{P} \cdot \frac{G}{\text{bandwidth}}$$
For a 7B model ($G = 14$ GB in FP16) across 8 GPUs on InfiniBand (50 GB/s):
$$T_{\text{comm}} \approx 2 \times \frac{7}{8} \times \frac{14}{50} \approx 0.49 \text{ seconds}$$
If one training step takes 2 seconds of compute, efficiency = $\frac{2.0}{2.0 + 0.49} \approx 80\%$.
With computation-communication overlap, efficiency approaches 95%+.
Write a script that demonstrates AllReduce, AllGather, ReduceScatter, and Broadcast using torch.distributed. Run with torchrun --nproc_per_node=2. Verify that AllReduce equals ReduceScatter followed by AllGather.
Implement ring AllReduce in pure Python (no GPU needed) to understand the algorithm:
import numpy as np
def ring_allreduce(data_per_gpu: list[np.ndarray]) -> list[np.ndarray]:
"""
Simulate ring AllReduce on CPU.
Args:
data_per_gpu: list of arrays, one per simulated GPU
Returns:
list of arrays, each containing the sum of all inputs
"""
P = len(data_per_gpu)
N = len(data_per_gpu[0])
assert N % P == 0, "Data size must be divisible by GPU count"
chunk_size = N // P
# Split each GPU's data into P chunks
chunks = [np.split(d, P) for d in data_per_gpu]
# Phase 1: Reduce-Scatter (P-1 steps)
for step in range(P - 1):
for gpu in range(P):
send_idx = (gpu - step) % P
recv_idx = (gpu - step - 1) % P
sender = (gpu - 1) % P
# GPU receives chunk[recv_idx] from its left neighbor
# and accumulates it
# YOUR CODE: chunks[gpu][recv_idx] += chunks[sender][recv_idx]
pass # TODO
# Phase 2: All-Gather (P-1 steps)
for step in range(P - 1):
for gpu in range(P):
# Forward the fully-reduced chunk to the right
# YOUR CODE
pass # TODO
# Reconstruct full arrays
return [np.concatenate(chunks[gpu]) for gpu in range(P)]
# Test
data = [np.array([1.0, 2.0, 3.0, 4.0]) * (i + 1) for i in range(4)]
result = ring_allreduce(data)
expected = sum(data)
for i, r in enumerate(result):
assert np.allclose(r, expected), f"GPU {i}: {r} != {expected}"
print(f"GPU {i}: {r} ✓")
Calculate the communication-to-computation ratio for: - 7B model on 8× A100 (NVLink, 900 GB/s) - 70B model on 64× H100 across 8 nodes (InfiniBand, 400 Gbps) - 1T model on 4096× H100 across 512 nodes
Determine where data parallelism alone breaks down and other strategies are needed.
You've learned the communication primitives. Tomorrow we'll see how DDP and FSDP build on them — and how FSDP's ZeRO-style sharding cuts per-GPU memory from $16N$ bytes to as low as $\frac{16N}{P} + 2N$ bytes, making it possible to train 70B+ models on consumer-grade GPU clusters.