Phase I · Week 2 · Day 14 of 70 · 2.5 hours
"You don't really understand something until you can explain it without opening your notes."
Two weeks, 14 days of dense material. Before we move into Phase II (Compiler Fundamentals), we need to consolidate. Research on learning shows that active recall, concept mapping, and identifying misconceptions produce 2–3× better retention than re-reading. This session is not optional downtime; it's where the material clicks into a coherent mental model.
Every concept from Days 1–13 connects. Trace the arrows to see how knowledge builds:
┌─────────────────────────────┐
│ ML Systems & Compilers │
│ WHY do we need them? │
└──────────────┬──────────────┘
│
┌──────────────────┴──────────────────┐
│ │
┌──────▼──────┐ ┌──────▼──────┐
│ Hardware │ │ Software │
│ Constraints │ │ Execution │
└──────┬──────┘ └──────┬──────┘
│ │
┌───────────┼───────────┐ ┌──────────┼──────────┐
│ │ │ │ │ │
┌────▼───┐ ┌────▼────┐ ┌────▼────┐ ┌─────▼────┐ ┌──▼───┐ ┌───▼───┐
│GPU Arch│ │ Memory │ │Compute │ │ Eager │ │Graph │ │Fusion │
│(Day1-3)│ │Hierarchy│ │ vs BW │ │ Mode │ │ Mode │ │(Day13)│
└────┬───┘ │(Day4-5) │ │(Day6-7) │ │ (Day12) │ │(Day12│ └───┬───┘
│ └────┬────┘ └────┬────┘ └─────┬────┘ └──┬───┘ │
│ │ │ │ │ │
│ ┌────▼────┐ │ ┌─────▼────┐ │ │
│ │ HBM │ │ │Dispatcher│ │ │
│ │ L2/SRAM │ │ │ per-op │ │ │
│ │Registers│ │ │ overhead │ │ │
│ └────┬────┘ │ └──────────┘ │ │
│ │ │ │ │
│ └─────┬─────┘ │ │
│ │ │ │
┌────▼────────────────▼──────┐ ┌──────────▼──────────▼──┐
│ Roofline Model │ │ Graph Capture │
│ AI = FLOPs / Bytes │ │ trace/script/fx/dynamo │
│ compute vs memory bound │ │ (Day 12) │
│ (Day 6-7) │ └───────────┬────────────┘
└────────────┬───────────────┘ │
│ │
└──────────────────┬───────────────────────┘
│
┌────────▼────────┐
│ FUSION │
│ Eliminate HBM │
│ round-trips by │
│ keeping interme-│
│ diates in regs │
│ (Day 13) │
└────────┬────────┘
│
┌────────▼────────┐
│ Profiling │
│ (Days 8-11) │
│ Measure before │
│ & after to │
│ validate gains │
└─────────────────┘
Hardware → Software path: GPU has limited memory bandwidth → element-wise ops are memory-bound → fusion reduces memory traffic → graph mode enables the compiler to find fusion opportunities
Measurement path: Profiling (Nsight, torch.profiler) → identify bottleneck (compute vs memory) → Roofline model quantifies gap → fusion/optimization targets the gap → re-profile to validate
Abstraction path: Eager mode (Python-friendly) → graph capture (multiple methods) → IR representation → compiler optimizations → generated kernel code
These are the core frameworks you should now carry in your head:
Speed: Fastest ◄──────────────────────────────► Slowest
Size: Smallest ◄─────────────────────────────► Largest
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│Registers │ │ SRAM │ │ L2 │ │ HBM │
│ ~20 TB/s │ │ ~19 TB/s │ │ ~6 TB/s │ │ ~2 TB/s │
│ 256 KB │ │ 20 MB │ │ 40 MB │ │ 80 GB │
│(per SM) │ │(per SM: │ │ (shared) │ │ (global) │
│ │ │ 192 KB) │ │ │ │ │
└──────────┘ └──────────┘ └──────────┘ └──────────┘
The goal of every optimization: keep data as far LEFT as possible.
Fusion keeps intermediates in registers instead of writing to HBM.
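The effect of fusion on HBM traffic can be estimated by counting tensor-sized transfers. The sketch below is a simplified model, not a measurement. It assumes that every operand and result has the same size, that each unfused op reads its inputs from HBM and writes its output back, and that a fused kernel touches HBM only for external inputs and the final output. The function name `hbm_transfers` and the op-list format are illustrative, not part of any library.

```python
def hbm_transfers(ops, fused):
    """Count tensor-sized HBM reads and writes for a linear chain of ops.

    `ops` is a list of (name, num_external_inputs) pairs in execution order.
    Every op after the first also consumes the previous op's result.
    Simplifying assumption: all tensors are the same size.
    """
    external_reads = sum(n for _, n in ops)
    if fused:
        # Intermediates stay in registers: read external inputs, write one output.
        return external_reads + 1
    # Unfused: every op writes its result, and every later op reads it back.
    intermediate_reads = len(ops) - 1
    writes = len(ops)
    return external_reads + intermediate_reads + writes


# relu(x + bias): the add reads x and bias; relu has no external inputs.
chain = [("add", 2), ("relu", 0)]
unfused = hbm_transfers(chain, fused=False)  # 2 reads + 1 read-back + 2 writes
fused = hbm_transfers(chain, fused=True)     # 2 reads + 1 write
print(unfused, fused)  # 5 3
```

Under these assumptions the fused version moves 3 tensors instead of 5. For a memory-bound chain, runtime scales roughly with that count.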
$$\text{Attainable FLOPS} = \min\left(\text{Peak FLOPS},\; \text{BW} \times \text{AI}\right)$$
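As a minimal sketch, the formula translates directly into a few lines of Python. The device numbers below (100 TFLOPS peak, 1 TB/s bandwidth) are hypothetical round figures chosen to keep the arithmetic easy to check, and the function names are illustrative.

```python
def ridge_point(peak_flops, bandwidth):
    """Arithmetic intensity (FLOPs/byte) at which the two roofs meet."""
    return peak_flops / bandwidth


def attainable_flops(peak_flops, bandwidth, ai):
    """Roofline: performance is capped by compute or by bandwidth * AI."""
    return min(peak_flops, bandwidth * ai)


def classify(peak_flops, bandwidth, ai):
    """Label a kernel by which roof limits it."""
    if ai >= ridge_point(peak_flops, bandwidth):
        return "compute-bound"
    return "memory-bound"


# Hypothetical device, not a real GPU spec.
PEAK_FLOPS = 100e12   # 100 TFLOPS
BANDWIDTH = 1e12      # 1 TB/s

print(ridge_point(PEAK_FLOPS, BANDWIDTH))             # 100.0 FLOPs/byte
print(attainable_flops(PEAK_FLOPS, BANDWIDTH, 0.25))  # 250000000000.0 (0.25 TFLOPS)
print(classify(PEAK_FLOPS, BANDWIDTH, 0.25))          # memory-bound
print(classify(PEAK_FLOPS, BANDWIDTH, 400.0))         # compute-bound
```

A kernel with an AI of 0.25 FLOPs/byte reaches at most 0.25% of this device's peak, no matter how well its arithmetic is tuned.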
Full Python ◄──────────────────────────────► Full Optimization
Eager jit.trace jit.script FX torch.compile
│ │ │ │ │
│ No graph │ Frozen │ Parsed │ Symbolic │ Dynamic
│ capture │ snapshot │ subset │ proxy │ bytecode
│ │ │ │ │ interception
│ │ │ │ │
Flexibility ★★★★★ ★★★ ★★ ★★★★
Optimization ☆ ★★ ★★ ★★★★★
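The "frozen snapshot" label for tracing can be illustrated with a toy recorder. This is a pure-Python sketch and not the torch.jit API: `OPS`, `model`, `run_eager`, and `toy_trace` are hypothetical names. The tracer records only the ops that ran for one example input, so the Python `if` is never seen and the branch taken during tracing is baked in.

```python
# Toy op table; stands in for real tensor operations.
OPS = {"double": lambda v: v * 2, "negate": lambda v: -v}


def model(x, apply):
    """A model with a data-dependent branch."""
    if x > 0:
        return apply("double", x)
    return apply("negate", x)


def run_eager(x):
    """Eager mode: the Python branch is re-evaluated on every call."""
    return model(x, lambda name, v: OPS[name](v))


def toy_trace(fn, example):
    """Record the ops executed for one example input, then replay them blindly."""
    tape = []

    def recording_apply(name, v):
        tape.append(name)
        return OPS[name](v)

    fn(example, recording_apply)

    def replay(x):
        for name in tape:
            x = OPS[name](x)
        return x

    return replay


traced = toy_trace(model, 3)       # the example input takes the x > 0 branch
print(run_eager(-5), traced(-5))   # 5 -10
```

The traced function agrees with eager mode only for inputs that follow the same branch as the example.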
Can these ops be fused?
- Are they element-wise with the same shapes?
  - YES: FUSE (pointwise fusion)
  - NO: Is one a producer and the other a consumer?
    - YES: Is the consumer element-wise?
      - YES: FUSE (vertical fusion)
      - NO: Is it a matmul epilogue?
        - YES: EPILOGUE FUSE
        - NO: CANNOT FUSE
    - NO: Do they share the same input?
      - YES: HORIZONTAL FUSE
      - NO: CANNOT FUSE

Special barrier: REDUCTIONS must complete before downstream fusion.
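The decision tree above can be written as a small function. This is a sketch of the tree as drawn, with each question passed in as a boolean. The function name and parameters are hypothetical, and the reduction barrier is not modeled.

```python
def fusion_decision(elementwise_same_shape,
                    producer_consumer=False,
                    consumer_elementwise=False,
                    matmul_epilogue=False,
                    shared_input=False):
    """Walk the fusion decision tree and return the outcome as a string."""
    if elementwise_same_shape:
        return "pointwise fusion"
    if producer_consumer:
        if consumer_elementwise:
            return "vertical fusion"
        if matmul_epilogue:
            return "epilogue fusion"
        return "cannot fuse"
    if shared_input:
        return "horizontal fusion"
    return "cannot fuse"


# Two element-wise ops on same-shaped tensors.
print(fusion_decision(True))                                          # pointwise fusion
# A producer feeding an element-wise consumer with a different shape.
print(fusion_decision(False, producer_consumer=True,
                      consumer_elementwise=True))                     # vertical fusion
# Independent ops that read the same input.
print(fusion_decision(False, shared_input=True))                      # horizontal fusion
# Unrelated ops.
print(fusion_decision(False))                                         # cannot fuse
```

Real compilers apply many more legality and profitability checks; the point here is only the order in which the questions are asked.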
┌─► Profile (measure) ─► Identify bottleneck ─► Hypothesize fix ─┐
│ │
└──────────── Validate improvement ◄── Apply optimization ◄───────┘
NEVER optimize without measuring first.
NEVER claim improvement without measuring after.
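A minimal before/after harness makes the loop concrete. The sketch below uses only the standard library and times two hypothetical CPU functions (`baseline` and `optimized`); the helper names are illustrative. It reports the median of several runs after a warmup, and it checks that both versions return the same result before comparing them.

```python
import statistics
import time


def median_runtime(fn, repeats=20, warmup=3):
    """Median wall-clock time of fn() in seconds, after warmup runs."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_fn, optimized_fn, **kwargs):
    """Ratio of baseline time to optimized time (values above 1 mean faster)."""
    before = median_runtime(baseline_fn, **kwargs)
    after = median_runtime(optimized_fn, **kwargs)
    return before / after


def baseline():
    total = 0
    for i in range(100_000):
        total += i * i
    return total


def optimized():
    return sum(i * i for i in range(100_000))


# Validate correctness first, then measure.
assert baseline() == optimized()
print(f"speedup: {speedup(baseline, optimized):.2f}x")
```

On a GPU, kernel launches are asynchronous, so a real harness would also synchronize (for example with `torch.cuda.synchronize()`) before reading the clock.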
Answer each question before checking. Rate your confidence (1–5).
Q1. An A100 has 2 TB/s memory bandwidth and 312 TFLOPS FP16 compute. What is the arithmetic intensity ridge point?
Q2. A relu() on an FP32 tensor performs 1 FLOP per element and accesses 8 bytes per element (4 read + 4 write). Is this operation compute-bound or memory-bound on an A100?
Q3. You have a chain: y = sigmoid(relu(x * w + b)). How many HBM read/write operations occur unfused (assuming each intermediate is written and read back)? How many with full fusion?
Q4. What is the key difference between torch.jit.trace and torch.jit.script?
Q5. Why does torch.jit.trace give wrong results for models with data-dependent if statements?
Q6. What is a "graph break" in torch.compile, and why is it bad?
Q7. Can x / x.sum() be fused into a single kernel? Why or why not?
Q8. What does the Roofline model tell you that a simple latency measurement does not?
Q9. Name three things a compiler can do with a computation graph that it cannot do in eager mode.
Q10. A colleague claims their custom CUDA kernel is "optimal" because it uses 100% of compute FLOPS. But the operation is element-wise ReLU. What's wrong with this claim?
Reality: An operation with 10× the FLOPs can be faster than one with fewer FLOPs if the high-FLOP operation is compute-bound and efficiently utilizing the GPU, while the low-FLOP operation is memory-bound and bottlenecked by bandwidth. Matrix multiplication (many FLOPs, high AI) can be faster than a chain of element-wise ops (few FLOPs, low AI) on the same data.
Reality: torch.compile uses eager mode as a fallback for anything it can't trace. Graph breaks insert eager regions between compiled graphs. The two modes coexist — graph mode is an optimization layer, not a replacement.
Reality: Fusion helps for memory-bound operations. For compute-bound operations (large matmuls), the matmul kernel already has high arithmetic intensity, so fusing the matmul itself is not where the gain comes from. Epilogue fusion (fusing a ReLU after a matmul) helps, but fusing two large matmuls together rarely does.
Reality: torch.compile has:
- Compilation overhead: First call triggers compilation (seconds to minutes)
- Graph breaks: Non-trivial models may have many breaks, limiting optimization scope
- Guard overhead: Guards are checked on every call, and a failed guard (for example, a new input shape) triggers recompilation
- Correctness risks: Subtle differences in floating-point behavior under fusion
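The compilation-overhead point can be made concrete with break-even arithmetic: compilation pays off only after enough calls to recover its one-time cost. The sketch below is plain arithmetic with hypothetical timings (a 60 s compile, 10 ms per eager call, 5 ms per compiled call); none of these numbers come from a measurement.

```python
import math


def break_even_calls(compile_ms, eager_ms_per_call, compiled_ms_per_call):
    """Number of calls needed before compilation time is recovered.

    Returns None when the compiled version is not faster per call,
    because the overhead is then never recovered.
    """
    saved_per_call = eager_ms_per_call - compiled_ms_per_call
    if saved_per_call <= 0:
        return None
    return math.ceil(compile_ms / saved_per_call)


# Hypothetical timings, in milliseconds.
print(break_even_calls(60_000, 10, 5))   # 12000
print(break_even_calls(60_000, 10, 10))  # None
```

A long training run clears this threshold easily; a short script or a model that recompiles often may not.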
Reality: Registers are not addressable memory — they're operand storage for ALU instructions. You can't take a pointer to a register. The compiler assigns variables to registers; you control this indirectly through code structure. In GPU programming (Triton/CUDA), register pressure affects occupancy, which affects latency hiding.
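The link between register pressure and occupancy can be sketched with simple arithmetic. The 256 KB register file per SM comes from the hierarchy above (65,536 32-bit registers); the limit of 2,048 resident threads per SM is an assumption for an A100-class GPU. The model ignores allocation granularity, shared-memory limits, and block-size limits, so treat it as an upper bound.

```python
REGISTERS_PER_SM = 65_536    # 256 KB register file / 4 bytes per 32-bit register
MAX_THREADS_PER_SM = 2_048   # assumption: A100-class resident-thread limit


def register_limited_occupancy(registers_per_thread):
    """Upper bound on occupancy when registers are the only constraint."""
    threads_by_registers = REGISTERS_PER_SM // registers_per_thread
    resident_threads = min(threads_by_registers, MAX_THREADS_PER_SM)
    return resident_threads / MAX_THREADS_PER_SM


for regs in (32, 64, 128):
    print(regs, register_limited_occupancy(regs))
# 32 1.0
# 64 0.5
# 128 0.25
```

Doubling the registers a kernel uses per thread halves the number of threads available to hide memory latency, which is why a heavily fused kernel is not automatically faster.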
Check each box honestly. If you can't check ≥8/10, revisit the indicated days.
- [ ] Use torch.profiler to capture and interpret a Chrome trace (Days 8–11)
- [ ] Use torch.compile and inspect generated Triton code with TORCH_LOGS (Days 12–13)

| Score | Assessment | Action |
|---|---|---|
| 10/10 | Ready for Phase II | Proceed to Day 15 |
| 8–9/10 | Minor gaps | Review indicated days, then proceed |
| 5–7/10 | Significant gaps | Re-do the exercises from weak areas |
| <5/10 | Foundation incomplete | Revisit Week 1 before continuing |
Spend 10 minutes writing brief answers to these. Writing forces clarity.
What was the single most surprising thing you learned in Phase I? Why did it surprise you — what was your prior mental model?
Draw the data flow for a single y = relu(x + bias) call from Python through the dispatcher to GPU execution. Where does time go in eager mode? Where does fusion save time?
If you were designing a new ML framework from scratch, would you start with eager or graph mode? What would you sacrifice?
Name one topic where you feel shaky. Write one specific question you still have. (This becomes your personal focus for Phase II review.)
Explain to an imaginary colleague (who knows Python but not ML systems) why torch.compile can make their training loop 2× faster without changing any model code.
What changes in Phase II:

| Phase I (Days 1-14) | Phase II (Days 15-28) |
|---|---|
| "What happens on the GPU" | "How compilers generate code" |
| GPU architecture | Compiler IR design |
| Memory hierarchy | Optimization passes |
| Profiling & measurement | Lowering & code generation |
| Eager vs graph mode | Compiler frontend/backend |
| Operator fusion (intuition) | Formal fusion algorithms |

You now know WHY things are slow. You'll learn HOW compilers fix it.
torch.compile as the primary optimization path, but understanding why it works requires the hardware foundations from Phase I

Day 15: Compiler 101 for ML opens Phase II. We leave the GPU and zoom out to the classic compiler pipeline (lexing, parsing, IR, optimization passes, code generation) and see how ML compilers (XLA, TVM, TorchInductor) map onto this structure. The concepts you've built about what to optimize meet the machinery of how to optimize it.