Phase II · Week 4 · Day 28 of 70 · 2.5 hours
"A compiler is a pipeline of lowering steps, each trading generality for specificity. If you can trace an operation from Python source to GPU instructions, you understand the whole stack."
| ← Previous | Next → | 📅 Week | 🔷 Phase | 📚 Curriculum |
|---|---|---|---|---|
| Day 27: Custom Triton Backend | Day 29: TVM Architecture Overview | Week 4: Triton & Kernel Engineering | Phase II: Compiler Fundamentals | ML Compilers |
This is the end of Phase II. Over the past two weeks, you've gone from Python-level graph capture to GPU-level Triton kernels. Before moving to TVM and the broader compiler ecosystem, we consolidate everything into a single mental model. The goal: you should be able to trace any PyTorch operation from model(x) through every compiler stage down to the Triton kernel that runs on hardware.
Layer 0: USER CODE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
@torch.compile
def model(x):
return F.gelu(x @ W + b)
│ torch.compile decorator
▼
Layer 1: TORCHDYNAMO (Graph Capture)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
• Intercepts Python bytecode via PEP 523
• Traces into FX Graph (symbolic execution)
• Inserts guards (dtype, shape, device, value)
• Handles graph breaks (unsupported ops → subgraphs)
Output: FX Graph with high-level ATen ops
call_function aten.mm.default (x, W)
call_function aten.add.Tensor (mm, b)
call_function aten.gelu.default (add)
│ Sent to AOTAutograd
▼
Layer 2: AOTAUTOGRAD (Differentiation)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
• Splits graph into forward + backward
• Decomposes high-level ops into primitives
gelu → mul, erf, add, mul (5 primitive ops)
• Produces joint forward/backward graph
Output: Decomposed FX Graph (ATen primitives only)
aten.mm, aten.add, aten.mul, aten.erf, ...
│ Passed to backend (default: Inductor)
▼
Layer 3: TORCHINDUCTOR (Compilation)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
3a. Lowering: ATen ops → Inductor IR nodes
(Pointwise, Reduction, ExternKernel)
3b. Fusion: Group fusable nodes
pointwise + pointwise → 1 kernel
mm stays as ExternKernel (unfused)
3c. Scheduling: Choose loop ordering, block sizes
Handle reduction strategies
3d. Codegen: Emit Triton source code (GPU)
or C++/OpenMP code (CPU)
Output: triton_poi_fused_add_mul_erf_0.py
+ wrapper code orchestrating kernel launches
│ Triton JIT compiler
▼
Layer 4: TRITON COMPILER
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
• Triton IR → Triton GPU IR → LLVM IR → PTX
• Block-level ops → thread-level scheduling
• Auto-manages shared memory, coalescing
• Applies autotuning for block sizes
Output: PTX assembly (→ cubin via ptxas)
│ CUDA driver
▼
Layer 5: GPU HARDWARE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
• SMs execute warps of 32 threads
• L1/L2 cache hierarchy
• HBM bandwidth: 1-3 TB/s
• Compute: hundreds of TFLOPS
┌───────────────────────┐
│ Python Source Code │
└───────────┬───────────┘
│
┌───────────▼───────────┐
│ FX Graph (ATen) │◄── Day 15-17: FX framework
│ nodes, edges, ops │ Day 18-19: passes & transforms
└───────────┬───────────┘
│
┌─────────────────┼─────────────────┐
│ │ │
┌─────────▼──────┐ ┌──────▼───────┐ ┌───────▼───────┐
│ Dynamo │ │ AOTAutograd │ │ Graph Breaks │
│ (capture) │ │ (grad split) │ │ (subgraphs) │
│ Day 25 │ │ Day 25 │ │ Day 25 │
└─────────┬──────┘ └──────┬───────┘ └───────────────┘
│ │
└────────┬───────┘
│
┌─────────▼─────────┐
│ Decomposed Graph │◄── primitives only
│ (ATen primitives) │ no gelu, layer_norm
└─────────┬─────────┘
│
┌─────────────┼──────────────┐
│ │ │
┌──────▼──────┐ ┌───▼────┐ ┌──────▼──────┐
│ Inductor │ │ Custom │ │ Other │
│ (default) │ │Backend │ │ Backends │
│ Day 26 │ │Day 27 │ │ (TRT,etc.) │
└──────┬──────┘ └───┬────┘ └─────────────┘
│ │
└─────┬──────┘
│
┌──────────▼──────────┐
│ Inductor IR │◄── Pointwise, Reduction,
│ (loop-level) │ ExternKernel, Template
└──────────┬──────────┘
│
┌──────────▼──────────┐
│ Fusion Groups │◄── minimize kernels & memory
└──────────┬──────────┘
│
┌──────────▼──────────┐
│ Triton Source │◄── Day 22-24: Triton language
│ (Python) │ blocks, masks, tiling
└──────────┬──────────┘
│
┌──────────▼──────────┐
│ Triton Compiler │◄── IR → LLVM → PTX
└──────────┬──────────┘
│
┌──────────▼──────────┐
│ GPU Execution │◄── warps, SMs, HBM
└─────────────────────┘
Every compiler is a series of lowering steps. Each step trades generality for specificity:
Generality ←──────────────────────→ Specificity
Python FX Graph Inductor IR Triton PTX
────── ──────── ────────── ────── ───
• Any code • Tensor ops • Loop nests • Blocks • Instructions
• Dynamic • Static shapes • Memory layout • Masks • Registers
• Objects • No objects • Fusion groups • Tiling • Warps
• Control flow • DAG only • Scheduled • JIT'd • Native
Key insight: information is lost at each step (you can't reconstruct arbitrary Python from PTX), but the remaining information is enough for the next stage to make good decisions.
N separate operations
┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐
│op│ │op│ │op│ │op│ │op│ │op│ 6 ops, 6 kernel launches
└──┘ └──┘ └──┘ └──┘ └──┘ └──┘ 12 memory round-trips
│
fusion
│
┌─────┴─────┐
│ fused_kern │ │extern│ 2 kernel launches
│ (4 ops) │ │(mm) │ 4 memory round-trips
└────────────┘ └──────┘
Fusion is the single most impactful optimization. It reduces: - Kernel launch overhead: $O(n) \to O(1)$ - Memory traffic: from $2n \times \text{tensor_size}$ to $2 \times \text{tensor_size}$ - The roofline shifts from memory-bound to compute-bound
First call: Second call (same shape):
┌─────────┐ ┌─────────┐
│ Dynamo │──▶ trace ──▶ compile ──▶ cache │ Dynamo │──▶ check guards
└─────────┘ (slow, ~1s) │ hit └─────────┘ │
▼ ▼
┌───────┐ ┌───────┐
│cached │ │cached │
│kernel │ │kernel │ ← fast!
└───────┘ └───────┘
Different shape:
┌─────────┐
│ Dynamo │──▶ check guards ──▶ FAIL ──▶ recompile ──▶ new cache entry
└─────────┘
Test yourself before checking the answers. Write your answers first, then compare.
Q1: What are the three stages of torch.compile's pipeline, and what does each produce?
Q2: What is a "graph break" and why does it reduce optimization opportunities?
Q3: In this code, how many Triton kernels will Inductor generate?
@torch.compile
def f(x, W):
y = x @ W # matmul
y = y + 1 # add
y = torch.relu(y) # relu
y = y * 0.5 # mul
return y.sum() # sum
Q4: Why does AOTAutograd decompose torch.gelu() into primitive ops?
Q5: What is the difference between a Pointwise kernel and a Reduction kernel in Inductor's codegen?
Q6: In Triton, what does tl.program_id(0) return and how does it relate to the CUDA concept of blockIdx.x?
Q7: You write a custom torch.compile backend. What is the function signature it must implement?
Q8: The fused kernel for y = relu(x * 2 + 1) loads from memory once and stores once. Without fusion, how many loads and stores would there be?
Q9: What does torch._inductor.config.debug = True do, and where does the output go?
Q10: When should you write a custom torch.compile backend instead of a custom lowering?
Map each concept to where it lives in the pipeline:
| Concept | Pipeline Stage | Key Insight |
|---|---|---|
| FX Graph | Dynamo output / IR | DAG of tensor ops, no Python control flow |
| Graph Breaks | Dynamo | Split graph → fewer fusion opportunities |
| Guards | Dynamo | Control caching and recompilation |
| Decomposition | AOTAutograd | High-level → primitives enables fusion |
| Forward/Backward split | AOTAutograd | Autograd runs ahead-of-time, not at runtime |
| Lowering | Inductor Stage 1 | ATen ops → Inductor IR nodes |
| Fusion | Inductor Stage 2 | Multiple ops → single kernel |
| Scheduling | Inductor Stage 3 | Loop ordering, block sizes, reductions |
| Codegen | Inductor Stage 4 | IR → Triton source code |
| Block programming | Triton | Programs process blocks, not elements |
| Autotuning | Triton | Search over BLOCK_SIZE, num_warps, etc. |
| Tiling | Triton (matmul) | Fit sub-blocks into SRAM for data reuse |
| Flash Attention | Triton pattern | Online softmax + tiling = $O(N)$ memory |
| Custom backend | torch.compile API | Replace or augment Inductor |
Before moving to Phase III (TVM, MLIR, and the broader compiler ecosystem), verify you can confidently answer these:
□ I can explain the full torch.compile pipeline (3 stages + Triton)
□ I can read an FX graph and identify ops, shapes, and data flow
□ I can write an FX graph transformation pass
□ I can write a Triton kernel for a pointwise operation
□ I can write a Triton kernel for matrix multiplication (tiled)
□ I understand why tiling improves arithmetic intensity
□ I can explain how Flash Attention achieves O(N) memory
□ I know what Inductor's fusion decisions look like and why
□ I can read Inductor-generated Triton code
□ I can register a custom torch.compile backend
□ I understand guards, graph breaks, and recompilation
□ I know when to use a custom lowering vs. a custom backend
If you checked 10+: You're ready for TVM. Phase III will feel like a natural extension.
If you checked 7–9: Review the gaps. Re-read the specific day's material and redo one exercise.
If you checked <7: Spend today's remaining time on the weakest areas. Consider re-doing Day 22 (Triton basics) or Day 25 (torch.compile) exercises before moving on.
Spend 10 minutes journaling on these. Writing forces clarity.
What was the single most surprising thing you learned in Phase II? Was it how much work torch.compile does behind a single decorator? How Triton manages to be competitive with CUDA? Something else?
Which concept felt hardest to internalize? Graph breaks and guards? Triton's block model? Inductor fusion rules? Identifying the hard parts tells you where to invest more time.
Can you explain the full pipeline to a colleague? Try it out loud: "When you call torch.compile, first Dynamo..." If you get stuck, that's the gap to fill.
What would you build with these tools? A custom kernel for your workload? A profiling backend? A specialized fusion pass? Having a project in mind makes Phase III more concrete.
How does this connect to your broader goals? Whether you're building ML infrastructure, optimizing models for production, or working on compiler research — where does this knowledge plug in?
Week 3 (Days 15-21): Compiler Infrastructure
──────────────────────────────────────────────
• FX Graph representation and tracing
• Writing compiler passes (constant folding, DCE, pattern matching)
• Operator fusion theory and practice
• Shape propagation and type inference
• Mini-project: building an FX optimization pass
Week 4 (Days 22-28): Triton & Kernel Engineering
──────────────────────────────────────────────────
• Triton's block-based programming model
• Matrix multiplication with tiling and SRAM reuse
• Flash Attention algorithm and implementation
• torch.compile internals (Dynamo → AOTAutograd → Inductor)
• Inductor code generation and fusion
• Writing custom torch.compile backends
• This reflection (connecting it all together)
Day 29 begins Phase III with TVM Architecture Overview. TVM takes the compiler ideas we've learned — IR, passes, scheduling, code generation — and applies them to a broader set of hardware targets: CPUs, GPUs, FPGAs, custom accelerators. Where Inductor is PyTorch-specific, TVM is framework-agnostic and hardware-agnostic. The patterns you've learned transfer directly.