Phase III · Week 7 · Day 49 of 70 · 2.5 hours
"You don't truly understand a system until you can draw its map from memory and explain where every edge leads."
You've completed Phase III — three weeks (Days 29–48) spanning the entire Apache TVM stack, the broader compiler ecosystem (MLIR, XLA, ONNX Runtime), MLC-LLM for LLM deployment, the contribution workflow, and compiler testing. That's 20 dense lessons of interconnected material. Before moving to Phase IV (inference optimization and deployment), you must consolidate: verify recall, identify gaps, and build the mental frameworks that let you make sound tool-selection decisions in practice. This session tests you, connects the dots, and certifies readiness.
Build this map from memory first, then verify against the reference:
Complete TVM Ecosystem — Phase III Concept Map
═══════════════════════════════════════════════════════════════════════════════
Input Models
├── PyTorch (torch.export / fx)
├── ONNX (onnx.load)
├── TFLite (flatbuffer)
└── Hugging Face (for MLC-LLM)
│
▼ Frontend Import (Day 30)
┌─────────────────────────────────────────────────────────────────────────┐
│ Relay IR (Days 30-31) │
│ • Typed functional graph: let-bound, SSA │
│ • Ops: nn.conv2d, nn.dense, add, reshape, ... │
│ • Passes: FuseOps, FoldConstant, SimplifyInference, ToMixedPrecision │
│ • Quantization: QNN dialect (Day 39) │
│ • Pattern matching: DFPatternCallback │
└────┬──────────────┬──────────────────────────┬─────────────────────────┘
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────────┐ ┌───────────────────────────┐
│TE (Day32)│ │ BYOC (Day 38)│ │ Relax IR (Day 41) │
│ Compute │ │ Partition → │ │ • Symbolic shapes (n, m) │
│ + Schedule│ │ cuDNN/TRT/ │ │ • Mutable state (KV cache)│
│ │ │ DNNL/ACL │ │ • Dataflow blocks │
└─────┬────┘ └──────────────┘ │ • call_tir bridge to TIR │
│ └───────────┬───────────────┘
▼ │
┌──────────────────────────────────────────────────┘
│ TIR — Tensor IR (Day 33)
│ • Low-level: loops, buffers, index arithmetic
│ • Schedule primitives: split, reorder, bind, vectorize
│ • Buffer scopes: global, shared, local
└────┬───────────────────────────────────────────────────┐
│ │
▼ Tuning (Days 36-37) ▼ Code Generation
┌────────────────────────────────┐ ┌────────────────────────┐
│ AutoTVM → template-based │ │ LLVM → x86/ARM/RISC-V│
│ AutoScheduler → Ansor (auto) │ │ CUDA → NVIDIA GPU │
│ MetaSchedule → unified, TIR │ │ Metal → Apple GPU │
│ │ │ Vulkan → cross-GPU │
│ Cost model: XGBoost/NN │ │ OpenCL → mobile GPU │
│ Search: evolutionary/SA │ │ C codegen → µTVM │
└────────────────────────────────┘ │ WebGPU → browser │
└──────────┬─────────────┘
│
┌───────────────────▼──────────────┐
│ Runtime (Days 34, 40) │
│ Graph Executor │ AOT Executor │
│ VM Executor │ µTVM (bare metal)│
│ RPC (remote) │ WASM (browser) │
└──────────────────────────────────┘
═══════ Broader Ecosystem (Week 7) ═══════
MLIR (Day 43) XLA / StableHLO (Day 44)
├── Dialect system ├── HLO IR (functional)
├── Linalg, Tensor, ├── XLA compiler (TPU/GPU/CPU)
│ Affine, SCF, Arith ├── StableHLO = portable HLO
├── Torch-MLIR └── torch_xla for PyTorch
│ (PyTorch → MLIR)
└── IREE (MLIR → deploy)
ONNX Runtime (Day 45) MLC-LLM (Day 46)
├── Execution Providers ├── HF model → Relax IR
│ (CPU, CUDA, TRT, etc.) ├── Quantization (q4f16_1)
├── Graph optimizations ├── Fused dequant-matmul
└── Broad model support ├── Auto-tuning per target
└── Universal deploy (phone→browser)
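The TIR box in the map lists `split` among the schedule primitives. As a plain-Python illustration (this is not TVM code, and the function names are hypothetical), splitting a loop of extent `n` by a factor of 32 produces an outer and an inner loop whose combined index visits the same iterations in the same order, with a guard for the case where `n` is not a multiple of the factor:

```python
def iterate_flat(n):
    """Original loop: for i in range(n)."""
    return [i for i in range(n)]


def iterate_split(n, factor=32):
    """Same iteration space after splitting i into (i_outer, i_inner)."""
    visited = []
    n_outer = (n + factor - 1) // factor  # ceil(n / factor) outer iterations
    for i_outer in range(n_outer):
        for i_inner in range(factor):
            i = i_outer * factor + i_inner  # reconstruct the original index
            if i < n:  # guard for the ragged last tile
                visited.append(i)
    return visited


# The split loop nest covers exactly the original iterations, in order.
assert iterate_split(100, 32) == iterate_flat(100)
```

The factor determines the inner extent, which is what later primitives such as `bind` or `vectorize` map onto threads or vector lanes, so a factor that matches the hardware width avoids the guard and wasted lanes.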
Rate each system across the dimensions that matter for deployment decisions:
ML Compiler Comparison Matrix
═════════════════════════════════════════════════════════════════════════════
Dimension │ TVM │ XLA │ Triton │ ORT │ MLIR
═══════════════════╪═══════════╪═══════════╪═══════════╪═══════════╪══════════
Primary use case │ Multi- │ TPU + │ NVIDIA GPU│ Production│ Compiler
│ target │ Google │ kernels │ inference │ infra-
│ deploy │ ecosystem │ │ │ structure
───────────────────┼───────────┼───────────┼───────────┼───────────┼──────────
Input format │ Relay/ │ StableHLO │ Python DSL│ ONNX │ Dialects
│ Relax/ONNX│ (from PT) │ (triton) │ │ (varied)
───────────────────┼───────────┼───────────┼───────────┼───────────┼──────────
GPU performance │ ★★★★☆ │ ★★★★☆ │ ★★★★★ │ ★★★☆☆ │ N/A
(NVIDIA) │ (tuned) │ (fused) │ (hand opt)│ (EP dep.) │ (infra)
───────────────────┼───────────┼───────────┼───────────┼───────────┼──────────
CPU performance │ ★★★★☆ │ ★★★☆☆ │ ✗ │ ★★★★☆ │ N/A
│ (LLVM) │ (XLA CPU) │ (GPU only)│ (MLAS) │
───────────────────┼───────────┼───────────┼───────────┼───────────┼──────────
Edge / mobile │ ★★★★★ │ ★★☆☆☆ │ ✗ │ ★★★☆☆ │ ★★★★☆
│ (µTVM) │ (TF Lite) │ │ (NNAPI) │ (IREE)
───────────────────┼───────────┼───────────┼───────────┼───────────┼──────────
Browser / WASM │ ★★★★★ │ ✗ │ ✗ │ ★★★☆☆ │ ★★☆☆☆
│ (WebGPU) │ │ │ (WASM EP) │
───────────────────┼───────────┼───────────┼───────────┼───────────┼──────────
Auto-tuning │ ★★★★★ │ ★★★☆☆ │ ★★★☆☆ │ ✗ │ ✗
│ MetaSched │ (internal)│ (Autotune)│ │
───────────────────┼───────────┼───────────┼───────────┼───────────┼──────────
Ease of use │ ★★☆☆☆ │ ★★★★☆ │ ★★★★☆ │ ★★★★★ │ ★☆☆☆☆
│ (steep) │ (torch_xla│ (Python) │ (1-liner) │ (C++ API)
───────────────────┼───────────┼───────────┼───────────┼───────────┼──────────
LLM support │ ★★★★☆ │ ★★★★☆ │ ★★★★★ │ ★★★☆☆ │ ★★☆☆☆
│ (MLC-LLM) │ (Pax/JAX) │ (vLLM) │ (GenAI) │
───────────────────┼───────────┼───────────┼───────────┼───────────┼──────────
Community size │ ★★★☆☆ │ ★★★★☆ │ ★★★★☆ │ ★★★★★ │ ★★★★★
│ (~900) │ (Google) │ (OpenAI) │ (MSFT) │ (LLVM)
───────────────────┼───────────┼───────────┼───────────┼───────────┼──────────
Open governance │ ★★★★★ │ ★★☆☆☆ │ ★★★☆☆ │ ★★★☆☆ │ ★★★★★
│ (Apache) │ (Google) │ (OpenAI) │ (MSFT) │ (LLVM)
═══════════════════╧═══════════╧═══════════╧═══════════╧═══════════╧══════════
Decision Flowchart:
┌─────────────────────────────────────────────────────────────────┐
│ What's your target? │
│ │
│ NVIDIA GPU only ──→ Triton (best perf) or TensorRT (production)│
│ Google TPU ──→ XLA via JAX or torch_xla (only compiler) │
│ Apple Silicon ──→ TVM/MLC-LLM (Metal) or CoreML │
│ Web browser ──→ TVM/MLC-LLM (WebGPU) or ORT (WASM) │
│ Multi-target ──→ TVM (best coverage) or ORT (easier) │
│ Edge / MCU ──→ TVM µTVM or IREE (MLIR) │
│ "Just works" ──→ ONNX Runtime (broadest model support) │
│ LLM serving ──→ vLLM/Triton (NVIDIA) or MLC-LLM (anywhere)│
└─────────────────────────────────────────────────────────────────┘
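The flowchart is effectively a lookup table, and writing it down as one can help with recall. The sketch below is a direct transcription; the dictionary keys and the `recommend` helper are hypothetical names chosen for illustration, not part of any library.

```python
# Transcription of the decision flowchart. Keys and helper name are hypothetical.
RECOMMENDATIONS = {
    "nvidia_gpu": ["Triton", "TensorRT"],
    "google_tpu": ["XLA"],
    "apple_silicon": ["TVM/MLC-LLM (Metal)", "CoreML"],
    "web_browser": ["TVM/MLC-LLM (WebGPU)", "ONNX Runtime (WASM)"],
    "multi_target": ["TVM", "ONNX Runtime"],
    "edge_mcu": ["TVM (µTVM)", "IREE"],
    "just_works": ["ONNX Runtime"],
    "llm_serving": ["vLLM/Triton (NVIDIA)", "MLC-LLM"],
}


def recommend(target):
    """Return the flowchart's suggestions for a target, first choice first."""
    try:
        return RECOMMENDATIONS[target]
    except KeyError:
        known = ", ".join(sorted(RECOMMENDATIONS))
        raise ValueError(f"unknown target {target!r}; expected one of: {known}") from None


print(recommend("web_browser"))
```

A table like this captures only the first branch of the decision; the comparison matrix still matters for the tie-breakers (ease of use, auto-tuning, governance).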
Answer from memory. Score yourself honestly — each question is worth 1 point.
Q1. What are the three major IRs in TVM's compilation stack, and what level of abstraction does each operate at?
Q2. Explain why (a + b) + c can differ from a + (b + c) in floating-point arithmetic. Give a concrete example with numbers.
Q3. What is the purpose of FuseOps in Relay? Name two types of operators it fuses together.
Q4. A TIR schedule splits loop i with a factor of 32 (for example, sch.split(i, factors=[None, 32])). What does this produce, and why is the factor choice important?
Q5. What is BYOC? Name two external libraries TVM can offload to via BYOC.
Q6. In MLC-LLM, what does "q4f16_1" mean? Break down each part.
Q7. Why does Relax exist when Relay already works? Name two capabilities Relax has that Relay lacks.
Q8. What tolerance would you use for np.allclose when comparing a matmul with $K=4096$ in float32? Justify with the formula.
Q9. Explain the difference between AutoTVM, AutoScheduler (Ansor), and MetaSchedule. When would you choose each?
Q10. You need to deploy a model on both NVIDIA GPUs and Apple M-series chips. Which ML compiler ecosystem would you choose and why?
Score Interpretation
════════════════════
10/10 ★★★★★ Phase III mastery — proceed to Phase IV immediately
8-9 ★★★★☆ Strong understanding — review weak areas, then proceed
6-7 ★★★☆☆ Adequate — revisit 2-3 weakest days before moving on
4-5 ★★☆☆☆ Gaps remain — re-read key lessons (30, 33, 36, 41, 46)
0-3 ★☆☆☆☆ Insufficient — redo Phase III before advancing
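Once you have scored yourself, Q2 and Q8 can be checked numerically. The sketch below uses only the standard library. It first shows one non-associative triple in float64, then emulates a float32 dot product of length K = 4096 and compares the observed error with the classical worst-case bound γ_K · Σ|a_i·b_i|, where γ_K = K·u / (1 − K·u) and u = 2⁻²⁴. That bound is a standard textbook result and is an assumption here; the formula taught on Day 48 may be stated differently. All helper names are hypothetical.

```python
import math
import random
import struct

# Q2: addition is not associative once rounding is involved.
a, b, c = 1e20, -1e20, 1.0
left = (a + b) + c   # a and b cancel exactly, then c is added
right = a + (b + c)  # c is absorbed when added to -1e20, then a cancels it
print(left, right)   # 1.0 0.0

U32 = 2.0 ** -24  # unit roundoff of IEEE-754 float32


def f32(x):
    """Round a float64 value to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot_f32(xs, ys):
    """Dot product with every multiply and every add rounded to float32."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = f32(acc + f32(x * y))
    return acc


# Q8: one output element of a matmul with reduction length K.
K = 4096
rng = random.Random(0)
xs = [f32(rng.uniform(-1.0, 1.0)) for _ in range(K)]
ys = [f32(rng.uniform(-1.0, 1.0)) for _ in range(K)]

# Products of two float32 values are exact in float64; fsum rounds only once.
reference = math.fsum(x * y for x, y in zip(xs, ys))
abs_sum = math.fsum(abs(x * y) for x, y in zip(xs, ys))

gamma_k = K * U32 / (1.0 - K * U32)  # roughly K * u = 2**-12
bound = gamma_k * abs_sum
error = abs(dot_f32(xs, ys) - reference)

print(f"observed error {error:.3e} <= worst-case bound {bound:.3e}")
```

The observed error is typically far below the worst case, which is why a tolerance scaled by √K·u (about 3.8e-6 for K = 4096) times the data magnitude is a common practical choice; treat the exact rtol and atol as a judgment call for your data.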
Recurring Patterns Across Phase III
════════════════════════════════════
Theme 1: PROGRESSIVE LOWERING
─────────────────────────────
Relay/Relax (graph) → TIR (loops) → LLVM IR → machine code
MLIR: Torch → Linalg → Affine → SCF → LLVM
XLA: StableHLO → HLO → target code
Pattern: always start high, lower through well-defined stages
Theme 2: SEPARATION OF CONCERNS
────────────────────────────────
TVM: "what to compute" (TE) vs "how to compute" (Schedule)
MLIR: "semantics" (Linalg) vs "mapping" (Affine/SCF)
Triton: "algorithm" (Python) vs "tiling" (compiler)
Pattern: separate algorithm specification from optimization
Theme 3: HARDWARE ABSTRACTION
─────────────────────────────
TVM targets: "llvm", "cuda", "metal", "vulkan"
MLIR backends: LLVM, SPIRV, GPU dialects
ORT providers: CPUExecutionProvider, CUDAExecutionProvider
Pattern: single IR → multiple backends through target descriptors
Theme 4: SEARCH OVER PROGRAMS
─────────────────────────────
AutoTVM: search schedule templates
MetaSchedule: search TIR transformations
Halide: search schedule space
Pattern: correct programs form a space; search finds the fast one
Theme 5: FUSION IS KING
───────────────────────
Relay FuseOps: fuse element-wise chains
XLA: whole-graph fusion
Triton: programmer-defined fusion
MLC-LLM: dequantize-matmul fusion
Pattern: eliminate intermediate memory traffic
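Theme 5 can be made concrete without any compiler. The sketch below is plain Python with hypothetical function names: it computes the same scale, shift, ReLU chain twice, once as three separate passes that materialize intermediates and once as a single fused pass, and counts the intermediate elements each version writes.

```python
def unfused(xs):
    """scale -> shift -> relu as three passes with two intermediate buffers."""
    scaled = [2.0 * x for x in xs]        # intermediate buffer 1
    shifted = [s + 1.0 for s in scaled]   # intermediate buffer 2
    out = [max(v, 0.0) for v in shifted]
    intermediates = len(scaled) + len(shifted)
    return out, intermediates


def fused(xs):
    """Same computation in one pass; nothing but the output is stored."""
    out = [max(2.0 * x + 1.0, 0.0) for x in xs]
    return out, 0


xs = [-2.0, -0.5, 0.0, 1.5]
out_a, traffic_a = unfused(xs)
out_b, traffic_b = fused(xs)

assert out_a == out_b  # fusion changes memory traffic, not the result
print(out_b, traffic_a, traffic_b)
```

For element-wise chains the saving grows linearly with tensor size, which is the motivation behind fusing them in a graph compiler; the dequantize-matmul case follows the same idea, keeping the dequantized weights out of memory.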
Phase IV covers inference optimization and deployment (model formats, quantization-at-scale, serving, edge deployment). Verify readiness:
tvm.testing utilities
Phase III Gave You:
═══════════════════
✓ How ML compilers work internally (IR → passes → codegen)
✓ How to write and optimize compute kernels (TE + TIR + schedules)
✓ How auto-tuning finds fast schedules (MetaSchedule)
✓ How to deploy to diverse hardware (LLVM, CUDA, Metal, WebGPU)
✓ How the broader ecosystem fits together (MLIR, XLA, ORT)
✓ How to test compiler correctness (differential testing, fuzzing)
Phase IV Builds On This:
════════════════════════
Days 50-51: Model Formats & ONNX
└── How models are serialized, exchanged, and standardized
Days 52-53: Quantization at Scale
└── Post-training quantization, QAT, GPTQ, AWQ
(builds on TVM QNN from Day 39, MLC-LLM from Day 46)
Days 54-55: Inference Serving
└── vLLM, TensorRT-LLM, serving systems
(applies TVM/Triton kernels in production)
Day 56: Edge Deployment
└── TFLite, CoreML, µTVM in production
(extends Day 34 and Day 40)
You already have the foundation — Phase IV applies it.
Day 50 begins Phase IV with model serialization formats. You'll deep-dive into ONNX — the specification, protobuf schema, operator semantics, and how to manually inspect and modify ONNX graphs. This is the bridge between training frameworks and the deployment pipeline you've been building.