Phase III · Week 6 · Day 42 of 70 · 2.5 hours
"To know that you know what you know, and to know that you do not know what you do not know — that is true knowledge."
You've spent two dense weeks (Days 29–41) inside Apache TVM — from architecture overview through Relay, TE, TIR, tuning, BYOC, quantization, edge deployment, and Relax. That's 13 days of interconnected material. Before moving to MLIR and the broader compiler ecosystem (Phase IV), you need to consolidate. This reflection session tests retention, builds connections between concepts, and equips you with a practical decision framework for choosing the right tool in real deployment scenarios.
Map every major concept from Weeks 5–6 and draw the connections:
TVM Compilation Stack — Full Picture
═══════════════════════════════════════════════════════════════════════════
PyTorch / ONNX / TFLite
│
▼ Frontend Import (Day 30)
┌─────────────────────────────────────────────────────────────────┐
│ Relay IR (Day 30-31) │
│ • Functional graph: %x = nn.conv2d(%data, %weight) │
│ • Type/shape inference, pattern matching │
│ • Optimization passes: FoldConstant, FuseOps, SimplifyInference│
│ • QNN dialect for quantized models (Day 39) │
└───────┬───────────────┬──────────────────────────┬──────────────┘
│ │ │
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────────┐
│ TE (Day 32) │ │ BYOC (Day 38)│ │ Relax (Day 41) │
│ Compute decl │ │ Partition to │ │ Next-gen IR │
│ A[i,j] = │ │ external lib │ │ • Symbolic shapes│
│ B[i,k]*C[k,j│ │ (cuDNN, DNNL,│ │ • Dataflow blocks│
└──────┬───────┘ │ TensorRT) │ │ • call_tir │
│ └──────────────┘ └─────────┬────────┘
▼ │
┌──────────────┐ │
│ TIR (Day 33) │◀───────────────────────────────────┘
│ Low-level IR │ (Relax calls TIR directly)
│ • Loops │
│ • Buffers │
│ • Schedules │
└──────┬───────┘
│
▼ Schedule Transformations
┌───────────────────────────────────────────────────────────┐
│ Tuning Layer (Days 36-37) │
│ │
│ AutoTVM AutoScheduler MetaSchedule │
│ (template) (Ansor, auto) (unified, TIR-native) │
│ Day 36 Day 36 Day 37 │
│ │
│ Cost Model: XGBoost → predicts runtime from TIR features │
│ Search: random/GA/evolutionary over schedule knobs │
└──────────────────────────┬────────────────────────────────┘
│
▼ Code Generation
┌───────────────────────────────────────────────────────────┐
│ Target Backends │
│ │
│ LLVM → x86, ARM, RISC-V CUDA/ROCm → GPU │
│ C codegen → microcontrollers Hexagon → Qualcomm DSP │
│ Vulkan/Metal/OpenCL → mobile GPU │
└──────────────────────────┬────────────────────────────────┘
│
▼ Deployment (Days 34, 40)
┌───────────────────────────────────────────────────────────┐
│ Runtime Layer │
│ │
│ Graph Executor AOT Executor µTVM (CRT) │
│ (~300 KB, JSON) (~10 KB, C) (bare-metal) │
│ Linux/Android Linux/RTOS Arduino/Zephyr │
│ Raspberry Pi Jetson Cortex-M4 │
│ │
│ RPC: remote tune + benchmark on real hardware (Day 40) │
│ Quantization: INT8 via QNN calibration (Day 39) │
└───────────────────────────────────────────────────────────┘
For each arrow in the concept map, write a one-sentence explanation of the transformation:
| From → To | What Happens |
|---|---|
| PyTorch → Relay | relay.frontend.from_pytorch takes a TorchScript module (for example, one produced by torch.jit.trace) and converts its ops to Relay operators |
| Relay → TE | FuseOps pass groups operators; each fused group gets a TE compute declaration |
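| TE → TIR | te.create_prim_func() converts a TE compute declaration into a TIR PrimFunc, which tir.Schedule then transforms; the legacy path lowers a te.Schedule and its tensors with tvm.lower() |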
| TIR → LLVM/CUDA | TIR codegen emits LLVM IR (for CPUs) or CUDA kernels (for GPUs) |
| Relay → BYOC | Pattern matching annotates subgraphs; partitioning extracts them for external libs |
| Relax → TIR | call_tir directly references TIR functions (no lowering pass needed) |
| TIR → Tuning | MetaSchedule explores schedule knobs; cost model predicts; runner measures |
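The last row compresses a loop that is easier to see in code. The following is a toy sketch in plain Python, not the MetaSchedule API: `measure`, `NearestNeighborCostModel`, and `tune` are hypothetical names, the "schedule knob" is a single tile size, and the latency surface is invented. It only illustrates the division of labor: the search proposes candidates, the cost model ranks them cheaply, and the runner measures only the most promising few.

```python
import random


def measure(tile):
    """Stand-in for the runner: a made-up latency (ms) with its minimum at tile = 32."""
    return 1.0 + abs(tile - 32) / 32.0


class NearestNeighborCostModel:
    """Toy cost model: predicts latency from the closest already-measured candidate."""

    def __init__(self):
        self.history = {}  # tile size -> measured latency

    def update(self, tile, latency):
        self.history[tile] = latency

    def predict(self, tile):
        if not self.history:
            return 0.0  # no data yet, so every candidate looks equally good
        nearest = min(self.history, key=lambda t: abs(t - tile))
        return self.history[nearest]


def tune(candidates, rounds=3, measure_per_round=2, seed=0):
    """Return (best_tile, measurements) after a few predict-then-measure rounds."""
    rng = random.Random(seed)
    model = NearestNeighborCostModel()
    for _ in range(rounds):
        unmeasured = [c for c in candidates if c not in model.history]
        if not unmeasured:
            break
        rng.shuffle(unmeasured)             # search: propose candidates in random order
        unmeasured.sort(key=model.predict)  # cost model: rank by predicted latency
        for tile in unmeasured[:measure_per_round]:
            model.update(tile, measure(tile))  # runner: measure only the top few
    best = min(model.history, key=model.history.get)
    return best, model.history


if __name__ == "__main__":
    best, history = tune([1, 2, 4, 8, 16, 32, 64, 128])
    print(f"measured {len(history)} of 8 candidates, best tile so far: {best}")
```

With the default settings only six of the eight candidates are ever measured, which is the point of the cost model: real measurements are expensive, so predictions decide where to spend them.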
Answer each question, then check against the reference answers below. Score yourself honestly.
Q1. What are the three main IRs in the classic TVM stack, ordered from highest to lowest abstraction?
Q2. In TE, what is the difference between a compute declaration and a schedule? Give a one-line example of each.
Q3. Name three Relay optimization passes and describe what each does in one sentence.
Q4. What problem does operator fusion solve, and what is the primary performance benefit?
Q5. In MetaSchedule, what role does the cost model play, and why is it necessary?
Q6. Explain the BYOC workflow in three steps.
Q7. What is the formula for converting a floating-point value $x$ to a quantized integer $q$ using scale $s$ and zero point $z$?
Q8. What is the key difference between the Graph Executor and the AOT Executor? When would you choose each?
Q9. In Relax, what does R.call_tir(func, args, out_sinfo) do that Relay's lowering cannot?
Q10. Why are first-class dynamic shapes important for LLM serving, and how does Relax handle them differently from Relay?
| Score | Assessment | Action |
|---|---|---|
| 9–10 | Excellent — solid grasp of the full TVM stack | Proceed to Phase IV |
| 7–8 | Good — minor gaps | Re-read the missed topics, then proceed |
| 5–6 | Fair — several gaps | Re-do the hands-on exercises for missed areas |
| ≤4 | Needs review | Revisit Days 29–41 before continuing |
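As a reference point for Q7, the affine (asymmetric) mapping is usually written as below. The clamp bounds $q_{\min}$ and $q_{\max}$ do not appear in the question; they are assumed here to be the limits of the target integer type (for INT8, $-128$ and $127$). The second expression is the matching dequantization.

$$
q = \operatorname{clamp}\!\left(\operatorname{round}\!\left(\frac{x}{s}\right) + z,\; q_{\min},\; q_{\max}\right),
\qquad
\hat{x} = s\,(q - z)
$$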
When deploying ML models, TVM is one of several options. Use this framework:
ML Deployment Decision Tree
═══════════════════════════════════════════════════════
"I have a trained model. How should I deploy it?"
│
┌───────┴────────┐
│ Target device? │
└───────┬────────┘
│
┌────────────┼────────────────┐
│ │ │
NVIDIA GPU CPU / ARM Microcontroller
│ │ │
▼ ▼ ▼
┌─────────┐ ┌──────────┐ ┌──────────┐
│ Torch │ │ ORT / │ │ µTVM │
│ compile │ │ TVM / │ │ (AOT) │
│ or │ │ torch │ │ │
│ TensorRT│ │ compile │ │ Only │
│ or TVM │ │ │ │ option! │
└────┬────┘ └────┬─────┘ └──────────┘
│ │
▼ ▼
Performance vs Effort tradeoff (see table below)
| Criterion | TVM | torch.compile | Triton | ONNX Runtime | TensorRT |
|---|---|---|---|---|---|
| Primary strength | Multi-target, auto-tuning | PyTorch-native, zero friction | Custom GPU kernels | Cross-platform inference | Max NVIDIA perf |
| Target devices | CPU, GPU, ARM, MCU, DSP | CPU, GPU (CUDA, ROCm) | GPU (NVIDIA primarily; AMD via ROCm) | CPU, GPU, NPU | NVIDIA GPU only |
| Dynamic shapes | Relax: ✅ Relay: limited | ✅ (torch.export) | Manual | ✅ | ✅ (with profiles) |
| Setup effort | High (compile TVM, tune) | Low (pip install) | Medium (write kernels) | Low (pip install) | Medium (TRT builder) |
| INT8 quantization | ✅ (QNN + calibration) | Via Torch AO | Manual | ✅ (QDQ nodes) | ✅ (PTQ + QAT) |
| Microcontrollers | ✅ (µTVM) | ✗ | ✗ | ✗ (too heavy) | ✗ |
| Custom operators | TE / TIR / BYOC | torch.library | Native (write them!) | Custom op API | Plugin API |
| Tuning | AutoTVM, MetaSchedule | Inductor + Triton | Manual | Graph optimizations | TRT builder |
| Latency (after tuning) | ★★★★☆ | ★★★☆☆ | ★★★★★ | ★★★☆☆ | ★★★★★ |
| Developer experience | ★★☆☆☆ | ★★★★★ | ★★★☆☆ | ★★★★☆ | ★★★☆☆ |
def choose_tool(target, model_type, team_expertise, constraints):
    """
    Practical decision guide.

    `target` is one target name or a collection of names (multi-target deployment).
    `constraints` is a collection of strings such as "latency" or "custom_kernel".
    """
    # Normalize: a bare string is a single target, not a sequence of characters.
    targets = {target} if isinstance(target, str) else set(target)

    # Rule 1: Microcontrollers → TVM (the only option among the tools compared here)
    if targets & {"cortex-m", "arduino", "zephyr"}:
        return "TVM (µTVM + AOT)"

    # Rule 2: NVIDIA GPU, latency-critical, standard model → TensorRT
    if targets == {"nvidia_gpu"} and model_type == "standard" and "latency" in constraints:
        return "TensorRT (or torch.compile with TRT backend)"

    # Rule 3: Custom GPU kernel needed → Triton
    if "custom_kernel" in constraints and targets == {"nvidia_gpu"}:
        return "Triton (write the kernel, wrap as torch op)"

    # Rule 4: Multi-target or ARM/RISC-V/Hexagon deployment → TVM
    # (ONNX Runtime is the alternative for the multi-target case)
    if len(targets) > 1 or targets & {"arm_cpu", "risc_v", "hexagon"}:
        return "TVM (cross-compile, auto-tune per target)"

    # Rule 5: PyTorch team, fast iteration, good enough perf → torch.compile
    if team_expertise == "pytorch" and "maximum_perf" not in constraints:
        return "torch.compile (Inductor backend)"

    # Rule 6: Cross-framework (ONNX model), easy deployment → ORT
    if model_type == "onnx":
        return "ONNX Runtime (+ EP for GPU acceleration)"

    # Default: start with torch.compile, profile, escalate if needed
    return "torch.compile → profile → escalate to TVM/TRT if bottlenecked"
Take 15 minutes to write short answers to these questions. There are no wrong answers — they're for building your mental model.
Why did TVM introduce three separate IRs (Relay, TE, TIR) instead of one? What are the trade-offs of this layered design, and how does Relax attempt to address them?
If you were designing a new ML compiler from scratch today, which TVM design decisions would you keep and which would you change? Consider what you know about MLIR (coming in Phase IV).
You need to deploy a ResNet-50 model on four different targets: NVIDIA A100, Intel Xeon, Raspberry Pi 4, and an STM32 Cortex-M7. Describe your compilation strategy for each.
Your auto-tuned TVM model is 10% slower than cuDNN on conv2d. Walk through your debugging process. What would you check first? When would you fall back to BYOC?
What concept from Weeks 5-6 was hardest for you to understand? Write a one-paragraph explanation of it as if teaching a colleague — teaching reveals gaps.
Predict three ways the TVM ecosystem will evolve over the next two years. Consider Relax adoption, LLM compilation, and the relationship with PyTorch.
Create a one-page (front and back) TVM cheat sheet covering:
| Section | Content |
|---|---|
| IR Summary | Relay, TE, TIR, Relax — one sentence each |
| Key APIs | relay.build, relay.frontend.from_*, meta_schedule.tune_tir, meta_schedule.relay_integration.tune_relay (names vary by TVM version) |
| Target Strings | LLVM (x86, ARM), CUDA, C (MCU) — with -mattr flags |
| Tuning | AutoTVM vs AutoScheduler vs MetaSchedule — when to use each |
| Deployment | Graph executor vs AOT vs µTVM — decision criteria |
| Quick Recipes | PyTorch → TVM → benchmark (5-line version) |
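For the Target Strings row, one way to start is a small lookup table. The sketch below uses the four devices from the ResNet-50 reflection question. The strings follow the usual TVM target-string shape (a kind followed by `-key=value` flags), but the specific `-mcpu`, `-mattr`, and `-arch` values are assumptions about typical hardware and should be checked against the actual device. The dictionary keys and the `parse_target` helper are hypothetical names, and nothing here imports TVM.

```python
# Assumed target strings; verify every flag against the real hardware before use.
TARGETS = {
    "nvidia_a100": "cuda -arch=sm_80",
    "intel_xeon": "llvm -mcpu=skylake-avx512",  # assumes an AVX-512 capable Xeon
    "raspberry_pi_4": "llvm -mtriple=aarch64-linux-gnu -mcpu=cortex-a72 -mattr=+neon",  # assumes a 64-bit OS
    "stm32_cortex_m7": "c -mcpu=cortex-m7",
}


def parse_target(target_string):
    """Split a target string into its kind and a dict of its -key=value flags."""
    kind, *flags = target_string.split()
    options = dict(flag.lstrip("-").split("=", 1) for flag in flags)
    return kind, options


if __name__ == "__main__":
    for name, target_string in TARGETS.items():
        kind, options = parse_target(target_string)
        print(f"{name:16s} kind={kind:5s} options={options}")
```

Keeping the flags in one reviewed place makes the first gotcha in the notes that follow (a wrong `-mattr` causing SIGILL on the device) easier to catch before deployment.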
From your experience in the exercises, document:
TVM Gotchas (personal notes)
═════════════════════════════
1. Target string must match actual hardware — wrong -mattr = SIGILL on device
2. AutoTVM templates ≠ MetaSchedule — can't mix tuning logs between them
3. relay.build() returns a Module — must call export_library() for deployment
4. QNN calibration needs representative data — random input ≠ real distribution
5. Cross-compilation requires matching toolchain (aarch64-linux-gnu-gcc)
6. RPC server and tracker must use matching TVM versions
7. USMP (memory planning) only works with AOT executor
8. ...add your own from the exercises...
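Gotcha 4 is easy to demonstrate without TVM. The sketch below is plain Python under simplifying assumptions: min/max calibration, a single per-tensor scale, and an INT8 range. It is not the QNN API, and the function names and sample values are hypothetical. It shows that a scale calibrated on inputs from the wrong range clips real activations badly.

```python
def calibrate(samples, qmin=-128, qmax=127):
    """Min/max calibration: choose scale and zero point so [min, max] maps onto [qmin, qmax]."""
    lo = min(min(samples), 0.0)  # the representable range must contain 0
    hi = max(max(samples), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # guard against an all-zero sample set
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Affine quantization with clamping; Python's round() rounds ties to even."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    return scale * (q - zero_point)


def max_roundtrip_error(values, scale, zero_point):
    """Largest |x - dequantize(quantize(x))| over the given values."""
    return max(
        abs(v - dequantize(quantize(v, scale, zero_point), scale, zero_point))
        for v in values
    )


if __name__ == "__main__":
    real_activations = [0.0, 1.5, 3.0, 6.0]  # e.g. the output range of a ReLU6 layer
    good = calibrate([0.0, 6.0])             # representative calibration data
    bad = calibrate([-1.0, 1.0])             # inputs drawn from the wrong range
    print("representative:", max_roundtrip_error(real_activations, *good))
    print("mismatched:    ", max_roundtrip_error(real_activations, *bad))
```

With representative data the round-trip error stays below one quantization step. With the mismatched range, every activation above roughly 1.0 saturates at the largest INT8 value, so the error is about as large as the activation itself.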
You've finished Phase III — the Apache TVM deep dive. Here's what you covered:
Phase III: Apache TVM Deep Dive — Summary
══════════════════════════════════════════
Week 5: TVM Foundations (Days 29-35)
✓ Day 29: Architecture overview (compiler vs runtime)
✓ Day 30: Relay IR (graph-level, type system, ops)
✓ Day 31: Relay optimization passes (FuseOps, FoldConstant, layout)
✓ Day 32: Tensor Expressions (compute declarations)
✓ Day 33: TIR & schedules (loop transformations)
✓ Day 34: Runtime & deployment (Module, RPC)
✓ Day 35: Mini-project (end-to-end MobileNetV2)
Week 6: TVM Tuning & Backends (Days 36-42)
✓ Day 36: AutoTVM & AutoScheduler (template vs template-free)
✓ Day 37: MetaSchedule (unified tuning framework)
✓ Day 38: BYOC — Bring Your Own Codegen (hybrid compilation)
✓ Day 39: Quantization (QNN dialect, calibration, INT8)
✓ Day 40: Edge devices (cross-compilation, µTVM, AOT, memory planning)
✓ Day 41: TVM Unity & Relax (next-gen IR, dynamic shapes)
✓ Day 42: Stop & Reflect #3 (this session)
Skills acquired:
• Compile any PyTorch/ONNX model through TVM
• Write custom TE compute + schedules
• Auto-tune for specific hardware targets
• Deploy to devices from GPUs to microcontrollers
• Understand quantization pipeline end-to-end
• Read and write Relax programs
Phase IV (Weeks 7–8) covers MLIR and the broader compiler ecosystem: MLIR dialects, Torch-MLIR, StableHLO, IREE, and how they relate to TVM. You'll see how many TVM concepts (IRs, passes, lowering, backends) map directly to MLIR's more general framework.
Before proceeding, verify:

- [ ] Can explain the TVM compilation pipeline from PyTorch import to target code
- [ ] Can write a TE compute declaration and apply basic schedules
- [ ] Understand the purpose of auto-tuning and how MetaSchedule works
- [ ] Know when to use Graph Executor vs AOT Executor vs µTVM
- [ ] Can describe what Relax improves over Relay
- [ ] Scored ≥ 7/10 on the self-check quiz above
Day 43 begins Phase IV with a deep dive into MLIR — the Multi-Level Intermediate Representation framework from Google/LLVM. You'll see how MLIR's dialect system generalizes TVM's IR stack, and why MLIR is becoming the foundation for the next generation of ML compilers (Torch-MLIR, StableHLO, IREE, and more).