
Day 42: Stop & Reflect #3

Phase III · Week 6 · Day 42 of 70 · 2.5 hours

"To know that you know what you know, and to know that you do not know what you do not know — that is true knowledge."



Why This Matters

You've spent two dense weeks (Days 29–41) inside Apache TVM — from architecture overview through Relay, TE, TIR, tuning, BYOC, quantization, edge deployment, and Relax. That's 13 days of interconnected material. Before moving to MLIR and the broader compiler ecosystem (Phase IV), you need to consolidate. This reflection session tests retention, builds connections between concepts, and equips you with a practical decision framework for choosing the right tool in real deployment scenarios.


1. Concept Map: The TVM Universe

Map every major concept from Weeks 5–6 and draw the connections:

                        TVM Compilation Stack — Full Picture
═══════════════════════════════════════════════════════════════════════════

  PyTorch / ONNX / TFLite
        │
        ▼ Frontend Import (Day 30)
  ┌─────────────────────────────────────────────────────────────────┐
  │                      Relay IR (Day 30-31)                       │
  │  • Functional graph: %x = nn.conv2d(%data, %weight)             │
  │  • Type/shape inference, pattern matching                       │
  │  • Optimization passes: FoldConstant, FuseOps, SimplifyInference│
  │  • QNN dialect for quantized models (Day 39)                    │
  └───────┬───────────────┬──────────────────────────┬──────────────┘
          │               │                          │
          ▼               ▼                          ▼
  ┌──────────────┐ ┌──────────────┐         ┌──────────────────┐
  │ TE (Day 32)  │ │ BYOC (Day 38)│         │ Relax (Day 41)   │
  │ Compute decl │ │ Partition to │         │ Next-gen IR      │
  │ A[i,j] =     │ │ external lib │         │ • Symbolic shapes│
  │  B[i,k]*C[k,j│ │ (cuDNN, DNNL,│         │ • Dataflow blocks│
  └──────┬───────┘ │  TensorRT)   │         │ • call_tir       │
         │         └──────────────┘         └─────────┬────────┘
         ▼                                            │
  ┌──────────────┐                                    │
  │ TIR (Day 33) │◀───────────────────────────────────┘
  │ Low-level IR │       (Relax calls TIR directly)
  │ • Loops      │
  │ • Buffers    │
  │ • Schedules  │
  └──────┬───────┘
         │
         ▼ Schedule Transformations
  ┌───────────────────────────────────────────────────────────┐
  │                 Tuning Layer (Days 36-37)                  │
  │                                                            │
  │  AutoTVM        AutoScheduler      MetaSchedule            │
  │  (template)     (Ansor, auto)      (unified, TIR-native)   │
  │  Day 36         Day 36             Day 37                  │
  │                                                            │
  │  Cost Model: XGBoost → predicts runtime from TIR features  │
  │  Search: random/GA/evolutionary over schedule knobs        │
  └──────────────────────────┬────────────────────────────────┘
                             │
                             ▼ Code Generation
  ┌───────────────────────────────────────────────────────────┐
  │                  Target Backends                           │
  │                                                            │
  │  LLVM → x86, ARM, RISC-V    CUDA/ROCm → GPU               │
  │  C codegen → microcontrollers  Hexagon → Qualcomm DSP      │
  │  Vulkan/Metal/OpenCL → mobile GPU                          │
  └──────────────────────────┬────────────────────────────────┘
                             │
                             ▼ Deployment (Days 34, 40)
  ┌───────────────────────────────────────────────────────────┐
  │                    Runtime Layer                            │
  │                                                            │
  │  Graph Executor     AOT Executor      µTVM (CRT)           │
  │  (~300 KB, JSON)    (~10 KB, C)       (bare-metal)         │
  │  Linux/Android      Linux/RTOS        Arduino/Zephyr       │
  │  Raspberry Pi       Jetson            Cortex-M4            │
  │                                                            │
  │  RPC: remote tune + benchmark on real hardware (Day 40)    │
  │  Quantization: INT8 via QNN calibration (Day 39)           │
  └───────────────────────────────────────────────────────────┘
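The TE and TIR boxes in the map separate what is computed from how its loops are arranged. Here is a minimal pure-Python sketch of that split. It is not TVM API: the function names and the tile size are made up for illustration. Both functions compute the same matrix product, and only the loop structure differs, which is the kind of change a schedule primitive such as tiling makes.

```python
# Pure-Python illustration of "compute vs. schedule"; this is NOT TVM API.

def matmul_naive(A, B):
    """The computation with the default i, j, k loop order."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                C[i][j] += A[i][k] * B[k][j]
    return C


def matmul_tiled(A, B, tile=2):
    """Same computation, with the i and j loops split into tile x tile blocks."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for io in range(0, n, tile):          # outer loop over row blocks
        for jo in range(0, p, tile):      # outer loop over column blocks
            for i in range(io, min(io + tile, n)):
                for j in range(jo, min(jo + tile, p)):
                    for k in range(m):
                        C[i][j] += A[i][k] * B[k][j]
    return C


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8], [7, 6], [5, 4]]
result = matmul_tiled(A, B, tile=2)
print(result == matmul_naive(A, B))
```

The tiled version changes memory access order but not the result, which is why a tuner is free to search over such transformations.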

Connection Exercise

For each arrow in the concept map, write a one-sentence explanation of the transformation:

| From → To | What Happens |
|---|---|
| PyTorch → Relay | `relay.frontend.from_pytorch` traces the model and converts ops to Relay operators |
| Relay → TE | The FuseOps pass groups operators; each fused group gets a TE compute declaration |
| TE → TIR | `te.create_prim_func()` turns a TE compute declaration into a loop-level TIR function (in the classic flow, `tvm.lower` lowers a TE schedule instead) |
| TIR → LLVM/CUDA | TIR codegen emits LLVM IR (for CPUs) or CUDA kernels (for GPUs) |
| Relay → BYOC | Pattern matching annotates subgraphs; partitioning extracts them for external libs |
| Relax → TIR | `call_tir` directly references TIR functions (no lowering pass needed) |
| TIR → Tuning | MetaSchedule explores schedule knobs; the cost model predicts; the runner measures |
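The last row (explore, predict, measure) can be sketched as a toy search loop. This is a simplified illustration in plain Python, not the MetaSchedule API: `predict` and `measure` are hypothetical stand-ins for the cost model and the hardware runner, and the tile-size grid is an invented search space. Real tuners iterate and retrain the cost model on measured results; this sketch does a single round.

```python
# Toy cost-model-guided search; all names and numbers here are hypothetical.

# Invented search space: (tile_i, tile_j) pairs standing in for schedule knobs.
CANDIDATES = [(ti, tj) for ti in (1, 2, 4, 8, 16, 32) for tj in (1, 2, 4, 8, 16, 32)]


def measure(candidate):
    """Stand-in for the runner: pretend latency is lowest at 8x16 tiles."""
    ti, tj = candidate
    return abs(ti - 8) + abs(tj - 16) + 1.0


def predict(candidate):
    """Stand-in for the cost model: a cheap, imperfect estimate of measure()."""
    ti, tj = candidate
    return abs(ti - 8) + 0.5 * abs(tj - 16)


def tune(candidates, top_k=4):
    """Rank every candidate with the cost model, then measure only the top_k."""
    ranked = sorted(candidates, key=predict)
    measured = {c: measure(c) for c in ranked[:top_k]}
    best = min(measured, key=measured.get)
    return best, len(measured)


best, num_measured = tune(CANDIDATES)
print(best, num_measured, "measured out of", len(CANDIDATES))
```

The point of the design is the ratio: every candidate gets a cheap prediction, but only a handful pay the cost of a real measurement.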

2. Self-Check Quiz

Answer each question, then check against the reference answers below. Score yourself honestly.

Questions

Q1. What are the three main IRs in the classic TVM stack, ordered from highest to lowest abstraction?

Q2. In TE, what is the difference between a compute declaration and a schedule? Give a one-line example of each.

Q3. Name three Relay optimization passes and describe what each does in one sentence.

Q4. What problem does operator fusion solve, and what is the primary performance benefit?

Q5. In MetaSchedule, what role does the cost model play, and why is it necessary?

Q6. Explain the BYOC workflow in three steps.

Q7. What is the formula for converting a floating-point value $x$ to a quantized integer $q$ using scale $s$ and zero point $z$?

Q8. What is the key difference between the Graph Executor and the AOT Executor? When would you choose each?

Q9. In Relax, what does R.call_tir(func, args, out_sinfo) do that Relay's lowering cannot?

Q10. Why are first-class dynamic shapes important for LLM serving, and how does Relax handle them differently from Relay?

Reference Answers

**A1.** Relay (graph-level, functional) → TE (tensor expression, compute + schedule) → TIR (low-level, loops and buffers).

**A2.** Compute declares *what* to calculate: `C[i,j] = te.sum(A[i,k] * B[k,j], axis=k)`. Schedule declares *how*: `s[C].tile(i, j, 32, 32)`, which tiles the loop nest into 32×32 blocks.

**A3.** (1) `FuseOps`: merges elementwise/broadcast operators into fused kernels to eliminate intermediate memory. (2) `FoldConstant`: pre-computes constant expressions at compile time. (3) `SimplifyInference`: replaces batch norm with scale+bias during inference.

**A4.** Fusion eliminates **memory round-trips**. Without fusion, each operator writes its output to DRAM and the next reads it back. Fused operators pass data through registers/L1 cache, reducing memory bandwidth by orders of magnitude.

**A5.** The cost model (XGBoost-based) **predicts kernel latency from TIR features** without running the kernel. This is necessary because measuring every candidate schedule on real hardware would take days: the cost model prunes the search space by 100–1000×, and only the top candidates are actually measured.

**A6.** (1) **Pattern match**: identify subgraphs the external backend can handle (e.g., conv2d+bias+relu for cuDNN). (2) **Partition**: extract matched subgraphs into tagged composite functions. (3) **Unified runtime**: GraphExecutor dispatches between TVM-compiled and external-backend regions.

**A7.** $q = \text{round}\left(\frac{x}{s}\right) + z$, where $s$ is the scale (float) and $z$ is the zero point (integer).

**A8.** Graph Executor uses a JSON graph description + interpreter loop at runtime (~300 KB overhead, needs `malloc`). AOT Executor bakes the execution order into a single C function (~10 KB overhead, no dynamic allocation needed). Choose Graph for Linux devices (easier debugging); choose AOT for microcontrollers (bare-metal, minimal RAM).

**A9.** `call_tir` creates a **direct reference** from a Relax function to a TIR function within the same IRModule. This means Relax and TIR can be co-optimized and transformed together. Relay's lowering is a one-way door: once lowered to TIR, graph-level context is lost.

**A10.** LLMs have variable batch sizes, sequence lengths, and KV cache sizes at every inference step. Relay uses `Any` as a placeholder with limited compiler reasoning. Relax uses **symbolic shape variables** (e.g., `("batch", "seq_len")`) that support arithmetic ($\text{new\_len} = \text{past\_len} + \text{seq\_len}$), enabling the compiler to pre-allocate buffers, specialize kernels, and optimize memory layout even with dynamic shapes.
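As a worked instance of the formula in A7, here is a minimal sketch in plain Python. The clamp to the INT8 range and the `dequantize` inverse ($x \approx s\,(q - z)$) are standard companions to the formula but are assumptions added here, as are the example scale and zero point.

```python
# Affine quantization sketch; scale and zero_point values are hypothetical.

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """q = round(x / s) + z, then clamped to the INT8 range (clamp is an added assumption)."""
    # Note: Python's round() rounds exact ties to the nearest even integer.
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Approximate inverse: x_hat = s * (q - z)."""
    return scale * (q - zero_point)


scale, zero_point = 0.25, 10            # stand-ins for calibration results
q = quantize(1.3, scale, zero_point)    # 1.3 / 0.25 = 5.2, rounds to 5, plus 10
x_hat = dequantize(q, scale, zero_point)
print(q, x_hat, abs(1.3 - x_hat))
```

The round trip loses at most half a scale step for in-range values, which is why calibration (choosing $s$ and $z$ from representative data) matters.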

Scoring

| Score | Assessment | Action |
|---|---|---|
| 9–10 | Excellent: solid grasp of the full TVM stack | Proceed to Phase IV |
| 7–8 | Good: minor gaps | Re-read the missed topics, then proceed |
| 5–6 | Fair: several gaps | Re-do the hands-on exercises for missed areas |
| ≤4 | Needs review | Revisit Days 29–41 before continuing |

3. Decision Framework: Choosing Your Tool

When deploying ML models, TVM is one of several options. Use this framework:

                  ML Deployment Decision Tree
═══════════════════════════════════════════════════════

  "I have a trained model. How should I deploy it?"
                    │
            ┌───────┴────────┐
            │ Target device? │
            └───────┬────────┘
                    │
       ┌────────────┼────────────────┐
       │            │                │
   NVIDIA GPU   CPU / ARM      Microcontroller
       │            │                │
       ▼            ▼                ▼
  ┌─────────┐ ┌──────────┐    ┌──────────┐
  │ Torch   │ │ ORT /    │    │ µTVM     │
  │ compile │ │ TVM /    │    │ (AOT)    │
  │ or      │ │ torch    │    │          │
  │ TensorRT│ │ compile  │    │ Only     │
  │ or TVM  │ │          │    │ option!  │
  └────┬────┘ └────┬─────┘    └──────────┘
       │           │
       ▼           ▼
  Performance vs Effort tradeoff (see table below)

Detailed Comparison Matrix

| Criterion | TVM | torch.compile | Triton | ONNX Runtime | TensorRT |
|---|---|---|---|---|---|
| Primary strength | Multi-target, auto-tuning | PyTorch-native, zero friction | Custom GPU kernels | Cross-platform inference | Max NVIDIA perf |
| Target devices | CPU, GPU, ARM, MCU, DSP | CPU, GPU (CUDA, ROCm) | NVIDIA GPU only | CPU, GPU, NPU | NVIDIA GPU only |
| Dynamic shapes | Relax: ✅; Relay: limited | ✅ (torch.export) | Manual | | ✅ (with profiles) |
| Setup effort | High (compile TVM, tune) | Low (pip install) | Medium (write kernels) | Low (pip install) | Medium (TRT builder) |
| INT8 quantization | ✅ (QNN + calibration) | Via Torch AO | Manual | ✅ (QDQ nodes) | ✅ (PTQ + QAT) |
| Microcontrollers | ✅ (µTVM) | ✗ (too heavy) | | | |
| Custom operators | TE / TIR / BYOC | torch.library | Native (write them!) | Custom op API | Plugin API |
| Tuning | AutoTVM, MetaSchedule | Inductor + Triton | Manual | Graph optimizations | TRT builder |
| Latency (after tuning) | ★★★★☆ | ★★★☆☆ | ★★★★★ | ★★★☆☆ | ★★★★★ |
| Developer experience | ★★☆☆☆ | ★★★★★ | ★★★☆☆ | ★★★★☆ | ★★★☆☆ |

Decision Heuristics

def choose_tool(target, model_type, team_expertise, constraints):
    """
    Practical decision guide.

    target: one target name (str) or a collection of names for
    multi-target deployment. constraints: a collection of string tags.
    """
    # Normalize first: len() of a lone string such as "nvidia_gpu" is
    # always > 1, which would make the multi-target rule fire every time.
    targets = {target} if isinstance(target, str) else set(target)

    # Rule 1: Microcontrollers only → TVM (the only option among the tools compared here)
    if targets and targets <= {"cortex-m", "arduino", "zephyr"}:
        return "TVM (µTVM + AOT)"

    # Rule 2: NVIDIA GPU, latency-critical, standard model → TensorRT
    if targets == {"nvidia_gpu"} and model_type == "standard" and "latency" in constraints:
        return "TensorRT (or torch.compile with TRT backend)"

    # Rule 3: Custom GPU kernel needed → Triton
    if "custom_kernel" in constraints and targets == {"nvidia_gpu"}:
        return "Triton (write the kernel, wrap as torch op)"

    # Rule 4: Multi-target deployment, or ARM / RISC-V / Hexagon → TVM
    if len(targets) > 1 or targets & {"arm_cpu", "risc_v", "hexagon"}:
        return "TVM (cross-compile, auto-tune per target)"

    # Rule 5: PyTorch team, fast iteration, good enough perf → torch.compile
    if team_expertise == "pytorch" and "maximum_perf" not in constraints:
        return "torch.compile (Inductor backend)"

    # Rule 6: Cross-framework (ONNX model), easy deployment → ORT
    if model_type == "onnx":
        return "ONNX Runtime (+ EP for GPU acceleration)"

    # Default: start with torch.compile, profile, escalate if needed
    return "torch.compile → profile → escalate to TVM/TRT if bottlenecked"

4. Reflection Prompts

Take 15 minutes to write short answers to these questions. There are no wrong answers — they're for building your mental model.

Architecture & Design

  1. Why did TVM introduce three separate IRs (Relay, TE, TIR) instead of one? What are the trade-offs of this layered design, and how does Relax attempt to address them?

  2. If you were designing a new ML compiler from scratch today, which TVM design decisions would you keep and which would you change? Consider what you know about MLIR (coming in Phase IV).

Practical Engineering

  1. You need to deploy a ResNet-50 model on four different targets: NVIDIA A100, Intel Xeon, Raspberry Pi 4, and an STM32 Cortex-M7. Describe your compilation strategy for each.

  2. Your auto-tuned TVM model is 10% slower than cuDNN on conv2d. Walk through your debugging process. What would you check first? When would you fall back to BYOC?

Connections & Gaps

  1. What concept from Weeks 5-6 was hardest for you to understand? Write a one-paragraph explanation of it as if teaching a colleague — teaching reveals gaps.

  2. Predict three ways the TVM ecosystem will evolve over the next two years. Consider Relax adoption, LLM compilation, and the relationship with PyTorch.


5. Knowledge Consolidation Exercise

Build a Cheat Sheet

Create a one-page (front and back) TVM cheat sheet covering:

| Section | Content |
|---|---|
| IR Summary | Relay, TE, TIR, Relax: one sentence each |
| Key APIs | `relay.build`, `relay.frontend.from_*`, `meta_schedule.relay_integration.tune_relay` |
| Target Strings | LLVM (x86, ARM), CUDA, C (MCU), with `-mattr` flags |
| Tuning | AutoTVM vs AutoScheduler vs MetaSchedule: when to use each |
| Deployment | Graph executor vs AOT vs µTVM: decision criteria |
| Quick Recipes | PyTorch → TVM → benchmark (5-line version) |

Compile a "Gotchas" List

From your experience in the exercises, document:

TVM Gotchas (personal notes)
═════════════════════════════
1. Target string must match actual hardware — wrong -mattr = SIGILL on device
2. AutoTVM templates ≠ MetaSchedule — can't mix tuning logs between them  
3. relay.build() returns a Module — must call export_library() for deployment
4. QNN calibration needs representative data — random input ≠ real distribution
5. Cross-compilation requires matching toolchain (aarch64-linux-gnu-gcc)
6. RPC server and tracker must use matching TVM versions
7. USMP (memory planning) only works with AOT executor
8. ...add your own from the exercises...

6. Progress Check: Phase III Complete

You've finished Phase III — the Apache TVM deep dive. Here's what you covered:

Phase III: Apache TVM Deep Dive — Summary
══════════════════════════════════════════

Week 5: TVM Foundations (Days 29-35)
  ✓ Day 29: Architecture overview (compiler vs runtime)
  ✓ Day 30: Relay IR (graph-level, type system, ops)
  ✓ Day 31: Relay optimization passes (FuseOps, FoldConstant, layout)
  ✓ Day 32: Tensor Expressions (compute declarations)
  ✓ Day 33: TIR & schedules (loop transformations)
  ✓ Day 34: Runtime & deployment (Module, RPC)
  ✓ Day 35: Mini-project (end-to-end MobileNetV2)

Week 6: TVM Tuning & Backends (Days 36-42)
  ✓ Day 36: AutoTVM & AutoScheduler (template vs template-free)
  ✓ Day 37: MetaSchedule (unified tuning framework)
  ✓ Day 38: BYOC — Bring Your Own Codegen (hybrid compilation)
  ✓ Day 39: Quantization (QNN dialect, calibration, INT8)
  ✓ Day 40: Edge devices (cross-compilation, µTVM, AOT, memory planning)
  ✓ Day 41: TVM Unity & Relax (next-gen IR, dynamic shapes)
  ✓ Day 42: Stop & Reflect #3 (this session)

Skills acquired:
  • Compile any PyTorch/ONNX model through TVM
  • Write custom TE compute + schedules
  • Auto-tune for specific hardware targets
  • Deploy to devices from GPUs to microcontrollers
  • Understand quantization pipeline end-to-end
  • Read and write Relax programs

Readiness for Phase IV

Phase IV (Weeks 7–8) covers MLIR and the broader compiler ecosystem: MLIR dialects, Torch-MLIR, StableHLO, IREE, and how they relate to TVM. You'll see how many TVM concepts (IRs, passes, lowering, backends) map directly to MLIR's more general framework.

Before proceeding, verify:

- [ ] Can explain the TVM compilation pipeline from PyTorch import to target code
- [ ] Can write a TE compute declaration and apply basic schedules
- [ ] Understand the purpose of auto-tuning and how MetaSchedule works
- [ ] Know when to use Graph Executor vs AOT Executor vs µTVM
- [ ] Can describe what Relax improves over Relay
- [ ] Scored ≥ 7/10 on the self-check quiz above


Key Takeaways

  1. Concept maps reveal connections — the TVM stack is not a linear pipeline but a graph with multiple paths (Relay→TIR, Relay→BYOC, Relax→TIR)
  2. Self-assessment prevents false confidence — scoring below 7/10 means gaps that will compound in Phase IV
  3. No single tool wins everywhere — TVM excels at multi-target and edge; torch.compile wins on developer experience; Triton/TensorRT win on raw NVIDIA GPU performance
  4. The decision framework is situational — target device, model type, team expertise, and deployment constraints all factor in
  5. Phase III knowledge maps directly to Phase IV — MLIR's dialects ≈ TVM's IRs, MLIR's passes ≈ TVM's passes, MLIR's lowering ≈ TVM's codegen


Tomorrow: MLIR for ML

Day 43 begins Phase IV with a deep dive into MLIR — the Multi-Level Intermediate Representation framework from Google/LLVM. You'll see how MLIR's dialect system generalizes TVM's IR stack, and why MLIR is becoming the foundation for the next generation of ML compilers (Torch-MLIR, StableHLO, IREE, and more).
