Phase III · Week 7 · Day 48 of 70 · 2.5 hours
"A compiler that's fast but wrong is worse than no compiler at all — you won't even know when it's lying to you."
ML compilers transform high-level models into low-level code through dozens of passes — operator fusion, constant folding, quantization, tiling, vectorization, memory planning. Each pass can introduce bugs: a fused kernel may silently drop a channel, a schedule may produce an out-of-bounds access that corrupts memory, or a quantization step may introduce catastrophic numerical drift. Unlike application bugs that crash visibly, compiler bugs produce wrong numbers silently. Your model runs, reports 92% accuracy, but it's actually 85% — and you don't discover the gap until production. This lesson covers the testing strategies that prevent this: numerical comparison, fuzzing, differential testing, and TVM's specific testing infrastructure.
ML Compiler Bug Taxonomy
════════════════════════
Type 1: Crash Bugs (easy to detect)
├── Segfault in generated code
├── Shape mismatch assertion
└── Out-of-memory during compilation
Detection: automatic (program crashes)
Type 2: Silent Numerical Bugs (DANGEROUS)
├── Fused kernel produces slightly wrong output
├── Quantization rounds in wrong direction
├── Memory layout mismatch (NCHW vs NHWC swap)
└── Reduction over wrong axis
Detection: requires explicit numerical validation
Type 3: Performance Bugs (subtle)
├── Schedule produces correct but slow code
├── Missed fusion opportunity
└── Suboptimal memory access pattern
Detection: requires benchmarking against baseline
The danger spectrum:
┌──────────────────────────────────────────────────────┐
│  Easy to detect                      Hard to detect  │
│  ←────────────────────────────────────────────────→  │
│  Crash     Shape error     1% drift     0.01% drift  │
│    ✓            ✓             ✗             ✗✗       │
│                                                      │
│  A 0.01% per-layer drift across 100 layers =         │
│  1 − 0.9999^100 ≈ 1% total error, enough to          │
│  flip classification decisions silently              │
└──────────────────────────────────────────────────────┘
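The compounding arithmetic in the box can be checked directly. This is a minimal sketch that assumes each layer independently scales the signal by (1 − drift); the variable names are illustrative.

```python
# Compounded relative drift across layers, assuming each layer
# multiplies the signal by (1 - per_layer_drift).
per_layer_drift = 1e-4   # 0.01% per layer
num_layers = 100

total_drift = 1 - (1 - per_layer_drift) ** num_layers
linear_estimate = per_layer_drift * num_layers  # first-order approximation

print(f"compounded drift: {total_drift:.4%}")
print(f"linear estimate:  {linear_estimate:.4%}")
```

For small per-layer drift the compounded value sits just below the linear estimate, so "drift × layers" is a serviceable mental shortcut.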
Sources of Numerical Divergence
═══════════════════════════════
1. Floating-point non-associativity:
   (a + b) + c ≠ a + (b + c) in IEEE 754
   Example in FP32 with a = 1e8, b = -1e8, c = 1.0:
   (a + b) + c = 0.0 + 1.0 = 1.0       ← CORRECT
   a + (b + c) = 1e8 + (-1e8) = 0.0    ← WRONG (c is absorbed when added to b)
2. Reduction order changes after tiling/parallelization:
   Sequential: s = x[0] + x[1] + x[2] + ... + x[N-1]
   Tiled:      s = (x[0]+...+x[31]) + (x[32]+...+x[63]) + ...
   → Different rounding at each partial sum
3. FMA (fused multiply-add) availability:
   With FMA: a*b + c (single rounding)
   Without:  tmp = a*b (round), tmp + c (round again)
   → Results typically differ by about 1 ULP per operation, and by more when a*b and c nearly cancel
4. Mixed precision accumulation:
   FP16 input × FP16 input → FP32 accumulator → FP16 output
   vs
   FP16 input × FP16 input → FP16 accumulator → FP16 output
   → Catastrophic difference for large reductions
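Sources 1 and 4 can be reproduced without any compiler. The sketch below emulates FP32 and FP16 rounding with Python's `struct` module; the helpers `f32` and `f16` are illustrative names, not part of any library, and rounding after every operation is an assumption about how the hardware would behave.

```python
import struct

def f32(x):
    """Round a Python float to the nearest IEEE 754 binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def f16(x):
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("e", struct.pack("e", x))[0]

# Source 1: grouping changes the FP32 result.
a, b, c = f32(1e8), f32(-1e8), f32(1.0)
left_grouping = f32(f32(a + b) + c)    # (a + b) + c
right_grouping = f32(a + f32(b + c))   # a + (b + c): c is lost when added to b
print(left_grouping, right_grouping)   # 1.0 0.0

# Source 4: summing 4096 ones with an FP16 vs an FP32 accumulator.
acc16 = 0.0
acc32 = 0.0
for _ in range(4096):
    acc16 = f16(acc16 + 1.0)   # stalls once the FP16 spacing reaches 2
    acc32 = f32(acc32 + 1.0)
print(acc16, acc32)            # 2048.0 4096.0
```

The FP16 accumulator stops growing at 2048 because 2049 is not representable in FP16 and rounds back to 2048, which is why the accumulator dtype matters more than the input dtype for long reductions.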
To first order, the relative error from these sources is bounded by:
$$\epsilon_{\text{total}} \leq n \cdot \epsilon_{\text{machine}} \cdot \kappa(A)$$
where $n$ is the number of operations, $\epsilon_{\text{machine}}$ is machine epsilon ($2^{-23} \approx 1.19 \times 10^{-7}$ for FP32, $2^{-10} \approx 9.77 \times 10^{-4}$ for FP16), and $\kappa(A)$ is the condition number of the computation.
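Plugging numbers into the bound shows how quickly it degrades for FP16. This is a minimal sketch that assumes machine epsilon means the spacing between 1.0 and the next representable value (2^-23 for FP32, 2^-10 for FP16) and a well-conditioned computation (κ = 1); `error_bound` is an illustrative helper.

```python
def error_bound(n_ops, eps, cond=1.0):
    """First-order worst-case relative error: n * eps * kappa."""
    return n_ops * eps * cond

EPS_FP32 = 2.0 ** -23   # assumed convention: spacing just above 1.0
EPS_FP16 = 2.0 ** -10

n = 4096  # e.g. a reduction over K = 4096 terms
bound_fp32 = error_bound(n, EPS_FP32)
bound_fp16 = error_bound(n, EPS_FP16)

print(f"FP32 bound: {bound_fp32:.2e}")   # 2^-11, about 4.88e-04
print(f"FP16 bound: {bound_fp16:.2e}")   # 4.0, i.e. no guarantee at all
```

A bound above 1 means the worst case guarantees nothing about the result. Observed errors are usually far below this worst case, but the gap between the two dtypes is what drives the tolerance choices later in the lesson.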
The allclose Function
The standard numerical comparison checks:
$$|a - b| \leq \text{atol} + \text{rtol} \cdot |b|$$
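One consequence of this formula is that the check is asymmetric: the relative term scales with |b| only, so it matters which argument is treated as the reference. The scalar sketch below illustrates this; `is_close` is a hypothetical helper that mirrors the inequality, not a NumPy function.

```python
def is_close(a, b, atol, rtol):
    """Scalar version of the check: |a - b| <= atol + rtol * |b|."""
    return abs(a - b) <= atol + rtol * abs(b)

# Same pair of values, same tolerances, different reference argument.
as_actual_vs_ref = is_close(100.0, 110.0, atol=0.0, rtol=0.095)  # budget 10.45
swapped = is_close(110.0, 100.0, atol=0.0, rtol=0.095)           # budget 9.5

print(as_actual_vs_ref, swapped)   # True False
```

Passing the trusted reference as the second argument keeps the tolerance budget tied to the values you believe.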
import numpy as np

def smart_allclose(actual, expected, dtype="float32"):
    """Choose tolerances based on dtype, then compare with np.allclose."""
    tolerances = {
        # (atol, rtol)
        "float64": (1e-12, 1e-10),
        "float32": (1e-5, 1e-5),
        "float16": (1e-2, 1e-2),
        "bfloat16": (1e-1, 1e-1),  # BF16 has only 7-bit mantissa
        "int8": (1, 0),            # Integer: exact or ±1
        "int32": (0, 0),           # Integer: must be exact
    }
    # Unknown dtypes fall back to the float32 tolerances
    atol, rtol = tolerances.get(dtype, (1e-5, 1e-5))
    return np.allclose(actual, expected, atol=atol, rtol=rtol)
Tolerance Decision Matrix
═════════════════════════
Operation          │ float32 atol │ float16 atol │ Why
───────────────────┼──────────────┼──────────────┼──────────────────
Element-wise (relu)│ 1e-7         │ 1e-3         │ No accumulation
MatMul (small K)   │ 1e-5         │ 5e-2         │ K additions
MatMul (K=4096)    │ 1e-4         │ 5e-1         │ Many additions
Convolution        │ 1e-4         │ 5e-2         │ K*R*S additions
BatchNorm          │ 1e-4         │ 1e-1         │ Variance computation
Softmax            │ 1e-5         │ 5e-2         │ exp + reduce
LayerNorm          │ 1e-4         │ 1e-1         │ Mean + variance
Full model (e2e)   │ 1e-3         │ 5e-1         │ Error compounds
Rule of thumb for FP32:
atol ≈ sqrt(K) × 1e-7 where K = number of accumulated terms
For FP16 with FP32 accumulation:
atol ≈ sqrt(K) × 1e-4
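The rule of thumb is easy to turn into a helper. This is a sketch only: `rule_of_thumb_atol` and its `per_term` parameter are illustrative names, and the result assumes inputs of roughly unit magnitude.

```python
import math

def rule_of_thumb_atol(k, per_term=1e-7):
    """atol ≈ sqrt(K) × per-term error, for K accumulated terms."""
    return math.sqrt(k) * per_term

for k in (64, 256, 4096):
    fp32_atol = rule_of_thumb_atol(k)                 # FP32 throughout
    fp16_atol = rule_of_thumb_atol(k, per_term=1e-4)  # FP16 with FP32 accumulation
    print(f"K={k:5d}  fp32 atol={fp32_atol:.1e}  fp16 atol={fp16_atol:.1e}")
```

The values in the decision matrix are more conservative than this estimate, which leaves headroom for inputs larger than unit scale.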
# PITFALL 1: Using default tolerances for all dtypes
# BAD:
np.testing.assert_allclose(fp16_result, reference) # atol=0, rtol=1e-7
# → Almost always fails for FP16!
# GOOD:
np.testing.assert_allclose(fp16_result, reference, atol=1e-2, rtol=1e-2)
# PITFALL 2: Comparing against wrong reference
# BAD: comparing TVM output against PyTorch output
# (both are approximate — which is "correct"?)
# GOOD: compare against high-precision reference
ref_fp64 = compute_reference(inputs.astype("float64")).astype("float32")
# PITFALL 3: Ignoring output scale
# BAD: fixed atol for outputs of different magnitudes
# Consider: output values range from 1e-6 to 1e6
np.testing.assert_allclose(result, ref, atol=1e-5)
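# → Too loose for small values (1e-5 is 10x larger than 1e-6, so errors there are hidden)
#   and too strict for large values (only the default rtol=1e-7, about 1 FP32 ULP, is left)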
# GOOD: use relative tolerance + small absolute tolerance
np.testing.assert_allclose(result, ref, rtol=1e-5, atol=1e-7)
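Pitfall 3 can be made concrete with two scalar cases. The sketch below re-implements the tolerance inequality in plain Python (`within_tol` is a hypothetical helper) and uses made-up values: a tiny output that is wrong by 100%, and a large output that is off by two FP32 ULPs.

```python
def within_tol(result, ref, atol, rtol):
    """Scalar form of the assert_allclose criterion."""
    return abs(result - ref) <= atol + rtol * abs(ref)

# Case A: small output, 100% relative error (a real bug).
small_ref, small_bad = 1e-6, 2e-6
# Case B: large output, off by 0.125 (2 ULPs at 1e6 in FP32, benign).
large_ref, large_ok = 1e6, 1e6 + 0.125

# Fixed atol with NumPy's default rtol=1e-7
bug_hidden = within_tol(small_bad, small_ref, atol=1e-5, rtol=1e-7)
benign_rejected = not within_tol(large_ok, large_ref, atol=1e-5, rtol=1e-7)

# Relative tolerance plus a small absolute floor
bug_caught = not within_tol(small_bad, small_ref, atol=1e-7, rtol=1e-5)
benign_accepted = within_tol(large_ok, large_ref, atol=1e-7, rtol=1e-5)

print(bug_hidden, benign_rejected, bug_caught, benign_accepted)
```

All four flags come out True: the fixed-atol setting hides the real bug and rejects the benign rounding difference, while the relative setting does the opposite.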
ML Compiler Testing Pyramid
════════════════════════════
              ┌──────────────┐
              │ Model-Level  │   Full ResNet/BERT: slow but
              │    (E2E)     │   catches integration bugs
              └──────┬───────┘
                     │
           ┌─────────▼──────────┐
           │   Operator-Level   │   Single conv2d/matmul:
           │   (Integration)    │   catches numerical bugs
           └─────────┬──────────┘
                     │
      ┌──────────────▼──────────────┐
      │      Pass-Level (Unit)      │   Single pass on toy IR:
      │                             │   catches transform bugs
      └──────────────┬──────────────┘
                     │
  ┌──────────────────▼──────────────────┐
  │       IR Construction (Unit)        │   Build/validate IR nodes:
  │                                     │   catches API/type bugs
  └─────────────────────────────────────┘
Coverage target per level:
• IR construction: >90% (fast, cheap)
• Pass-level: >80% per pass
• Operator-level: all ops × 2+ dtypes
• Model-level: top-10 models (ResNet, BERT, GPT-2, etc.)
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

def test_fuse_ops_preserves_semantics():
    """FuseOps must not change computation results."""
    # Build a graph: conv2d → relu → conv2d → relu
    data = relay.var("data", shape=(1, 3, 32, 32), dtype="float32")
    w1 = relay.var("w1", shape=(16, 3, 3, 3), dtype="float32")
    w2 = relay.var("w2", shape=(32, 16, 3, 3), dtype="float32")
    conv1 = relay.nn.conv2d(data, w1, padding=(1, 1))
    relu1 = relay.nn.relu(conv1)
    conv2 = relay.nn.conv2d(relu1, w2, padding=(1, 1))
    relu2 = relay.nn.relu(conv2)
    mod_before = tvm.IRModule.from_expr(relu2)
    # FuseOps relies on checked types, so run type inference first
    mod_before = relay.transform.InferType()(mod_before)
    # Run FuseOps
    mod_after = relay.transform.FuseOps(fuse_opt_level=2)(mod_before)

    # Compile and run both versions on the same inputs
    np_data = np.random.uniform(-1, 1, (1, 3, 32, 32)).astype("float32")
    np_w1 = np.random.uniform(-0.1, 0.1, (16, 3, 3, 3)).astype("float32")
    np_w2 = np.random.uniform(-0.1, 0.1, (32, 16, 3, 3)).astype("float32")

    def run_mod(mod):
        with tvm.transform.PassContext(opt_level=0):  # no extra opts
            lib = relay.build(mod, target="llvm")
        rt = graph_executor.GraphModule(lib["default"](tvm.cpu()))
        rt.set_input("data", np_data)
        rt.set_input("w1", np_w1)
        rt.set_input("w2", np_w2)
        rt.run()
        return rt.get_output(0).numpy()

    out_before = run_mod(mod_before)
    out_after = run_mod(mod_after)
    # Must be bitwise identical (same dtype, same target)
    np.testing.assert_array_equal(out_before, out_after)
import numpy as np
import tvm
import tvm.testing
from tvm import relay
from tvm.contrib import graph_executor

@tvm.testing.parametrize_targets("llvm", "cuda")
def test_matmul_against_numpy(target, dev):
    """Test matmul correctness across backends against a NumPy reference."""
    M, K, N = 128, 256, 64
    a = relay.var("a", shape=(M, K), dtype="float32")
    b = relay.var("b", shape=(K, N), dtype="float32")
    # dense computes a @ weight^T, so pass b transposed to get a @ b
    out = relay.nn.dense(a, relay.op.transpose(b))
    mod = tvm.IRModule.from_expr(out)
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target)
    rt = graph_executor.GraphModule(lib["default"](dev))
    a_np = np.random.uniform(-1, 1, (M, K)).astype("float32")
    b_np = np.random.uniform(-1, 1, (K, N)).astype("float32")
    rt.set_input("a", a_np)
    rt.set_input("b", b_np)
    rt.run()
    tvm_out = rt.get_output(0).numpy()
    # Float64 reference: a float32 NumPy matmul does not guarantee
    # higher-precision accumulation, so compute in float64 and round once
    ref = (a_np.astype("float64") @ b_np.astype("float64")).astype("float32")
    # Tolerance scales with K (number of accumulated terms)
    atol = np.sqrt(K) * 1e-6
    np.testing.assert_allclose(tvm_out, ref, atol=atol, rtol=1e-5)
Fuzzing generates random (but valid) programs to find crashes and miscompilations:
Fuzzing Strategy for ML Compilers
══════════════════════════════════
Graph Fuzzer: generates random Relay/Relax programs
┌────────────────────────────────────────────────────┐
│ 1. Pick random ops (conv2d, relu, add, reshape) │
│ 2. Connect with valid shapes │
│ 3. Assign random dtypes │
│ 4. Compile with different opt_levels │
│ 5. Compare outputs against reference │
└────────────────────────────────────────────────────┘
Schedule Fuzzer: generates random TIR schedules
┌────────────────────────────────────────────────────┐
│ 1. Take known-correct TIR PrimFunc │
│ 2. Apply random schedule transforms │
│ (split, reorder, vectorize, parallelize) │
│ 3. Compile and run │
│ 4. Compare output against unscheduled version │
└────────────────────────────────────────────────────┘
What fuzzing catches:
✓ Segfaults in generated code (bounds violations)
✓ Assertion failures in compilation passes
✓ Shape inference bugs (invalid intermediate shapes)
✓ Memory layout mismatches between passes
✓ Numerical divergence from reference
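The schedule-fuzzer loop can be imitated in pure Python, with random tiling of a sum reduction standing in for a TIR schedule transform. Everything here is hypothetical and independent of TVM: `reference_sum` plays the known-correct function, `tiled_sum` plays the scheduled version, and the `handle_tail` flag deliberately plants a classic bug (dropping the leftover elements when the length is not a multiple of the tile size).

```python
import random

def reference_sum(xs):
    """Known-correct, unscheduled reduction."""
    total = 0.0
    for x in xs:
        total += x
    return total

def tiled_sum(xs, tile, handle_tail=True):
    """'Scheduled' reduction: sum full tiles, then (optionally) the tail."""
    n_full = len(xs) // tile
    total = 0.0
    for t in range(n_full):
        partial = 0.0
        for i in range(tile):
            partial += xs[t * tile + i]
        total += partial
    if handle_tail:  # setting this to False plants the bug
        for x in xs[n_full * tile:]:
            total += x
    return total

def fuzz_schedules(n_trials, handle_tail, seed=0):
    """Random sizes and tile factors; report cases that diverge from the reference."""
    rng = random.Random(seed)
    failures = []
    for trial in range(n_trials):
        n = rng.randint(1, 200)
        tile = rng.choice([2, 4, 8, 16, 32])
        xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        ref = reference_sum(xs)
        out = tiled_sum(xs, tile, handle_tail)
        if abs(out - ref) > 1e-9 + 1e-9 * abs(ref):
            failures.append((trial, n, tile))
    return failures

good_failures = fuzz_schedules(200, handle_tail=True)
buggy_failures = fuzz_schedules(200, handle_tail=False)
print(f"correct schedule: {len(good_failures)} failures")
print(f"buggy schedule:   {len(buggy_failures)} failures")
```

Recording `(trial, n, tile)` for each failure mirrors the value of logging seeds in a real fuzzer: every failing case here has a length that is not a multiple of the tile size, which points straight at the tail handling.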
import random
import numpy as np
import tvm
from tvm import relay

def generate_random_graph(depth=5, seed=42):
    """Generate a random but valid Relay graph for fuzz testing."""
    rng = random.Random(seed)
    np_rng = np.random.default_rng(seed)  # seeded so constants are reproducible too
    shape = (1, rng.choice([3, 16, 32, 64]), 8, 8)
    x = relay.var("input", shape=shape, dtype="float32")
    current = x
    channels = shape[1]
    for i in range(depth):
        op = rng.choice(["conv2d", "relu", "add_const", "pool"])
        if op == "conv2d":
            out_c = rng.choice([16, 32, 64])
            # Name weights by layer index so names can never collide
            w = relay.var(f"w_{i}", shape=(out_c, channels, 3, 3), dtype="float32")
            current = relay.nn.conv2d(current, w, padding=(1, 1), channels=out_c)
            channels = out_c
        elif op == "relu":
            current = relay.nn.relu(current)
        elif op == "add_const":
            bias = relay.const(
                np_rng.standard_normal((1, channels, 1, 1)).astype("float32"))
            current = relay.add(current, bias)
        elif op == "pool":
            current = relay.nn.avg_pool2d(current, pool_size=(2, 2),
                                          strides=(1, 1), padding=(0, 0))
    return relay.Function(relay.analysis.free_vars(current), current)

def fuzz_compile_and_check(n_programs=100):
    """Fuzz-test the compiler with random programs."""
    failures = []
    for seed in range(n_programs):
        try:
            func = generate_random_graph(depth=4, seed=seed)
            mod = tvm.IRModule.from_expr(func)
            # Compile at max optimization
            with tvm.transform.PassContext(opt_level=3):
                lib = relay.build(mod, target="llvm")
            # Also compile at opt_level=0 and compare
            with tvm.transform.PassContext(opt_level=0):
                lib_ref = relay.build(mod, target="llvm")
            # If both compile, check outputs match
            # (exercise left to reader to fill in runtime comparison)
        except Exception as e:
            failures.append((seed, str(e)))
    print(f"Fuzzed {n_programs} programs, {len(failures)} failures")
    for seed, err in failures[:5]:
        print(f"  Seed {seed}: {err[:80]}")
Differential testing compiles the same program for different targets and compares:
Differential Testing Architecture
══════════════════════════════════
Input: Relay model + random test data
     │
┌────▼────────────────────────────────────────────┐
│             Same Relay IR (frozen)              │
└────┬──────────┬──────────┬──────────┬───────────┘
     │          │          │          │
┌────▼───┐ ┌────▼───┐ ┌────▼───┐ ┌────▼───┐
│  LLVM  │ │  CUDA  │ │ Vulkan │ │ OpenCL │
│ (CPU)  │ │ (GPU)  │ │ (GPU)  │ │ (GPU)  │
└────┬───┘ └────┬───┘ └────┬───┘ └────┬───┘
     │          │          │          │
┌────▼───┐ ┌────▼───┐ ┌────▼───┐ ┌────▼───┐
│ out_0  │ │ out_1  │ │ out_2  │ │ out_3  │
└────┬───┘ └────┬───┘ └────┬───┘ └────┬───┘
     │          │          │          │
     └──────────┴──────────┴──────────┘
                │
           ┌────▼────────────────────────┐
           │ Pairwise allclose checks    │
           │ with dtype-appropriate tol  │
           │                             │
           │ If ANY pair disagrees:      │
           │ → Bug in one of the backends│
           │ → Use CPU as reference      │
           └─────────────────────────────┘
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

def differential_test_model(mod, params, targets, test_inputs):
    """Run the same model on multiple backends and compare outputs."""
    outputs = {}
    for target_str in targets:
        target = tvm.target.Target(target_str)
        # Skip backends whose device is not present on this machine
        try:
            dev = tvm.device(target_str, 0)
            if not dev.exist:
                continue
        except Exception:
            continue
        with tvm.transform.PassContext(opt_level=3):
            lib = relay.build(mod, target=target, params=params)
        rt = graph_executor.GraphModule(lib["default"](dev))
        for name, data in test_inputs.items():
            rt.set_input(name, data)
        rt.run()
        outputs[target_str] = rt.get_output(0).numpy()

    # Compare every backend against the CPU reference
    ref_target = "llvm"
    if ref_target not in outputs:
        return  # Can't test without CPU baseline
    ref = outputs[ref_target]
    for target_str, out in outputs.items():
        if target_str == ref_target:
            continue
        max_diff = np.max(np.abs(out - ref))
        # The small constant avoids division by zero for zero-valued outputs
        rel_diff = np.max(np.abs(out - ref) / (np.abs(ref) + 1e-10))
        print(f"{ref_target} vs {target_str}: "
              f"max_abs_diff={max_diff:.2e}, max_rel_diff={rel_diff:.2e}")
        np.testing.assert_allclose(
            out, ref, atol=1e-4, rtol=1e-4,
            err_msg=f"Mismatch between {ref_target} and {target_str}"
        )

# Usage:
# differential_test_model(
#     mod, params,
#     targets=["llvm", "cuda", "vulkan"],
#     test_inputs={"data": np.random.randn(1, 3, 224, 224).astype("float32")}
# )
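The diagram calls for pairwise checks, while the function above compares each backend only against the CPU. A minimal pure-Python sketch of the pairwise step follows; the helper names and the sample outputs are made up for illustration, and real outputs would be NumPy arrays rather than lists.

```python
from itertools import combinations

def close(a, b, atol=1e-4, rtol=1e-4):
    """Element-wise allclose-style check for two equal-length sequences."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))

def pairwise_disagreements(outputs, atol=1e-4, rtol=1e-4):
    """Return every pair of backends whose outputs differ beyond tolerance."""
    names = sorted(outputs)
    return [(p, q) for p, q in combinations(names, 2)
            if not close(outputs[p], outputs[q], atol, rtol)]

def suspects(outputs, disagreements):
    """Backends that take part in every disagreement (empty if all agree)."""
    if not disagreements:
        return []
    return [name for name in sorted(outputs)
            if all(name in pair for pair in disagreements)]

# Hypothetical outputs: cuda differs by rounding only, vulkan is wrong.
outputs = {
    "llvm":   [1.0, 2.0, 3.0],
    "cuda":   [1.00001, 2.0, 3.0],
    "vulkan": [1.0, 2.5, 3.0],
}
bad_pairs = pairwise_disagreements(outputs)
print(bad_pairs)                     # [('cuda', 'vulkan'), ('llvm', 'vulkan')]
print(suspects(outputs, bad_pairs))  # ['vulkan']
```

With three or more backends, the one that appears in every disagreeing pair is the likely culprit. With only two, the pairwise check cannot say which side is wrong, which is where the CPU reference comes in.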
import tvm
import tvm.testing
import tvm.relay.testing
from tvm import relay

# 1. Parametrize across targets (skips unavailable ones)
@tvm.testing.parametrize_targets("llvm", "cuda", "vulkan")
def test_my_op(target, dev):
    pass  # dev is automatically created for target

# 2. Check numerical equality with explicit tolerances
#    (actual and expected are the arrays produced by your test)
tvm.testing.assert_allclose(actual, expected, rtol=1e-5, atol=1e-5)

# 3. Known failure marking (for work-in-progress)
@tvm.testing.known_failing_targets("vulkan")
def test_broken_on_vulkan(target, dev):
    pass

# 4. Requires specific hardware
@tvm.testing.requires_cuda
def test_cuda_specific():
    pass

@tvm.testing.requires_gpu
def test_any_gpu():
    pass

# 5. End-to-end model test on every enabled target
@tvm.testing.parametrize_targets
def test_resnet(target, dev):
    mod, params = relay.testing.resnet.get_workload(
        num_layers=18, batch_size=1
    )
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target, params=params)
    # ... run and validate
TVM CI Matrix
═════════════
Job                 │ What it tests                    │ Time
────────────────────┼──────────────────────────────────┼──────
lint                │ clang-format, pylint, mypy       │ ~3 min
build-cpu           │ cmake + ninja (CPU-only)         │ ~10 min
build-gpu           │ cmake + ninja (CUDA enabled)     │ ~15 min
test-cpu            │ pytest tests/ (no GPU required)  │ ~30 min
test-gpu            │ pytest tests/ (CUDA + cuDNN)     │ ~45 min
test-arm            │ Cross-compile + QEMU             │ ~20 min
docs                │ Sphinx build + link check        │ ~10 min
────────────────────┼──────────────────────────────────┼──────
Total (sequential)  │                                  │ ~2.5 hr
Total (parallel CI) │                                  │ ~50 min
import numpy as np
# Compute a large reduction in float32 and float16
K = 4096
a = np.random.randn(K).astype("float32")
sum_sequential = np.float32(0.0)
for x in a:
    sum_sequential += np.float32(x)
sum_numpy = np.sum(a) # uses pairwise summation internally
print(f"Sequential sum: {sum_sequential}")
print(f"NumPy sum: {sum_numpy}")
print(f"Abs difference: {abs(sum_sequential - sum_numpy)}")
# Repeat in float16 — observe much larger divergence
a16 = a.astype("float16")
sum16_seq = np.float16(0.0)
for x in a16:
    sum16_seq += np.float16(x)
sum16_np = np.sum(a16)
print(f"\nFP16 sequential: {sum16_seq}")
print(f"FP16 NumPy: {sum16_np}")
print(f"FP16 abs diff: {abs(float(sum16_seq) - float(sum16_np))}")
Take any Relay model (use relay.testing.resnet.get_workload) and:
1. Compile for "llvm" and "llvm -mcpu=skylake-avx512" (different instruction sets)
2. Run both with the same random input
3. Compare outputs — are they identical? Why or why not?
Extend the graph fuzzer from Section 4:
- Add batch_norm, sigmoid, and concat ops
- Track the maximum absolute difference between opt_level=0 and opt_level=3
- Report which random seeds produce the largest divergence
The tvm.testing module provides parametric target testing, hardware guards, and numerical comparison utilities out of the box.
Day 49 is a consolidation day marking the end of Phase III. You'll build a full concept map of the TVM stack, create a comparison matrix (TVM vs XLA vs Triton vs ORT vs MLIR ecosystem), take a self-assessment quiz, and verify you're ready for Phase IV's focus on inference optimization and deployment.