Phase III · Week 7 · Day 45 of 70 · 2.5 hours
"The best deployment framework isn't the one with the most optimizations — it's the one that runs everywhere your models need to go."
ONNX Runtime (ORT) is one of the most widely deployed ML inference engines in production. It powers inference at Microsoft (Azure ML, Office, Bing, Xbox), Hugging Face, and thousands of other companies. While TVM and XLA focus on compiling models down to optimized kernels, ORT takes a pragmatic approach: it partitions the model graph and hands each subgraph to the best available backend (TensorRT for some ops, CUDA for others, CPU for the rest). This "best tool for each op" philosophy, combined with the ONNX standard format, makes ORT a common default for deploying trained models. Understanding ORT's architecture shows how this design differs from compiler-centric approaches.
ONNX Runtime Architecture
══════════════════════════
┌──────────────────────────────────────────────┐
│              ONNX Model (.onnx)              │
│     (framework-agnostic graph + weights)     │
└──────────────────────┬───────────────────────┘
                       │
┌──────────────────────▼───────────────────────┐
│              Graph Optimization              │
│  Level 1:  Basic (const folding, dead code)  │
│  Level 2:  Extended (complex node fusions)   │
│  Level 99: All (adds layout optimization)    │
└──────────────────────┬───────────────────────┘
                       │
┌──────────────────────▼───────────────────────┐
│              Graph Partitioning              │
│   Assign subgraphs to Execution Providers    │
│                                              │
│ ┌────────┐ ┌──────────┐ ┌────────┐ ┌───────┐ │
│ │TensorRT│ │   CUDA   │ │OpenVINO│ │  CPU  │ │
│ │   EP   │ │    EP    │ │   EP   │ │  EP   │ │
│ └───┬────┘ └────┬─────┘ └───┬────┘ └───┬───┘ │
└─────┼───────────┼───────────┼──────────┼─────┘
      │           │           │          │
┌─────▼───────────▼───────────▼──────────▼─────┐
│               Execution Engine               │
│  • Sequential or parallel execution          │
│  • Memory planning (arena allocator)         │
│  • Kernel dispatch                           │
└──────────────────────────────────────────────┘
| Principle | Implementation |
|---|---|
| Format-first | ONNX is the input — no tracing or re-compilation |
| Best backend per op | Graph partitioning across Execution Providers |
| Incremental optimization | Three levels of graph transforms |
| Zero-copy where possible | Arena allocator with memory reuse patterns |
| Backward compatible | Old models run on new ORT versions |
ORT applies graph transformations in three levels, from safe to aggressive:
import onnxruntime as ort
# Configure optimization level
options = ort.SessionOptions()
# Level 0: No optimization (debugging)
options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL
# Level 1: Basic (semantics-preserving, always safe)
# • Constant folding
# • Dead node elimination
# • Redundant node elimination
# • Simple fusions (Conv+BN, Conv+Add, Conv+Mul)
options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_BASIC
# Level 2: Extended (may change numerics slightly)
# • Complex operator fusion (MatMul+Add, Conv+activation, etc.)
# • Attention fusion
# • GELU fusion
# • Layer normalization fusion
options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED
# Level 99: All optimizations including layout transforms (the default)
options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# The assignments above are alternatives; only the last one takes effect.
session = ort.InferenceSession("model.onnx", options)
ORT Fusion Patterns (selected)
══════════════════════════════
1. Conv + BatchNorm → Conv (BN folded into the weights)
   ┌──────┐   ┌────┐        ┌──────┐
   │ Conv │──→│ BN │  ═══→  │ Conv │  (rescaled weights and bias)
   └──────┘   └────┘        └──────┘
2. MatMul + Add → Gemm
   ┌────────┐   ┌─────┐        ┌───────────────┐
   │ MatMul │──→│ Add │  ═══→  │     Gemm      │
   └────────┘   └─────┘        │ (α·A·B + β·C) │
                               └───────────────┘
3. Multi-Head Attention Fusion
   ┌────┐  ┌────┐  ┌────┐
   │ Wq │  │ Wk │  │ Wv │
   └──┬─┘  └──┬─┘  └──┬─┘          ┌────────────────────┐
      │       │       │     ═══→   │ MultiHeadAttention │
   ┌──▼───────▼───────▼──┐         │ (single fused op)  │
   │  QKV + softmax +    │         └────────────────────┘
   │  weighted sum       │
   └─────────────────────┘
4. LayerNorm Fusion
   ReduceMean → Sub → Pow → ... → Div → Mul → Add   ═══→   LayerNorm
5. GELU Approximation Fusion
   x * 0.5 * (1 + tanh(√(2/π) * (x + 0.044715 * x³)))
   ═══→ FastGelu(x)
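To see why folding BatchNorm into the preceding convolution is safe, it helps to check the arithmetic on a tiny case. The sketch below is not ORT code: it models a 1×1 convolution at a single pixel as a matrix-vector product, and all parameter values are made up for illustration. With scale = γ / √(σ² + ε), the folded parameters are W' = scale · W and b' = (b − μ) · scale + β.

```python
import math

def conv1x1(weights, bias, x):
    """Pointwise conv at one pixel: y[o] = sum_i W[o][i] * x[i] + b[o]."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode BatchNorm, applied per output channel."""
    return [g * (v - m) / math.sqrt(s + eps) + bt
            for v, g, bt, m, s in zip(y, gamma, beta, mean, var)]

def fold_bn_into_conv(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Return (W', b') such that conv(W', b') equals BN(conv(W, b))."""
    scale = [g / math.sqrt(s + eps) for g, s in zip(gamma, var)]
    folded_w = [[w * sc for w in row] for row, sc in zip(weights, scale)]
    folded_b = [(b - m) * sc + bt
                for b, m, sc, bt in zip(bias, mean, scale, beta)]
    return folded_w, folded_b

# Hypothetical parameters: 2 output channels, 3 input channels.
W = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
b = [0.1, -0.2]
gamma, beta = [1.2, 0.8], [0.05, -0.3]
mean, var = [0.4, -0.1], [2.0, 0.5]
x = [1.0, 2.0, -1.0]

unfused = batch_norm(conv1x1(W, b, x), gamma, beta, mean, var)
W_f, b_f = fold_bn_into_conv(W, b, gamma, beta, mean, var)
fused = conv1x1(W_f, b_f, x)

# The two paths should agree up to floating-point rounding.
max_diff = max(abs(u - f) for u, f in zip(unfused, fused))
print("max |unfused - fused| =", max_diff)
```

Because the BN statistics are constants at inference time, the fold happens once at load time and removes the BatchNorm node from the graph.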
Execution Providers are pluggable backends. ORT queries them in priority order — each EP claims the subgraphs it can handle:
Graph Partitioning with Execution Providers
════════════════════════════════════════════
Full Model Graph:
┌──────┐   ┌────┐   ┌───────────┐   ┌────────┐   ┌─────────┐
│ Conv │──→│ BN │──→│ Attention │──→│ Custom │──→│ Softmax │
└──────┘   └────┘   └───────────┘   │   Op   │   └─────────┘
                                    └────────┘
EP Priority: TensorRT > CUDA > CPU
TensorRT claims: Conv+BN (as fused engine)
CUDA claims: Attention, Softmax
CPU fallback: Custom Op (not supported by GPU EPs)
Result (in execution order):
┌── TensorRT EP ───┐  ┌─ CUDA EP ─┐  ┌─ CPU EP ─┐  ┌─ CUDA EP ─┐
│ Conv+BN (engine) │─→│ Attention │─→│ CustomOp │─→│  Softmax  │
└──────────────────┘  └───────────┘  └──────────┘  └───────────┘
     GPU memory        GPU memory     CPU memory    GPU memory
                                   ↕             ↕
                   device/host copy at each CUDA/CPU boundary
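The priority-based claiming described above can be sketched in a few lines of plain Python. This is a simplified model, not ORT's implementation: real execution providers report the subgraphs they can take through their own capability query, whereas here each hypothetical EP just exposes a set of supported op types. The node names and the supported-op sets are assumptions chosen to mirror the example graph.

```python
def assign_providers(nodes, ep_priority, ep_supported_ops):
    """Give each node to the first EP (in priority order) that supports its op."""
    assignment = []
    for name, op_type in nodes:
        for ep in ep_priority:
            if op_type in ep_supported_ops[ep]:
                assignment.append((name, ep))
                break
        else:
            raise ValueError(f"no execution provider supports {op_type}")
    return assignment

def group_partitions(assignment):
    """Merge consecutive nodes that landed on the same EP into one partition."""
    partitions = []
    for name, ep in assignment:
        if partitions and partitions[-1][0] == ep:
            partitions[-1][1].append(name)
        else:
            partitions.append((ep, [name]))
    return partitions

# Nodes in topological order: (hypothetical node name, op type).
nodes = [
    ("conv", "Conv"),
    ("bn", "BatchNormalization"),
    ("attn", "Attention"),
    ("custom", "CustomOp"),
    ("softmax", "Softmax"),
]
ep_priority = ["TensorRT", "CUDA", "CPU"]
# Assumed capabilities, for illustration only.
ep_supported_ops = {
    "TensorRT": {"Conv", "BatchNormalization"},
    "CUDA": {"Conv", "BatchNormalization", "Attention", "Softmax"},
    "CPU": {"Conv", "BatchNormalization", "Attention", "Softmax", "CustomOp"},
}

partitions = group_partitions(
    assign_providers(nodes, ep_priority, ep_supported_ops)
)

# Count the partition boundaries where data must cross between host and device.
host_device_crossings = sum(
    1 for (a, _), (b, _) in zip(partitions, partitions[1:])
    if (a == "CPU") != (b == "CPU")
)

for ep, names in partitions:
    print(ep, names)
print("host/device crossings:", host_device_crossings)
```

Note that a single CPU-only node sitting between two GPU nodes splits the GPU work into two partitions and adds two host/device copies, which is often the first thing to look for when a GPU session is slower than expected.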
| EP | Target Hardware | Key Advantage |
|---|---|---|
| CPUExecutionProvider | Any CPU | Always available, baseline |
| CUDAExecutionProvider | NVIDIA GPU | cuDNN/cuBLAS integration |
| TensorrtExecutionProvider | NVIDIA GPU | INT8/FP16, layer fusion engine |
| OpenVINOExecutionProvider | Intel CPU/GPU/VPU | Intel-optimized, edge devices |
| DirectMLExecutionProvider | Windows GPU | Any DirectX 12 GPU |
| CoreMLExecutionProvider | Apple Silicon | Neural Engine access |
| QNNExecutionProvider | Qualcomm NPU | Mobile (Snapdragon) |
| ROCmExecutionProvider | AMD GPU | MIOpen/rocBLAS |
| AzureExecutionProvider | Azure cloud | Remote inference |
import onnxruntime as ort
# Priority list: try TensorRT first, fall back to CUDA, then CPU
providers = [
    ('TensorrtExecutionProvider', {
        'trt_max_workspace_size': 2 * 1024 * 1024 * 1024,  # 2 GB
        'trt_fp16_enable': True,
        'trt_engine_cache_enable': True,
        'trt_engine_cache_path': './trt_cache/',
    }),
    ('CUDAExecutionProvider', {
        'device_id': 0,
        'arena_extend_strategy': 'kNextPowerOfTwo',
        'gpu_mem_limit': 4 * 1024 * 1024 * 1024,  # 4 GB
        'cudnn_conv_algo_search': 'EXHAUSTIVE',
    }),
    'CPUExecutionProvider',
]
session = ort.InferenceSession("model.onnx", providers=providers)
# List the EPs registered for this session, in priority order.
# get_providers() does not report per-node assignment; the profiling
# output records which provider ran each node.
for provider in session.get_providers():
    print(provider)
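A priority list only helps if the listed EPs exist in the installed ORT build; requesting a missing one can lead to a warning and a silent fallback. One hedged way to make the fallback explicit is to filter the list before creating the session. The helper below, `select_providers`, is a hypothetical name introduced here, and the `available` list in the demo is made up; in real use it would come from `ort.get_available_providers()`.

```python
def select_providers(preferred, available):
    """Keep the preferred EPs that are available, preserving priority order.

    Entries may be plain names or (name, options) tuples, as accepted by
    InferenceSession. CPUExecutionProvider is appended as a last resort.
    """
    def name_of(entry):
        return entry if isinstance(entry, str) else entry[0]

    chosen = [entry for entry in preferred if name_of(entry) in available]
    if "CPUExecutionProvider" not in [name_of(entry) for entry in chosen]:
        chosen.append("CPUExecutionProvider")
    return chosen

# Demo with a made-up machine that has CUDA but no TensorRT.
preferred = [
    ("TensorrtExecutionProvider", {"trt_fp16_enable": True}),
    ("CUDAExecutionProvider", {"device_id": 0}),
    "CPUExecutionProvider",
]
available = ["CUDAExecutionProvider", "CPUExecutionProvider"]

chosen = select_providers(preferred, available)
print(chosen)
```

With ORT installed, the result can be passed straight through: `ort.InferenceSession("model.onnx", providers=select_providers(preferred, ort.get_available_providers()))`.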
ORT uses an arena-based allocator to minimize memory allocation overhead:
ORT Memory Arena
════════════════
Pool Structure (pre-allocated chunks):
┌──────────────────────────────────────┐
│ Arena (e.g., 256 MB)                 │
│ ┌────┐┌────────┐┌──┐┌──────┐┌────┐   │
│ │used││  free  ││us││ free ││used│   │
│ │ 4MB││  12MB  ││2M││ 8MB  ││ 4MB│   │
│ └────┘└────────┘└──┘└──────┘└────┘   │
└──────────────────────────────────────┘
Allocation strategy:
1. Best-fit from existing free blocks
2. If no fit: extend arena (kNextPowerOfTwo)
3. Freed blocks are returned to pool, not to OS
Memory Reuse Patterns:
┌──────────────────────────────────────────┐
│ Execution order: A → B → C → D → E       │
│                                          │
│ A's buffer:  [████]                      │
│ B's buffer:        [████████]            │
│ C reuses A:  [████] ← same addr          │
│ D's buffer:                  [████]      │
│ E reuses B:        [████████] ← reused   │
└──────────────────────────────────────────┘
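The reuse pattern in the diagram follows from tensor lifetimes: once a tensor has been read for the last time, its buffer can serve a later tensor. The sketch below is a simplified lifetime-based planner with best-fit selection, not ORT's actual allocation planner. The sizes (in MB) and last-use steps are assumptions chosen so that the result matches the diagram.

```python
def plan_buffers(tensors):
    """Assign buffers to tensors, reusing dead ones with a best-fit policy.

    tensors: list of (name, size, last_use_step) in execution order, where
    step i is the step that produces the i-th tensor.
    Returns (placement, total_size), placement mapping name -> buffer id.
    """
    buffers = []    # buffers[i] is the size of buffer i
    free = []       # ids of buffers whose tensor is no longer needed
    live = []       # (last_use_step, buffer_id) for tensors still needed
    placement = {}
    for step, (name, size, last_use) in enumerate(tensors):
        # Release buffers whose tensor was last read before this step.
        still_live = []
        for last, buf in live:
            if last < step:
                free.append(buf)
            else:
                still_live.append((last, buf))
        live = still_live
        # Best fit: the smallest free buffer that is large enough.
        candidates = [buf for buf in free if buffers[buf] >= size]
        if candidates:
            buf = min(candidates, key=lambda i: buffers[i])
            free.remove(buf)
        else:
            buf = len(buffers)      # nothing fits, so grow the arena
            buffers.append(size)
        placement[name] = buf
        live.append((last_use, buf))
    return placement, sum(buffers)

# Hypothetical workload: (name, size in MB, step of last read).
tensors = [
    ("A", 4, 1),   # read by B
    ("B", 8, 3),   # read by C and D
    ("C", 4, 3),   # read by D
    ("D", 4, 4),   # read by E
    ("E", 8, 4),   # final output
]
placement, arena_mb = plan_buffers(tensors)
naive_mb = sum(size for _, size, _ in tensors)
print(placement)
print(f"arena: {arena_mb} MB with reuse, {naive_mb} MB without")
```

In this toy case C lands in A's buffer and E in B's, so three buffers cover five tensors.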
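import onnxruntime as ort
options = ort.SessionOptions()
# Record the allocation pattern so later runs with the same input shapes can pre-allocate
options.enable_mem_pattern = True
# Let tensors with non-overlapping lifetimes share buffers (reduces peak memory)
options.enable_mem_reuse = True
# Run nodes one at a time; ORT_PARALLEL runs independent branches concurrently
options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
# Threads used inside a single operator (e.g., capped for server deployment)
options.intra_op_num_threads = 4
# Threads used across independent operators; only takes effect with ORT_PARALLEL
options.inter_op_num_threads = 2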
ORT Approach                 TVM Approach
════════════                 ════════════
 Model.onnx                  Model (any framework)
     │                             │
┌────▼────────┐              ┌─────▼───────┐
│ Graph Opt   │              │ Import to   │
│ (rewrites)  │              │ Relay/Relax │
└────┬────────┘              └─────┬───────┘
     │                             │
┌────▼────────┐              ┌─────▼───────┐
│ Partition   │              │ Compile +   │
│ to EPs      │     vs.      │ Auto-Tune   │
└────┬────────┘              └─────┬───────┘
     │                             │
┌────▼────────┐              ┌─────▼───────┐
│ Call vendor │              │ Generate    │
│ libraries   │              │ custom      │
│ (cuDNN,     │              │ kernels     │
│ TensorRT)   │              │             │
└─────────────┘              └─────────────┘
ORT: "Best library for each op"
TVM: "Best generated kernel for each op"
| Scenario | Choose ORT | Choose TVM |
|---|---|---|
| Quick deployment | ✅ Drop in ONNX, done | Needs tuning time |
| NVIDIA GPU | ✅ cuDNN/TRT tuned | Good, less mature |
| Custom hardware | ❌ Needs EP impl | ✅ Write schedule |
| Edge / MCU | Limited (mobile EPs) | ✅ µTVM, bare metal |
| Transformer models | ✅ Built-in attention fusion | ✅ Auto-tuned |
| Model variety | ✅ 300+ ONNX ops | Some ops need impl |
| Latency-critical | ✅ TRT EP very fast | ✅ After tuning |
| Cross-platform | ✅ Windows/Linux/Mac | Linux-focused |
import torch
import torch.nn as nn
import onnx
import onnxruntime as ort
import numpy as np
# 1. Define a model
class TransformerBlock(nn.Module):
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.GELU(),
            nn.Linear(d_model * 4, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)
    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.ff(x))
        return x
model = TransformerBlock().eval()
dummy = torch.randn(1, 16, 256)
# 2. Export to ONNX
torch.onnx.export(
    model, dummy, "transformer_block.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch", 1: "seq_len"}},
    opset_version=17,
)
# 3. Load with different optimization levels and compare
for level_name, level in [
    ("DISABLED", ort.GraphOptimizationLevel.ORT_DISABLE_ALL),
    ("BASIC", ort.GraphOptimizationLevel.ORT_ENABLE_BASIC),
    ("EXTENDED", ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED),
    ("ALL", ort.GraphOptimizationLevel.ORT_ENABLE_ALL),
]:
    opts = ort.SessionOptions()
    opts.graph_optimization_level = level
    # Save optimized model to inspect changes
    opts.optimized_model_filepath = f"optimized_{level_name}.onnx"
    sess = ort.InferenceSession("transformer_block.onnx", opts,
                                providers=["CPUExecutionProvider"])
    print(f"{level_name}: loaded successfully")
# 4. Compare: count nodes in original vs optimized
for name in ["transformer_block", "optimized_ALL"]:
    m = onnx.load(f"{name}.onnx")
    print(f"{name}: {len(m.graph.node)} nodes")
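A raw node count shows that the graph shrank, but not which fusions fired. Comparing per-op-type counts before and after optimization makes that visible. The helper below is a small sketch with a hypothetical name (`op_histogram_diff`); the two op lists are made up and stand in for `[n.op_type for n in m.graph.node]` taken from the original and the optimized model.

```python
from collections import Counter

def op_histogram_diff(before_ops, after_ops):
    """Return {op_type: after_count - before_count} for op types that changed."""
    before, after = Counter(before_ops), Counter(after_ops)
    return {
        op: after[op] - before[op]
        for op in sorted(set(before) | set(after))
        if after[op] != before[op]
    }

# Made-up op lists: a MatMul+Add pair and an unfused LayerNorm chain
# before optimization, and their fused replacements afterwards.
before_ops = ["MatMul", "Add", "ReduceMean", "Sub", "Pow", "ReduceMean",
              "Add", "Sqrt", "Div", "Mul", "Add"]
after_ops = ["Gemm", "LayerNormalization"]

diff = op_histogram_diff(before_ops, after_ops)
for op, delta in diff.items():
    print(f"{op:<20} {delta:+d}")
```

Op types with a positive delta are the fused replacements; those with a negative delta are the primitives they absorbed.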
import onnxruntime as ort
import numpy as np
import json
# Enable profiling
options = ort.SessionOptions()
options.enable_profiling = True
options.profile_file_prefix = "ort_profile"
session = ort.InferenceSession(
    "transformer_block.onnx", options,
    providers=["CPUExecutionProvider"]
)
# Run inference
input_data = np.random.randn(4, 32, 256).astype(np.float32)
for _ in range(10):
    session.run(None, {"input": input_data})
# Get profiling results
profile_file = session.end_profiling()
print(f"Profile saved to: {profile_file}")
# Parse and display top-10 slowest ops
with open(profile_file) as f:
    events = json.load(f)
kernel_events = [e for e in events if e.get("cat") == "Node"]
kernel_events.sort(key=lambda e: e.get("dur", 0), reverse=True)
print("\nTop 10 slowest operations:")
print(f"{'Op Name':<40} {'Duration (µs)':>12}")
print("-" * 54)
for e in kernel_events[:10]:
    print(f"{e['name']:<40} {e['dur']:>12}")
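A top-10 list of individual events can be dominated by repeated runs of the same node. Summing durations per operator type often gives a clearer picture of where time goes. The sketch below assumes each node event carries its operator type under `args["op_name"]`, which is how recent ORT profiles are laid out as far as I know; it falls back to the event name otherwise. The events in the demo are synthetic and their durations are made up.

```python
from collections import defaultdict

def total_time_by_op(events):
    """Sum 'dur' (µs) of Node events per operator type, slowest first."""
    totals = defaultdict(int)
    for event in events:
        if event.get("cat") != "Node":
            continue  # skip session-level events such as model loading
        # Assumption: op type lives in args["op_name"]; fall back to the name.
        op = event.get("args", {}).get("op_name") or event.get("name", "?")
        totals[op] += event.get("dur", 0)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Synthetic events that mimic the profile layout; durations are made up.
events = [
    {"cat": "Session", "name": "model_run", "dur": 900},
    {"cat": "Node", "name": "MatMul_1_kernel_time", "dur": 300,
     "args": {"op_name": "MatMul"}},
    {"cat": "Node", "name": "MatMul_2_kernel_time", "dur": 250,
     "args": {"op_name": "MatMul"}},
    {"cat": "Node", "name": "Softmax_1_kernel_time", "dur": 120,
     "args": {"op_name": "Softmax"}},
]

totals = total_time_by_op(events)
for op, dur in totals:
    print(f"{op:<20} {dur:>8} µs")
```

With a real profile, pass the list returned by `json.load` on the profile file instead of the synthetic `events`.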
import time
import onnxruntime as ort
import numpy as np
# Check available execution providers
print("Available EPs:", ort.get_available_providers())
# Compare performance across EPs
model_path = "transformer_block.onnx"
input_data = np.random.randn(8, 64, 256).astype(np.float32)
for ep in ort.get_available_providers():
    try:
        session = ort.InferenceSession(model_path, providers=[ep])
        # Warmup
        for _ in range(5):
            session.run(None, {"input": input_data})
        # Benchmark
        t0 = time.perf_counter()
        for _ in range(100):
            session.run(None, {"input": input_data})
        elapsed = (time.perf_counter() - t0) / 100
        print(f"{ep:<35} {elapsed*1000:.2f} ms/inference")
    except Exception as e:
        print(f"{ep:<35} FAILED: {e}")
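The benchmark above reports a mean, which can hide tail latency caused by an occasional slow run. If per-iteration timings are collected instead, a short summary with median and a high percentile is usually more informative. The helper below is a sketch with a hypothetical name (`latency_summary`) using a nearest-rank percentile; the sample timings are made up.

```python
import statistics

def latency_summary(samples_ms):
    """Summarize per-run latencies (ms): mean, median, nearest-rank p95, max."""
    ordered = sorted(samples_ms)
    # Nearest-rank p95: the smallest sample with at least 95% of samples
    # at or below it. Integer arithmetic avoids float rounding in the rank.
    rank = max(1, (95 * len(ordered) + 99) // 100)
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }

# Made-up timings: mostly 1.0 ms, with one slow outlier.
samples_ms = [1.0] * 18 + [1.2, 9.0]
summary = latency_summary(samples_ms)
for key, value in summary.items():
    print(f"{key:<7} {value:.2f} ms")
```

To feed it real data, record `time.perf_counter()` around each `session.run` call and append the difference (times 1000) to a list rather than timing the whole loop.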
Day 46 explores MLC-LLM — the TVM team's project for compiling large language models to run everywhere, from NVIDIA GPUs to iPhones to web browsers. You'll see how TVM's Relax IR, auto-tuning, and quantization combine to make Llama and other LLMs portable across any hardware.