Phase VII | VLAs: Architecture to Deployment | Week 16 | 2.5 hours

"A 7B parameter model running at 3 Hz is a research demo. The same model running at 20 Hz on edge hardware is a product." (Deployment Reality)
Total control loop: ≤ 50 ms  (20 Hz) for manipulation
                    ≤ 100 ms (10 Hz) for mobile manipulation

Budget breakdown (20 Hz manipulation loop):
  Image capture + transfer:   5 ms
  Vision encoder:            10 ms  ← Bottleneck #1
  VLM reasoning:             20 ms  ← Bottleneck #2
  Action head:                5 ms
  Safety check:               2 ms
  Communication:              3 ms
  Motor command:              5 ms
  ────────────────────────────────────
  Total:                     50 ms
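
One way to keep this budget honest is to encode it as data and check it against the target rate. The sketch below is illustrative only: the stage names and millisecond values are copied from the breakdown above, while `check_budget` is a hypothetical helper that assumes the stages run strictly in sequence.

```python
# Per-stage latency budget in milliseconds, copied from the breakdown above.
BUDGET_MS = {
    "image_capture_transfer": 5,
    "vision_encoder": 10,
    "vlm_reasoning": 20,
    "action_head": 5,
    "safety_check": 2,
    "communication": 3,
    "motor_command": 5,
}


def check_budget(stages_ms, target_hz):
    """Return (total_ms, slack_ms, achievable_hz) for a serial control loop.

    Hypothetical helper: assumes the stages run one after another, so the
    loop period is the sum of the stage latencies.
    """
    total_ms = sum(stages_ms.values())
    period_ms = 1000.0 / target_hz
    return total_ms, period_ms - total_ms, 1000.0 / total_ms


total, slack, hz = check_budget(BUDGET_MS, target_hz=20)
print(f"total={total} ms, slack={slack:.1f} ms, max rate={hz:.1f} Hz")

# The two largest stages are the ones flagged as bottlenecks above.
worst = sorted(BUDGET_MS, key=BUDGET_MS.get, reverse=True)[:2]
print("largest stages:", worst)
```

With zero slack at 20 Hz, any stage that overruns its allocation has to be paid for by another stage, which is why the rest of this lesson concentrates on the two largest entries.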
| Technique | Compression | Speed Gain | Quality Loss | Best For |
|---|---|---|---|---|
| FP16 | 2× | 1.5-2× | Negligible | Always use |
| INT8 (PTQ) | 4× | 2-3× | <1% | Standard deployment |
| INT4 (GPTQ) | 8× | 3-5× | 2-5% | Edge devices |
| Pruning | 2-5× | 2-4× | 1-3% | Structured sparsity |
| Distillation | 10-50× | 5-20× | 3-10% | Dedicated student |
| ONNX + TensorRT | — | 2-5× | None | NVIDIA hardware |
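
The compression column for the quantization rows follows directly from bytes per weight. As a rough worked example, the sketch below computes weight storage for the 7B model from the opening quote. It assumes compression is measured against FP32 and ignores activations, caches, and quantization metadata, so real footprints will be somewhat larger. `weight_memory_gb` is a hypothetical helper.

```python
# Bytes per weight for each precision; compression is relative to FP32.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(n_params, precision):
    """Weight storage in GB (10^9 bytes).

    Ignores activations, KV cache, and quantization metadata such as
    per-channel scales and zero points.
    """
    return n_params * BYTES_PER_PARAM[precision] / 1e9


n_params = 7e9  # the 7B-parameter model from the opening quote
for precision in BYTES_PER_PARAM:
    gb = weight_memory_gb(n_params, precision)
    ratio = BYTES_PER_PARAM["FP32"] / BYTES_PER_PARAM[precision]
    print(f"{precision}: {gb:.1f} GB ({ratio:.0f}x smaller than FP32)")
```

The speed gains in the table are smaller than the compression ratios because latency also depends on memory bandwidth, kernel support for the data type, and the parts of the model that stay in higher precision.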
The vision encoder is often a bottleneck. Three ways to cut its latency:
# Strategy 1: Reduce resolution
#   ViT-Large at 224×224: ~15ms
#   ViT-Large at 128×128: ~6ms (2.5× faster, ~3% quality drop)
# Strategy 2: Smaller backbone
#   ViT-Large (307M): ~15ms
#   ViT-Base (86M):   ~5ms
#   ViT-Small (22M):  ~2ms
#   MobileNetV3:      ~1ms
# Strategy 3: Cache when scene is static
#   Key insight: if nothing in the frame moves (no hand or object motion),
#   the vision features are nearly identical to those from the last step.
#   Check: ||frame_t - frame_{t-1}|| < threshold
#   If static: reuse cached features (~0ms instead of ~15ms)
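
Strategy 1 works because a ViT's token count falls with the square of the image side length. The sketch below counts patch tokens at the resolutions used in this lesson. The patch size of 16 is an assumption, since the text does not say which ViT variant the timings refer to, and `num_patches` is a hypothetical helper.

```python
def num_patches(image_size, patch_size=16):
    """Patch tokens for a square image in a standard ViT (class token excluded).

    patch_size=16 is an assumption; the text does not name the ViT variant.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return (image_size // patch_size) ** 2


base = num_patches(224)
for size in (224, 160, 128, 96):
    n = num_patches(size)
    print(f"{size}x{size}: {n} tokens, {base / n:.2f}x fewer than 224x224")
```

Under this assumption, 128×128 yields roughly 3× fewer tokens than 224×224. The quoted 2.5× speedup is a little lower, which is plausible if some of the encoder's cost does not scale with token count.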
Tier 1: Cloud GPU (best quality, highest latency)
  NVIDIA A100/H100: compute is effectively unconstrained
  Latency: 5-20ms model + 10-50ms network
  Best for: fleet management, offline training
Tier 2: On-robot GPU (balanced)
  NVIDIA Jetson Orin (275 TOPS), $1500
  Latency: 15-50ms
  Best for: autonomous robots
Tier 3: Edge TPU / NPU (low power)
  Google Edge TPU (4 TOPS), $75
  Intel Myriad X (1 TOPS), $50
  Latency: 20-100ms (small models only)
  Best for: simple perception + small action head
Tier 4: Custom ASIC (future)
  Specialized VLA accelerator
  Latency: <5ms (theoretical)
  Best for: mass-produced robots
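
To see how these tiers line up against the control-loop targets, the sketch below compares each tier's latency range with a loop period. The ranges are taken from the list above, with Tier 1 adding its model and network ranges together. `classify` is a hypothetical helper, and comparing against the full loop period is optimistic because it ignores capture, safety, and motor stages.

```python
# (best_ms, worst_ms) latency ranges taken from the tier list above.
# Tier 1 adds the model range (5-20 ms) to the network range (10-50 ms).
TIER_LATENCY_MS = {
    "cloud_gpu": (5 + 10, 20 + 50),
    "on_robot_gpu": (15, 50),
    "edge_tpu_npu": (20, 100),
}


def classify(tier, budget_ms):
    """Compare a tier's latency range with a control-loop budget."""
    best, worst = TIER_LATENCY_MS[tier]
    if worst <= budget_ms:
        return "always fits"
    if best <= budget_ms:
        return "fits only in the best case"
    return "never fits"


for budget_ms in (50, 100):  # 20 Hz manipulation, 10 Hz mobile manipulation
    for tier in TIER_LATENCY_MS:
        print(f"{budget_ms} ms budget, {tier}: {classify(tier, budget_ms)}")
```

Tier 4 is left out because its latency is listed as theoretical.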
Option A: On-robot inference
┌──────────────────────────┐
│  Robot (Jetson Orin)     │
│  ┌──────────────────┐    │
│  │ VLA model (INT8) │    │
│  │ TensorRT runtime │    │
│  └──────────────────┘    │
│  Latency: 30-50ms        │
│  No network dependency   │
└──────────────────────────┘
Option B: Cloud inference
┌──────────┐   WiFi/5G   ┌──────────┐
│  Robot   │ ←─────────→ │  Cloud   │
│  Camera  │   20-50ms   │  A100    │
│  Motors  │   network   │  VLA     │
└──────────┘             └──────────┘
Latency: 40-100ms (network dependent)
Option C: Hybrid (recommended)
┌───────────────────────────────┐
│  Robot (Jetson)               │
│  ┌──────────────────┐         │
│  │ Small action     │←─ VLM   │ ← Cloud
│  │ head (local)     │   feats │   (async)
│  └──────────────────┘         │
│  Latency: 10ms local          │
│  VLM updates: 200ms async     │
└───────────────────────────────┘
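
The cost of Option C is that the local action head acts on VLM features that are already old. The discrete-time sketch below estimates how old. It assumes a request is sent every 200 ms and that its features arrive 200 ms later, which is one reading of "200ms async" in the diagram, with the local loop ticking every 10 ms. `simulate_hybrid` is a hypothetical helper, not part of the lesson's code.

```python
def simulate_hybrid(local_period_ms, cloud_period_ms, cloud_latency_ms, duration_ms):
    """Discrete-time sketch of the hybrid option.

    A cloud request is sent every cloud_period_ms and its features arrive
    cloud_latency_ms later. The local loop ticks every local_period_ms and
    uses the newest features that have arrived. Returns, for each local tick,
    the age in ms of the observation behind the features in use (None before
    the first features arrive).
    """
    ages = []
    for t in range(0, duration_ms, local_period_ms):
        # Send times of all requests whose results have arrived by time t.
        arrived = [s for s in range(0, t + 1, cloud_period_ms)
                   if s + cloud_latency_ms <= t]
        ages.append(t - arrived[-1] if arrived else None)
    return ages


ages = simulate_hybrid(local_period_ms=10, cloud_period_ms=200,
                       cloud_latency_ms=200, duration_ms=1000)
valid = [a for a in ages if a is not None]
print(f"local ticks: {len(ages)}, ticks without features: {ages.count(None)}")
print(f"feature age: min {min(valid)} ms, max {max(valid)} ms")
```

Under these assumptions the action head always works from an observation between 200 and 390 ms old, so the hybrid design fits tasks where the scene-level plan changes slowly and fast corrections can come from proprioception.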
import torch
import time
import numpy as np


class LatencyProfiler:
    """Profile VLA inference latency."""

    def __init__(self, model, device="cuda"):
        # eval() disables dropout and batch-norm updates so timings reflect inference.
        self.model = model.to(device).eval()
        self.device = device
        # Also matches device strings such as "cuda:0".
        self.is_cuda = str(device).startswith("cuda")

    def _sync(self):
        # CUDA kernels launch asynchronously; wait for them before reading the clock.
        if self.is_cuda:
            torch.cuda.synchronize()

    def profile(self, input_shape, n_warmup=10, n_runs=100):
        """Measure inference latency in milliseconds."""
        dummy_input = torch.randn(*input_shape, device=self.device)

        # Warmup: keeps one-time startup costs out of the timed runs.
        with torch.no_grad():
            for _ in range(n_warmup):
                self.model(dummy_input)
        self._sync()

        # Timed runs
        latencies = []
        with torch.no_grad():
            for _ in range(n_runs):
                self._sync()
                start = time.perf_counter()
                self.model(dummy_input)
                self._sync()
                end = time.perf_counter()
                latencies.append((end - start) * 1000)  # ms

        return {
            "mean_ms": np.mean(latencies),
            "p50_ms": np.percentile(latencies, 50),
            "p95_ms": np.percentile(latencies, 95),
            "p99_ms": np.percentile(latencies, 99),
            "std_ms": np.std(latencies),
            # Rate is derived from p95, not the mean: the control loop must
            # also meet its deadline on slow iterations.
            "max_hz": 1000 / np.percentile(latencies, 95),
        }


class ModelQuantizer:
    """Quantize model for deployment."""

    @staticmethod
    def to_fp16(model):
        """Convert weights to FP16 (inputs must then be FP16 as well)."""
        return model.half()

    @staticmethod
    def dynamic_int8(model):
        """Dynamic INT8 quantization (PyTorch, CPU inference).

        Linear weights are stored as INT8; activations are quantized at runtime.
        """
        return torch.quantization.quantize_dynamic(
            model, {torch.nn.Linear}, dtype=torch.qint8
        )

    @staticmethod
    def export_onnx(model, dummy_input, path="model.onnx"):
        """Export to ONNX for TensorRT/other runtimes."""
        model.eval()  # export the inference-mode graph
        torch.onnx.export(
            model, dummy_input, path,
            input_names=["observation"],
            output_names=["action"],
            # Leave the batch dimension dynamic so the runtime can vary batch size.
            dynamic_axes={"observation": {0: "batch"},
                          "action": {0: "batch"}},
            opset_version=14,
        )
        print(f"Exported to {path}")


class VisionCacheManager:
    """Cache vision encoder outputs when scene is static."""

    def __init__(self, encoder, change_threshold=0.02):
        self.encoder = encoder
        self.threshold = change_threshold  # mean absolute pixel difference
        self.cached_features = None
        self.cached_frame = None
        self.cache_hits = 0
        self.total_calls = 0

    @torch.no_grad()
    def encode(self, frame):
        self.total_calls += 1
        if self.cached_frame is not None:
            # Check if scene changed. The comparison is against the frame that
            # produced the cached features, not the previous frame, so slow
            # drift accumulates and eventually triggers a recompute.
            diff = (frame - self.cached_frame).abs().mean().item()
            if diff < self.threshold:
                self.cache_hits += 1
                return self.cached_features
        # Scene changed or first call: compute features
        self.cached_features = self.encoder(frame)
        self.cached_frame = frame.clone()
        return self.cached_features

    @property
    def cache_hit_rate(self):
        # max(1, ...) avoids division by zero before the first call.
        return self.cache_hits / max(1, self.total_calls)


class HybridInferenceServer:
    """Hybrid local + cloud inference."""

    def __init__(self, local_action_head, vision_cache, cloud_interval_ms=200):
        self.action_head = local_action_head
        self.vision_cache = vision_cache
        self.cloud_interval = cloud_interval_ms / 1000  # seconds
        # -inf makes should_update_cloud() return True before the first update.
        self.last_cloud_time = float("-inf")
        self.vlm_features = None

    def local_step(self, proprio):
        """Fast local action prediction (~5ms)."""
        if self.vlm_features is None:
            return None  # no VLM features received yet
        with torch.no_grad():
            # vlm_features: (1, D); proprio: (P,) -> action head input: (1, D + P)
            action = self.action_head(
                torch.cat([self.vlm_features, proprio.unsqueeze(0)], dim=-1)
            )
        return action

    def cloud_update(self, frame, instruction):
        """Refresh VLM features (~200ms from the cloud in a real system)."""
        # In practice: send frame and instruction to the cloud and receive
        # features asynchronously, without blocking local_step.
        # Simulated here with a blocking local call; instruction is unused.
        self.vlm_features = self.vision_cache.encode(frame)
        self.last_cloud_time = time.monotonic()

    def should_update_cloud(self):
        # monotonic() is unaffected by system clock adjustments.
        return (time.monotonic() - self.last_cloud_time) > self.cloud_interval


# Demo: Profiling
# Toy MLP stand-ins, not real ViTs: only the hidden widths (384/768/1024) match
# the ViT-Small/Base/Large embedding sizes, so the printed parameter counts and
# latencies are far below those of the real backbones.
print("=== Model Size Comparison (toy stand-ins) ===")
sizes = {
    "ViT-Small width (384)": torch.nn.Sequential(
        torch.nn.Linear(512, 384), torch.nn.ReLU(), torch.nn.Linear(384, 256)),
    "ViT-Base width (768)": torch.nn.Sequential(
        torch.nn.Linear(512, 768), torch.nn.ReLU(), torch.nn.Linear(768, 256)),
    "ViT-Large width (1024)": torch.nn.Sequential(
        torch.nn.Linear(512, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 256)),
}
for name, model in sizes.items():
    params = sum(p.numel() for p in model.parameters())
    profiler = LatencyProfiler(model, device="cpu")
    result = profiler.profile((1, 512), n_warmup=5, n_runs=50)
    print(f"{name}: {params:,} params, {result['mean_ms']:.2f}ms, "
          f"{result['max_hz']:.0f} Hz")

# Vision cache demo
encoder = torch.nn.Linear(512, 256)
cache = VisionCacheManager(encoder, change_threshold=0.01)
frame = torch.randn(1, 512)
_ = cache.encode(frame)
_ = cache.encode(frame + torch.randn(1, 512) * 0.001) # Static → cache hit
_ = cache.encode(frame + torch.randn(1, 512) * 0.1) # Changed → recompute
print(f"\nVision cache hit rate: {cache.cache_hit_rate:.0%}")
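
The hit rate printed by the cache demo translates into an average encoder cost. The sketch below assumes the ~15 ms ViT-Large encoder time quoted earlier and, by default, a free frame comparison, matching the "0ms vs 15ms" simplification in the text. `effective_encoder_ms` is a hypothetical helper.

```python
def effective_encoder_ms(hit_rate, encoder_ms, diff_check_ms=0.0):
    """Expected per-step vision cost when cache hits skip the encoder.

    diff_check_ms is the cost of the frame comparison; it defaults to 0 to
    match the simplification used in the text.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be in [0, 1]")
    return (1.0 - hit_rate) * encoder_ms + diff_check_ms


for hit_rate in (0.0, 1 / 3, 0.5, 0.9):
    ms = effective_encoder_ms(hit_rate, encoder_ms=15.0)
    print(f"hit rate {hit_rate:.0%}: {ms:.1f} ms average, "
          f"{15.0 / ms:.1f}x speedup")
```

This is an average only. On a cache miss the step still pays the full encoder time, so caching lowers mean latency and power draw but does not relax the worst-case deadline.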
Quantization comparison: Take a VLA model. Compare FP32, FP16, INT8, and INT4 quantization. Measure latency and action prediction accuracy on 1000 test samples.
Resolution vs quality: Train the same model at 224×224, 160×160, 128×128, and 96×96 input resolution. Plot success rate vs latency. Find the Pareto-optimal point.
Vision caching: On a real manipulation trajectory, measure what percentage of consecutive frames have <2% pixel difference. What's the effective speedup from caching?
Hybrid architecture: Simulate a hybrid system: VLM updates at 5 Hz, local action head at 50 Hz. Compare with VLA-only at 10 Hz. Which achieves better task success?
Latency and compute are necessary but not sufficient. Tomorrow: deployment safety and monitoring — the systems that prevent your VLA from breaking things, and the dashboards that tell you when something's wrong.