Phase VII - VLAs: Architecture to Deployment | Week 16 | 3 hours
"The end of the curriculum is the beginning of the craft. Evaluate what you've built, reflect on what you've learned, and chart the road ahead."
Run your Day 111 VLA through these evaluation suites:
import torch
import numpy as np
from collections import defaultdict


class VLAEvaluator:
    """Comprehensive VLA evaluation suite."""

    def __init__(self, model, config):
        self.model = model
        self.model.eval()
        self.config = config
        self.results = defaultdict(list)

    @torch.no_grad()
    def evaluate_action_quality(self, test_data, n_samples=500):
        """Offline action prediction quality."""
        errors = []
        for i, batch in enumerate(test_data):
            # n_samples caps the number of batches drawn from test_data
            if i >= n_samples:
                break
            # With target actions supplied, the model is assumed to
            # return a (loss, prediction) pair
            _, pred = self.model(
                batch["image"], batch["language"],
                batch["proprio"], batch["actions"]
            )
            error = (pred - batch["actions"]).abs()
            errors.append(error.mean().item())
        return {
            "mean_error": np.mean(errors),
            "std_error": np.std(errors),
            "p90_error": np.percentile(errors, 90),
            # Only meaningful if actions are expressed in meters
            "within_5mm": np.mean([e < 0.005 for e in errors]),
        }

    @torch.no_grad()
    def evaluate_language_grounding(self, grounding_data):
        """Test language understanding."""
        correct = 0
        total = 0
        for item in grounding_data:
            # Test: given instruction A, does the model produce
            # different actions than instruction B?
            action_a = self.model(
                item["image"], item["instruction_a"], item["proprio"])
            action_b = self.model(
                item["image"], item["instruction_b"], item["proprio"])
            # Different instructions should produce different actions
            diff = (action_a - action_b).abs().mean().item()
            if diff > 0.1:
                correct += 1
            total += 1
        return {"grounding_accuracy": correct / max(total, 1)}

    @torch.no_grad()
    def evaluate_robustness(self, test_image, language, proprio):
        """Test robustness to input perturbations."""
        # Clamp the reference image as well, so the measured deviation
        # reflects the perturbation and not the clamping applied below
        test_image = torch.clamp(test_image, 0, 1)
        base_action = self.model(test_image, language, proprio)
        perturbations = {
            "brightness_+20%": test_image * 1.2,
            "brightness_-20%": test_image * 0.8,
            "noise_σ=0.05": test_image + torch.randn_like(test_image) * 0.05,
            # Zero out a 3-pixel border (about 9% of each side at 64x64)
            "crop_10%": F.pad(test_image[:, :, 3:-3, 3:-3],
                              (3, 3, 3, 3)),
        }
        results = {}
        for name, perturbed in perturbations.items():
            perturbed = torch.clamp(perturbed, 0, 1)
            action = self.model(perturbed, language, proprio)
            diff = (action - base_action).abs().mean().item()
            results[name] = {
                "action_deviation": diff,
                "stable": diff < 0.2,
            }
        return results
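
    @torch.no_grad()
    def evaluate_latency(self, n_runs=100):
        """Measure inference latency."""
        import time
        # Create the dummy inputs on the same device as the model
        device = next(self.model.parameters()).device
        image = torch.randn(1, 3, 64, 64, device=device)
        language = torch.randn(1, 128, device=device)
        proprio = torch.randn(1, 7, device=device)

        def sync():
            # CUDA kernels run asynchronously, so wait for them to
            # finish before reading the clock
            if device.type == "cuda":
                torch.cuda.synchronize()

        # Warmup
        for _ in range(10):
            self.model(image, language, proprio)
        sync()
        latencies = []
        for _ in range(n_runs):
            start = time.perf_counter()
            self.model(image, language, proprio)
            sync()
            end = time.perf_counter()
            latencies.append((end - start) * 1000)
        return {
            "mean_ms": np.mean(latencies),
            "p50_ms": np.percentile(latencies, 50),
            "p95_ms": np.percentile(latencies, 95),
            # Inference rate sustainable at the P95 latency
            "max_hz": 1000 / np.percentile(latencies, 95),
        }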

    def full_report(self, test_data=None):
        """Generate complete evaluation report."""
        report = {}
        device = next(self.model.parameters()).device
        # Latency
        report["latency"] = self.evaluate_latency()
        print(f"\n{'='*60}")
        print(" FINAL VLA EVALUATION REPORT")
        print(f"{'='*60}")
        print("\n LATENCY")
        print(f" Mean: {report['latency']['mean_ms']:.1f} ms")
        print(f" P95: {report['latency']['p95_ms']:.1f} ms")
        print(f" Max Hz: {report['latency']['max_hz']:.0f}")
        # Offline action quality, only when test data is supplied
        if test_data is not None:
            report["action_quality"] = self.evaluate_action_quality(test_data)
            print("\n ACTION QUALITY")
            print(f" Mean error: {report['action_quality']['mean_error']:.4f}")
        # Robustness: the image is sampled in [0, 1], the range that
        # evaluate_robustness clamps to
        img = torch.rand(1, 3, 64, 64, device=device)
        lang = torch.randn(1, 128, device=device)
        prop = torch.randn(1, 7, device=device)
        report["robustness"] = self.evaluate_robustness(img, lang, prop)
        print("\n ROBUSTNESS")
        for name, r in report["robustness"].items():
            status = "✓" if r["stable"] else "✗"
            print(f" {status} {name}: deviation={r['action_deviation']:.4f}")
        # Model stats
        n_params = sum(p.numel() for p in self.model.parameters())
        n_trainable = sum(p.numel() for p in self.model.parameters()
                          if p.requires_grad)
        report["model"] = {"n_params": n_params, "n_trainable": n_trainable}
        print("\n MODEL")
        print(f" Total params: {n_params:,}")
        print(f" Trainable: {n_trainable:,}")
        print(f" Size (FP32): {n_params * 4 / 1e6:.1f} MB")
        print(f"\n{'='*60}")
        return report


# Run evaluation
# F is used by evaluate_robustness, so this import must run before
# the evaluator is called
import torch.nn.functional as F

config = {
    "d_model": 256, "lang_dim": 128, "proprio_dim": 7,
    "action_dim": 7, "chunk_size": 8, "action_type": "flow",
}
# Import from Day 111 (or recreate)
# model = KitchenVLA(config)
# evaluator = VLAEvaluator(model, config)
# report = evaluator.full_report()
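
The `max_hz` figure in the latency report is an inference rate, which is not necessarily the rate at which the robot receives actions. With a `chunk_size` of 8, one forward pass yields several actions, so the action rate can be higher if the whole chunk is executed before the next inference. The sketch below works through that arithmetic. The helper name `control_rates` and the 40 ms latency are assumptions chosen for illustration, not measurements from the evaluator.

```python
def control_rates(p95_latency_ms, chunk_size):
    """Convert a P95 inference latency into inference and action rates.

    Assumes every action in a chunk is executed open-loop before the
    next forward pass, which is an illustrative simplification.
    """
    if p95_latency_ms <= 0:
        raise ValueError("latency must be positive")
    inference_hz = 1000.0 / p95_latency_ms
    action_hz = inference_hz * chunk_size
    return inference_hz, action_hz


# Hypothetical example: 40 ms P95 latency, chunk_size=8 as in the config
inference_hz, action_hz = control_rates(40.0, 8)
print(f"Inference rate: {inference_hz:.0f} Hz")
print(f"Action rate:    {action_hz:.0f} Hz")
```

Executing a full chunk without re-planning trades reactivity for throughput, so the action rate here is best read as an upper bound.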
Compare your design with the VLAs from Weeks 14-15:
| Dimension | Your VLA | RT-2 | Octo | OpenVLA | π₀ |
|---|---|---|---|---|---|
| Parameters | ___ | 55B | 93M | 7B | 3B |
| Action type | ___ | Token | Diffusion | Token | Flow |
| Control freq | ___ | 3 Hz | 10 Hz | 5 Hz | 50 Hz |
| Training data | ___ | 130K+web | 800K | 970K | ~1M |
| Novel objects | ___ | ✓✓✓ | ✓ | ✓✓ | ✓✓ |
| Dexterity | ___ | ✓ | ✓ | ✓ | ✓✓✓ |
| Deploy cost | ___ | $$$ | $ | $$ | $$ |
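
Parameter count translates directly into weight memory, which is one contributor to deploy cost. The sketch below extends the `n_params * 4 / 1e6` calculation from the report to lower precisions. The helper name `weight_memory_gb` is hypothetical, and the estimate covers weights only: activations, caches, and quantization overhead are ignored.

```python
# Bytes needed to store one parameter in each numeric format
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

# Parameter counts for the four reference models in the comparison
MODELS = {"RT-2": 55e9, "Octo": 93e6, "OpenVLA": 7e9, "π₀": 3e9}


def weight_memory_gb(n_params, precision):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


for name, n_params in MODELS.items():
    sizes = ", ".join(
        f"{p}: {weight_memory_gb(n_params, p):.2f} GB" for p in BYTES_PER_PARAM
    )
    print(f"{name:8s} {sizes}")
```

Plugging your own parameter count into `weight_memory_gb` gives a first estimate of whether the model fits on the target hardware.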
Based on your evaluation results, document:
IF I REDESIGNED THIS VLA:
1. Architecture change:
Current: _______________
Would change to: ________
Because: ________________
2. Training change:
Current: _______________
Would change to: ________
Because: ________________
3. Deployment change:
Current: _______________
Would change to: ________
Because: ________________
Key lesson learned from implementation:
_________________________________
Answer these without referring to notes. They span the entire curriculum.
Q1 (LLM Foundations): Explain the attention mechanism in one equation and state its computational complexity with respect to sequence length $n$.
Q2 (Training): What is the difference between pre-training, fine-tuning, and RLHF? Give the objective function for each.
Q3 (Vision): How does a Vision Transformer (ViT) convert a 224×224 image into a sequence of tokens? How many tokens result with patch size 16?
Q4 (VLMs): Explain how LLaVA connects a vision encoder to an LLM. What is the projection layer's role?
Q5 (Diffusion): Write the forward process equation for DDPM. Why is the reverse process tractable when conditioned on $x_0$?
Q6 (Flow Matching): State the flow matching objective. Why is it preferred over diffusion for action generation?
Q7 (Action Chunking): Why do VLAs predict chunks of actions instead of single actions? Name 2 benefits and 1 risk.
Q8 (VLA Design): You need a VLA for a bimanual robot (14 DOF) doing dexterous assembly. Which architecture from the survey would you choose and why?
Q9 (Sim-to-Real): Explain domain randomization. Why does it work? When does it fail?
Q10 (Deployment): A fleet of 50 robots runs your VLA. Model v2 achieves 89% success vs v1's 85% in A/B testing with 200 episodes per group. Is this statistically significant? How would you verify?
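
Once you have answered Q10 on your own, one way to check the arithmetic is a two-proportion z-test. The sketch below uses only the standard library. The function name `two_proportion_z_test` is an illustrative choice, and the pooled normal approximation is one reasonable test among several (Fisher's exact test or a bootstrap would also be defensible).

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference between two success rates."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled success rate under the null hypothesis of no difference
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Q10 numbers: v2 succeeds in 89% of 200 episodes, v1 in 85% of 200
z, p_value = two_proportion_z_test(178, 200, 170, 200)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

The test assumes independent episodes. Episodes that share a robot or a scene may be correlated, which would make the result look more certain than it is.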
Week 1-2: LLM foundations (attention, transformers, training)
Week 3-4: Advanced LLMs (RLHF, reasoning, efficiency)
Week 5-6: Vision + multimodal (ViT, CLIP, VLMs)
Week 7-8: Diffusion + generation (DDPM, flow matching)
Week 9-10: Policy learning (BC, ACT, diffusion policy)
Week 11-12: Action representations (tokens, chunks, flow)
Week 13: Capstone — integration checkpoint
Week 14-15: VLA survey (RT-1/2, Octo, OpenVLA, π₀, GR-2)
Week 16: Deployment (compute, safety, fleet, capstone)
The thread that connects everything:
Language understanding (Weeks 1-4)
→ Visual perception (Weeks 5-6)
→ Generation & dynamics (Weeks 7-8)
→ Action & control (Weeks 9-12)
→ Integrated VLAs (Weeks 13-15)
→ Real-world deployment (Week 16)
Write 500+ words: "My understanding of Vision-Language-Action Models: what they are, how they work, and where they're going."
Structure:
1. What is a VLA? (2-3 sentences)
2. The key technical insight (why LLMs → robots works)
3. The three critical components (vision, language, action)
4. The training recipe that makes it work
5. What's still hard (honest assessment)
6. Your prediction for 2026-2028
7. What you'll build next
╔═══════════════════════════════════════════════╗
║ ║
║ LLM-to-VLA Curriculum: 112 Days Complete ║
║ ║
║ 16 weeks of systematic study ║
║ From attention mechanisms to fleet deploy ║
║ From self-attention to self-driving robots ║
║ ║
║ You now understand: ║
║ • How LLMs process language ║
║ • How VLMs understand images ║
║ • How diffusion/flow models generate ║
║ • How VLAs control robots ║
║ • How to deploy VLAs at scale ║
║ ║
║ The field is moving fast. ║
║ You have the foundation to move with it. ║
║ ║
╚═══════════════════════════════════════════════╝