Phase VI | Robot Learning: RL, Diffusion & Data | Week 13 | 3 hours

"The tools are built. The pipeline works. Now consolidate before we apply everything to VLAs."
Consolidate your capstone work into a single clean document:
YOUR ROBOT LEARNING PIPELINE
=============================
Task: _______________________
Environment: ________________
Data:
- Episodes collected: ___
- Success rate in demos: ___%
- Total transitions: ___
- Augmentations used: ___
Models trained:
1. Baseline BC: SR = ___% [CI: ___, ___]
2. Advanced (___): SR = ___% [CI: ___, ___]
3. Expert upper bound: SR = ___%
Key ablation findings:
- Data quantity: ___
- Chunk size: ___
- Augmentation impact: ___
Failure analysis:
- Dominant failure mode: ___
- Fix applied: ___
- Improvement: ___ → ___%
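The success-rate lines in the template ask for a confidence interval. One common choice for a binomial success rate is the Wilson score interval. The sketch below assumes independent evaluation episodes and a 95% level (z = 1.96); the function name `wilson_interval` is illustrative and not something defined earlier in the course.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial success rate.

    Assumes each evaluation episode is an independent trial.
    z = 1.96 corresponds to a two-sided 95% interval.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie in [0, trials]")

    p_hat = successes / trials
    denom = 1.0 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    # Clamp to [0, 1] to guard against floating-point spill at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Hypothetical numbers: 40 successes out of 50 evaluation rollouts.
    low, high = wilson_interval(40, 50)
    print(f"SR = {40 / 50:.0%} [CI: {low:.1%}, {high:.1%}]")
```

Unlike the plain normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly when the success rate is near 0% or 100%, which is common with small rollout counts.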
Draw the complete architecture of your best policy:
Input → [Obs Encoder] → [Policy Network] → [Action Decoder] → Output
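For each stage, note your choice:

| Stage | Question | Options |
|---|---|---|
| Obs Encoder | What type? | MLP / CNN / ViT |
| Policy Network | What type? | BC / GMM / Diffusion |
| Action Decoder | What type? | chunk / token |
| Output | How executed? | direct / IK |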
You've now built three stacks of knowledge:
Stack 1: Generative Models (Weeks 11-12)
├── RL foundations (MDP, policy gradient, PPO)
├── DDPM, DDIM, classifier-free guidance
├── Latent diffusion, flow matching
└── Connection: same math generates images AND robot actions
Stack 2: Imitation Learning (Week 12)
├── BC → DAgger → ACT → Decision Transformer → Diffusion Policy
├── Action representations (joint/EE, absolute/delta, rotation)
├── Action tokenization (uniform bins, VQ-VAE)
└── Connection: transformers can predict actions like tokens
Stack 3: Data & Evaluation (Week 13)
├── Data collection, quality, mixing
├── Policy evaluation with statistical rigor
├── Systematic debugging
└── Connection: data quality > model architecture
Write 500+ words: "What is the single most important lesson from Phase VI, and how does it change your understanding of what VLAs will need to succeed?"
Consider:
- Why data matters more than architecture
- Why multimodality is the core challenge
- How diffusion/flow models solve the right problem
- What evaluation rigor means for VLA deployment
Answer each question in 3-5 sentences with mathematical detail:
Q1. DDPM Training Objective: Write the DDPM loss function. Explain what $\epsilon_\theta$, $\alpha_t$, and $\bar{\alpha}_t$ represent. Why does predicting noise work?
Q2. Diffusion Policy vs BC: You have a task where the robot can grasp a mug from the left or right. Explain with a diagram why BC fails and Diffusion Policy succeeds.
Q3. Behavioral Cloning vs DAgger: BC suffers from compounding errors. Explain the mechanism (with the $T^2$ error bound) and how DAgger fixes it.
Q4. Action Tokenization: RT-2 uses 256 bins for action discretization. For a robot arm with $\Delta x \in [-5\text{cm}, 5\text{cm}]$, what is the resolution per bin? Is this sufficient for manipulation?
Q5. PPO and RLHF: Write the PPO clipped objective. Explain why clipping prevents catastrophic policy updates. How does RLHF use PPO?
Q6. Flow Matching vs Diffusion: Name three advantages of flow matching over DDPM. What does "straight paths in probability space" mean geometrically?
Q7. Data Quality: You have 1000 demonstrations but your policy only achieves a 40% success rate. List 5 potential data quality issues and how to diagnose each.
Q8. Policy Debugging: Your policy reaches for the correct object but overshoots by 2cm every time. Categorize this failure, hypothesize the root cause, and propose a fix.
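Once you have attempted Q4, a small script can check the arithmetic. This is a minimal sketch of uniform binning, assuming equal-width bins over the stated range and decoding to bin centers; the exact convention in RT-2 may differ, and the helper names (`bin_resolution`, `tokenize`, `detokenize`) are hypothetical.

```python
def bin_resolution(low: float, high: float, n_bins: int) -> float:
    """Width of one bin when [low, high] is split into n_bins equal bins."""
    return (high - low) / n_bins


def tokenize(value: float, low: float, high: float, n_bins: int) -> int:
    """Map a continuous action to a bin index in [0, n_bins - 1]."""
    clipped = min(max(value, low), high)
    idx = int((clipped - low) / (high - low) * n_bins)
    # The upper edge (value == high) would land in bin n_bins; fold it back.
    return min(idx, n_bins - 1)


def detokenize(idx: int, low: float, high: float, n_bins: int) -> float:
    """Map a bin index back to the center of its bin (an assumed convention)."""
    width = (high - low) / n_bins
    return low + (idx + 0.5) * width


if __name__ == "__main__":
    low_cm, high_cm, n_bins = -5.0, 5.0, 256
    width_cm = bin_resolution(low_cm, high_cm, n_bins)
    print(f"bin width: {width_cm:.5f} cm ({width_cm * 10:.3f} mm)")

    # Round-trip a sample action to see the quantization error.
    action_cm = 1.234
    token = tokenize(action_cm, low_cm, high_cm, n_bins)
    recovered = detokenize(token, low_cm, high_cm, n_bins)
    print(f"token={token}, recovered={recovered:.5f} cm, "
          f"error={abs(recovered - action_cm):.5f} cm")
```

With center decoding, the worst-case round-trip error is half a bin width, which is the number to compare against the precision your task actually needs.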
| Score | Meaning |
|---|---|
| 8/8 | Ready for Phase VII |
| 6-7/8 | Review weak areas, then proceed |
| 4-5/8 | Re-study relevant days before continuing |
| <4/8 | Repeat Week 12 exercises |
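When scoring Q1, Q3, and Q5, it may help to compare your answers against the standard textbook forms below. The notation is an assumption and may differ from your notes from earlier days: $\beta_t$ is the noise schedule, $\delta$ is the per-step imitation error, $\hat{A}_t$ is the advantage estimate, and $\varepsilon$ is the PPO clip range (distinct from the diffusion noise $\epsilon$).

DDPM simplified loss (Q1):

$$
L_{\text{simple}}(\theta) = \mathbb{E}_{t,\, x_0,\, \epsilon \sim \mathcal{N}(0, I)} \left[ \left\| \epsilon - \epsilon_\theta\!\left( \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\; t \right) \right\|^2 \right],
\qquad \alpha_t = 1 - \beta_t, \quad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s
$$

Compounding error over a horizon of $T$ steps (Q3), stated in order notation:

$$
J(\hat{\pi}_{\text{BC}}) - J(\pi^*) = O(\delta\, T^2),
\qquad
J(\hat{\pi}_{\text{DAgger}}) - J(\pi^*) = O(\delta\, T)
$$

PPO clipped objective (Q5):

$$
L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t \left[ \min\!\left( r_t(\theta)\, \hat{A}_t,\; \operatorname{clip}\!\left( r_t(\theta),\, 1 - \varepsilon,\, 1 + \varepsilon \right) \hat{A}_t \right) \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
$$

A full-credit answer should also explain each symbol in words, not only reproduce the formula.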
| Aspect | Phase VI | Phase VII |
|---|---|---|
| Focus | Building blocks | Complete systems |
| Scale | Single-task | Multi-task, multi-embodiment |
| Architecture | Policy networks | VLMs + action heads |
| Data | Hundreds of demos | Millions of episodes |
| Evaluation | Simulation | Real-world deployment |
Everything. Phase VII VLAs are built from Phase VI components:
- RT-1 = FiLM-conditioned EfficientNet encoder + Transformer + action tokenization (Days 22, 83)
- RT-2 = VLM + action tokens (Days 36-40, 83)
- Diffusion Policy = denoising action head (Day 81)
- π₀ = VLM + flow matching (Days 77, 81)
- OpenVLA = open-source VLM + action tokens (Days 36-40, 83)
Phase VII begins tomorrow with RT-1, the first Robotics Transformer. It is a 35M-parameter model that takes images and language instructions as input and outputs tokenized actions. Simple, effective, and the foundation everything else builds on. The transition from "robot learning components" to "complete VLA systems" starts now.