
Day 84: 🛑 Stop & Reflect #5

Phase VI - Robot Learning: RL, Diffusion & Data | Week 12 | 2.5 hours

"From diffusing pixels to diffusing robot actions. Same math, different space. The abstraction layer is what makes intelligence general."


Purpose

This is a reflection day. No new implementation. Consolidate the conceptual leap from generative models for images to generative models for robot actions. Write, think, connect.


Reflection 1: The Generative Action Paradigm (60 min)

The Progression

You've now seen four approaches to generating robot actions:

Approach | Key Idea | Strengths | Weaknesses
BC (Day 78) | Supervised regression | Simple | Mode averaging, compounding error
Decision Transformer (Day 80) | Sequence prediction | Return conditioning | No stitching, limited
ACT (Day 79) | CVAE + chunking | Multimodal, temporal | Training instability (KL)
Diffusion Policy (Day 81) | Denoise actions | Full distribution, expressive | Slow inference
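
The "mode averaging" weakness listed for BC can be reproduced in a few lines. The sketch below is a toy illustration on synthetic data, not code from Day 78: it assumes a single fixed observation where half of the hypothetical demonstrators steer left (-1.0) and half steer right (+1.0). For one fixed input, the prediction that minimizes mean squared error is the sample mean, so it lands between the two modes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical demonstrations at ONE observation: half go left, half go right.
n = 1000
modes = rng.choice([-1.0, 1.0], size=n)
actions = modes + 0.05 * rng.standard_normal(n)  # small demonstrator noise

# For a fixed input, the MSE-optimal prediction is the mean of the targets.
bc_prediction = actions.mean()

# How far is that prediction from the closest action anyone demonstrated?
gap_to_nearest_mode = min(abs(bc_prediction - 1.0), abs(bc_prediction + 1.0))

print(f"BC (MSE) prediction: {bc_prediction:+.3f}")
print(f"Gap to nearest mode: {gap_to_nearest_mode:.3f}")
```

The prediction sits near 0, an action no demonstrator took. A policy that samples from the action distribution (CVAE or diffusion) would instead return a value near one of the two modes.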

Writing Prompt

Write 500+ words answering: "Why is generating robot actions harder than generating images, and what properties of diffusion models make them well-suited for both?"

Consider:

Dimension | Images | Robot Actions
Dimensionality | High (512×512×3) | Low (7-20 DOF)
Temporal structure | Single frame | Sequential, causal
Physical constraints | None (any pixel valid) | Joint limits, collisions
Evaluation | Visual quality (FID) | Task success (binary)
Multimodality | "Draw a cat" → many valid cats | "Pick up mug" → multiple grasps
Safety | Bad image = harmless | Bad action = crash
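
To ground the "same math, different space" idea while writing, it can help to see that the DDPM forward (noising) step does not care what shape its input has. The sketch below assumes the linear beta schedule from Ho et al. (2020); the function names, the action chunk length of 16 steps, and the 7-DOF width are illustrative assumptions, not values taken from earlier days.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Forward noising: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()

# Stand-in data, scaled to [-1, 1]; the contents are random, only the shapes matter.
image = rng.uniform(-1.0, 1.0, size=(512, 512, 3))   # image-sized tensor
action_chunk = rng.uniform(-1.0, 1.0, size=(16, 7))  # assumed: 16 timesteps x 7 DOF

# The same function noises both; a denoiser would be trained to predict eps.
noisy_image, eps_img = q_sample(image, 500, alpha_bar, rng)
noisy_actions, eps_act = q_sample(action_chunk, 500, alpha_bar, rng)

print("image values:", image.size)          # 512 * 512 * 3
print("action values:", action_chunk.size)  # 16 * 7
```

The action chunk has several thousand times fewer values than the image, which is one reason the difficulty of action generation has to come from somewhere other than raw dimensionality (temporal structure, physical constraints, safety).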

Reflection 2: The Tokenization Decision (45 min)

The Fork in the Road

The field has split into two camps:

Camp 1: Tokenize actions → language model generates them
  • RT-2, OpenVLA: discretize actions, predict with cross-entropy
  • Advantage: leverage massive LLM pre-training
  • Risk: discretization loses precision

Camp 2: Keep actions continuous → diffusion/flow head generates them
  • Diffusion Policy, π₀: separate action generation module
  • Advantage: full continuous distribution
  • Risk: more complex architecture

Reflection Questions

  1. If 256 bins give 0.4 mm resolution and robot repeatability is ~1 mm, does discretization actually matter?
  2. Could you use both? (VLM for high-level intent → diffusion head for low-level actions)
  3. What would Andrej Karpathy say? (Hint: Day 22 on tokenization — "the tokenizer is showing")
  4. What would Chelsea Finn say? (Hint: multimodality matters more for manipulation than navigation)
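
For question 1, it may help to check the arithmetic directly. The sketch below assumes uniform binning over a ±5 cm range per action dimension, a range chosen only because 256 bins over 10 cm gives roughly the 0.4 mm figure in the question. The helper names `discretize` and `undiscretize` are hypothetical, not taken from any VLA codebase.

```python
import numpy as np

def discretize(x, low, high, num_bins=256):
    """Map continuous values in [low, high] to integer bin indices 0..num_bins-1."""
    x = np.clip(x, low, high)
    idx = np.floor((x - low) / (high - low) * num_bins).astype(int)
    return np.minimum(idx, num_bins - 1)  # x == high belongs to the last bin

def undiscretize(idx, low, high, num_bins=256):
    """Map bin indices back to the center of each bin."""
    width = (high - low) / num_bins
    return low + (idx + 0.5) * width

low, high = -0.05, 0.05  # assumed action range in meters (+/- 5 cm)
num_bins = 256
bin_width_mm = (high - low) / num_bins * 1000.0

# Round-trip random actions through the tokenizer and measure the worst error.
rng = np.random.default_rng(0)
x = rng.uniform(low, high, size=10_000)
x_hat = undiscretize(discretize(x, low, high, num_bins), low, high, num_bins)
max_error_mm = np.max(np.abs(x - x_hat)) * 1000.0

print(f"bin width: {bin_width_mm:.3f} mm")
print(f"max round-trip error: {max_error_mm:.3f} mm")
```

Under these assumptions the bin width is about 0.39 mm and the worst-case round-trip error is at most half of that, well below ~1 mm repeatability. The picture changes if the action range is larger: the same 256 bins over a 1 m range would give bins of about 3.9 mm.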

Reflection 3: Connecting the Full Thread (45 min)

The Compression = Prediction = Intelligence Thread

Trace this from Day 5 through today:

Day | Concept | Connection
5 | Cross-entropy = compression |
10-14 | Attention = selective compression |
22 | Tokenization = lossless encoding |
25 | Scaling laws = compression efficiency |
74 | DDPM = learning to reverse entropy |
77 | Flow matching = optimal transport |
81 | Diffusion Policy = compressing action distributions |
83 | Action tokenization = discretizing the action signal |

Write: "How does the compression/prediction thread explain why diffusion models work for robot actions? Is a diffusion policy 'compressing' the space of valid actions?"
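
As a starting point for this prompt, the Day 5 claim that cross-entropy measures compression can be checked on tokenized actions. The sketch below is an analogy for the tokenized camp only; a diffusion policy does not expose code lengths this directly. It assumes 256 action bins and a hypothetical task where the correct bin always falls in a ten-bin range, and it compares the average code length (in bits) under a model that knows nothing with one that knows the range.

```python
import numpy as np

def bits_per_token(probs, targets):
    """Average code length in bits: mean of -log2 q(target)."""
    rows = np.arange(len(targets))
    return float(np.mean(-np.log2(probs[rows, targets])))

num_bins = 256
rng = np.random.default_rng(0)

# Hypothetical data: the true action bin always lies in [100, 110).
targets = rng.integers(100, 110, size=1000)

# Model A: no knowledge of the task, uniform over all 256 bins.
uniform = np.full((len(targets), num_bins), 1.0 / num_bins)

# Model B: knows the valid range, uniform over those ten bins only.
peaked = np.zeros((len(targets), num_bins))
peaked[:, 100:110] = 0.1

bits_uniform = bits_per_token(uniform, targets)
bits_peaked = bits_per_token(peaked, targets)

print(f"uniform model: {bits_uniform:.2f} bits per action token")
print(f"peaked model:  {bits_peaked:.2f} bits per action token")
```

The uniform model pays log2(256) = 8 bits per token, while the model that has learned where valid actions live pays log2(10), about 3.32 bits. Lower cross-entropy means the model has captured more of the structure of the valid action set.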


Checkpoint Questions

Before proceeding to Week 13, verify you can answer:

  1. What is the DDPM training objective? Write the loss function and explain each term.
  2. How does DDIM speed up sampling? What's the key mathematical change?
  3. What is classifier-free guidance? Write the guided noise prediction formula.
  4. How does flow matching differ from diffusion? Name three advantages.
  5. Why does BC fail on multimodal tasks? Give a concrete example.
  6. How does ACT handle multimodality? What role does the CVAE play?
  7. What's the Decision Transformer's "prompt"? How does return conditioning work?
  8. How does RT-2 tokenize actions? What's the resolution with 256 bins?
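
For checking your answers to questions 1 and 3, the commonly cited forms are given below: the simplified DDPM objective from Ho et al. (2020) and one common parameterization of classifier-free guidance. The notation here (guidance weight w, null condition ∅) is an assumption; the earlier days may have used different symbols, or the equivalent form with weight (1 + w).

```latex
% Simplified DDPM objective (Ho et al., 2020):
% sample a timestep t, a clean sample x_0, and noise eps, then regress the noise.
L_{\text{simple}}(\theta)
  = \mathbb{E}_{t,\; x_0,\; \epsilon \sim \mathcal{N}(0, I)}
    \left[ \left\| \epsilon
      - \epsilon_\theta\!\left( \sqrt{\bar\alpha_t}\, x_0
        + \sqrt{1 - \bar\alpha_t}\, \epsilon,\; t \right) \right\|^2 \right]

% Classifier-free guidance: extrapolate from the unconditional prediction
% toward the conditional one. w = 1 recovers the plain conditional model.
\tilde\epsilon_\theta(x_t, c)
  = \epsilon_\theta(x_t, \varnothing)
    + w \left( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \right)
```

Here x_0 is a clean sample (an image or an action chunk), ᾱ_t is the cumulative product of (1 − β_t) up to step t, and c is the conditioning signal.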

Key Takeaways

  1. Same generative math, different space — diffusion/flow matching transfer from images to actions
  2. The tokenization vs continuous debate will shape VLA architectures for years
  3. Multimodality is the key challenge — averaging modes kills manipulation policies
  4. Compression runs through everything — from cross-entropy loss to diffusion to action tokenization

Connection to the Thread

Phase VI gave you the tools: RL foundations, diffusion models, flow matching, imitation learning, action representations, and tokenization. Next week: the practical realities of data collection, policy evaluation, and debugging — then the Phase VI capstone. After that, Phase VII applies everything to build actual VLAs.


Further Reading

  • Re-read: Ho et al. (2020), DDPM — now with deeper understanding
  • Re-read: Chi et al. (2023), Diffusion Policy — focus on action space analysis
  • Preview: Brohan et al. (2023), RT-2 — see how tokenization enables VLAs