Phase VII - VLAs: Architecture to Deployment | Week 14 | 2.5 hours

"Think in language, act in flow. π₀.5 generates a natural language plan, then executes each step with flow matching actions." (Physical Intelligence, 2025)
π₀.5 separates thinking from acting:
System 2 (slow, deliberate):
    VLM generates a language plan
    "1. Open the drawer 2. Pick up the sponge 3. Place it in the sink"
System 1 (fast, reactive):
    Flow matching generates motor actions for each step
    ΔEE = [0.01, -0.02, 0.03, ...] at 50 Hz
┌─────────────────────────────────────────────────────────┐
│                          π₀.5                           │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Image + Language instruction                           │
│         │                                               │
│         ▼                                               │
│  ┌────────────────────────────┐                         │
│  │  VLM Backbone (PaliGemma)  │                         │
│  │                            │                         │
│  │  Mode 1: PLAN              │                         │
│  │  Output: text plan tokens  │ "open drawer, pick up.."│
│  │                            │                         │
│  │  Mode 2: ACT               │                         │
│  │  Output: flow features     │ → action expert         │
│  └────────────┬───────────────┘                         │
│               │                                         │
│         ┌─────┴──────────────────┐                      │
│         │                        │                      │
│     PLAN mode                ACT mode                   │
│         │                        │                      │
│         ▼                        ▼                      │
│  Language tokens       Flow Matching Expert             │
│  (next sub-task)         (motor commands)               │
│                                                         │
└─────────────────────────────────────────────────────────┘
# Pseudo-code for π₀.5 inference
plan = vlm.generate_plan(image, instruction)
# plan = ["open drawer", "pick up sponge", "place in sink"]
for sub_task in plan:
    while not sub_task_complete(sub_task):
        image = camera.capture()
        action_chunk = flow_expert.sample(
            context=vlm.encode(image, sub_task),
            n_steps=10,
        )
        robot.execute(action_chunk)
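The pseudo-code above leaves `vlm`, `flow_expert`, `camera`, and `robot` undefined. A minimal runnable sketch of the same control flow follows, with hypothetical callbacks standing in for all of them (none of these names are a real π₀.5 API). It also parses a numbered plan string into sub-tasks and adds a per-sub-task chunk budget; the budget is an assumption added here so that a stuck sub-task cannot loop forever.

```python
import re


def parse_plan(plan_text):
    """Split '1. Open the drawer 2. Pick up ...' into a list of sub-task strings."""
    parts = re.split(r"\s*\d+\.\s+", plan_text.strip())
    return [p.strip() for p in parts if p.strip()]


def run_plan(plan_text, act_once, is_done, max_chunks=20):
    """Run each sub-task until is_done() reports success or the budget is spent.

    act_once and is_done are hypothetical callbacks: act_once(sub_task) should
    capture an image, sample one action chunk, and execute it.
    """
    log = []
    for sub_task in parse_plan(plan_text):
        chunks = 0
        while not is_done(sub_task) and chunks < max_chunks:
            act_once(sub_task)
            chunks += 1
        log.append((sub_task, chunks, is_done(sub_task)))
    return log


# Toy stand-ins: every sub-task "completes" after three action chunks.
progress = {}


def fake_act_once(sub_task):
    progress[sub_task] = progress.get(sub_task, 0) + 1


def fake_is_done(sub_task):
    return progress.get(sub_task, 0) >= 3


plan_text = "1. Open the drawer 2. Pick up the sponge 3. Place it in the sink"
log = run_plan(plan_text, fake_act_once, fake_is_done)
for sub_task, chunks, done in log:
    print(f"{sub_task!r}: {chunks} chunks, done={done}")
```

In a real system the completion check is the hard part; here it is reduced to a counter purely to make the loop structure visible.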
π₀.5 trains on three types of data:
| Data Type | What the model learns | Format |
|---|---|---|
| VLM data (web) | Visual reasoning, language | (image, text) pairs |
| Robot + language plans | Task decomposition | (image, instruction, plan, actions) |
| Robot actions only | Motor control | (image, sub-task, actions) |
The key innovation: the language plans used as training targets are produced automatically rather than annotated by hand. A teacher VLM labels robot trajectories with sub-task descriptions, and those labels supervise the plan output.
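One way to read the data table is that each data type supervises a different subset of the model's outputs. The sketch below routes a sample to a language loss, a flow matching loss, or both, depending on which fields it carries. The field names and the routing rule are illustrative assumptions; the source does not say how the losses are combined or weighted.

```python
def active_losses(sample):
    """Return the loss terms a training sample can supervise (hypothetical schema)."""
    losses = set()
    if "text" in sample or "plan" in sample:
        losses.add("language")  # next-token cross-entropy on text or plan tokens
    if "actions" in sample:
        losses.add("flow")      # flow matching loss on the action chunk
    return losses


# One toy sample per row of the table above.
examples = {
    "VLM data (web)": {"image": "web.jpg", "text": "a sponge next to a sink"},
    "Robot + language plans": {
        "image": "cam.png",
        "instruction": "clean up",
        "plan": ["open drawer"],
        "actions": [[0.0] * 7],
    },
    "Robot actions only": {
        "image": "cam.png",
        "sub_task": "open drawer",
        "actions": [[0.0] * 7],
    },
}

for name, sample in examples.items():
    print(name, "->", sorted(active_losses(sample)))
```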
| Task | π₀ | π₀.5 | Improvement |
|---|---|---|---|
| Clean table (5 items) | 70% | 88% | +18 pp |
| Prepare bento box | 45% | 72% | +27 pp |
| Laundry fold + sort | 60% | 82% | +22 pp |
| Novel instruction combo | 20% | 65% | +45 pp |
The largest gains come on compositional and long-horizon tasks: novel instruction combinations gain 45 percentage points and the bento box task gains 27.
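The improvement column is an absolute difference in percentage points. A short calculation on the table's own numbers shows how that differs from relative improvement, which makes the gap on the novel-instruction task even more visible.

```python
# (π₀, π₀.5) percentages copied from the table above.
results = {
    "Clean table (5 items)": (70, 88),
    "Prepare bento box": (45, 72),
    "Laundry fold + sort": (60, 82),
    "Novel instruction combo": (20, 65),
}


def gains(pi0, pi05):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    abs_gain = pi05 - pi0
    rel_gain = 100 * abs_gain / pi0
    return abs_gain, rel_gain


for task, (pi0, pi05) in results.items():
    abs_gain, rel_gain = gains(pi0, pi05)
    print(f"{task:<26} +{abs_gain} pp  ({rel_gain:.0f}% relative)")
```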
import torch
import torch.nn as nn


class PlanActVLA(nn.Module):
    """Hybrid VLA with language planning + flow matching actions."""

    def __init__(self, d_model=512, vocab_size=10000, action_dim=7,
                 chunk_size=16, n_plan_tokens=50):
        super().__init__()
        self.d_model = d_model
        self.vocab_size = vocab_size
        self.n_plan_tokens = n_plan_tokens
        # Shared vision + language encoder
        self.vision_enc = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
            nn.Flatten(), nn.Linear(64*16, d_model),
        )
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Mode embeddings
        self.plan_mode_token = nn.Parameter(torch.randn(d_model))
        self.act_mode_token = nn.Parameter(torch.randn(d_model))
        # Shared transformer backbone
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, dim_feedforward=d_model*4,
            batch_first=True,
        )
        self.backbone = nn.TransformerEncoder(enc_layer, num_layers=6)
        # Plan head (language generation)
        self.plan_head = nn.Linear(d_model, vocab_size)
        # Action head (flow matching)
        self.flow_head = FlowMatchingActionExpert(
            context_dim=d_model, action_dim=action_dim,
            chunk_size=chunk_size,
        )

    def plan(self, image, instruction_tokens):
        """Generate a language plan (autoregressive)."""
        B = image.shape[0]
        vis = self.vision_enc(image).unsqueeze(1)
        text = self.text_embed(instruction_tokens)
        mode = self.plan_mode_token.unsqueeze(0).unsqueeze(0).expand(B, 1, -1)
        generated = []
        context = torch.cat([vis, text, mode], dim=1)
        for _ in range(self.n_plan_tokens):
            T = context.shape[1]
            mask = torch.triu(torch.ones(T, T, device=image.device), diagonal=1).bool()
            out = self.backbone(context, mask=mask)
            logits = self.plan_head(out[:, -1])
            next_token = logits.argmax(dim=-1)
            generated.append(next_token)
            next_emb = self.text_embed(next_token).unsqueeze(1)
            context = torch.cat([context, next_emb], dim=1)
        return torch.stack(generated, dim=1)  # (B, n_plan_tokens)

    def act(self, image, subtask_tokens, actions=None):
        """Generate motor actions for a sub-task."""
        B = image.shape[0]
        vis = self.vision_enc(image).unsqueeze(1)
        text = self.text_embed(subtask_tokens)
        mode = self.act_mode_token.unsqueeze(0).unsqueeze(0).expand(B, 1, -1)
        context = torch.cat([vis, text, mode], dim=1)
        out = self.backbone(context)
        pooled = out.mean(dim=1)
        if actions is not None:
            return self.flow_head.training_loss(pooled, actions)
        else:
            return self.flow_head.sample(pooled)

    def full_pipeline(self, image, instruction_tokens):
        """Plan then act."""
        # Step 1: Generate plan
        plan_tokens = self.plan(image, instruction_tokens)
        # Step 2: Execute first sub-task
        # In practice, you'd parse plan_tokens into sub-tasks
        # Here we use the first chunk of plan tokens as the sub-task
        actions = self.act(image, plan_tokens[:, :10])
        return plan_tokens, actions


# Demo
# Reuse FlowMatchingActionExpert from Day 96
model = PlanActVLA()
img = torch.randn(2, 3, 224, 224)
instr = torch.randint(0, 1000, (2, 15))
actions_gt = torch.randn(2, 16, 7)

# Plan
plan = model.plan(img, instr)
print(f"Plan tokens: {plan.shape}")  # (2, 50)

# Act (training)
loss = model.act(img, instr, actions_gt)
print(f"Action loss: {loss.item():.4f}")

# Act (inference)
pred_actions = model.act(img, instr)
print(f"Predicted actions: {pred_actions.shape}")  # (2, 16, 7)
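The demo above depends on `FlowMatchingActionExpert` from Day 96, which is not reproduced in this lesson. For readers running this page on its own, here is a minimal stand-in exposing the three entry points that `PlanActVLA` calls: the constructor, `training_loss`, and `sample`. Its internals are assumptions made for this sketch and may differ from the Day 96 version: an MLP velocity field over the flattened action chunk, a straight-line path between noise and data, and Euler integration.

```python
import torch
import torch.nn as nn


class FlowMatchingActionExpert(nn.Module):
    """Stand-in for the Day 96 class: an MLP velocity field over action chunks."""

    def __init__(self, context_dim, action_dim, chunk_size, hidden=256):
        super().__init__()
        self.action_dim = action_dim
        self.chunk_size = chunk_size
        flat = action_dim * chunk_size
        # Input: noisy flattened chunk + context vector + scalar time t
        self.net = nn.Sequential(
            nn.Linear(flat + context_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, flat),
        )

    def velocity(self, x_t, t, context):
        return self.net(torch.cat([x_t, context, t], dim=-1))

    def training_loss(self, context, actions):
        B = actions.shape[0]
        x1 = actions.reshape(B, -1)              # data sample
        x0 = torch.randn_like(x1)                # noise sample
        t = torch.rand(B, 1, device=x1.device)   # time in [0, 1)
        x_t = (1 - t) * x0 + t * x1              # point on the straight-line path
        target = x1 - x0                         # constant velocity along that path
        return ((self.velocity(x_t, t, context) - target) ** 2).mean()

    @torch.no_grad()
    def sample(self, context, n_steps=10):
        B = context.shape[0]
        x = torch.randn(B, self.action_dim * self.chunk_size, device=context.device)
        dt = 1.0 / n_steps
        for i in range(n_steps):
            t = torch.full((B, 1), i * dt, device=context.device)
            x = x + dt * self.velocity(x, t, context)  # Euler step from noise toward data
        return x.reshape(B, self.chunk_size, self.action_dim)


# Quick shape check with a small context dimension.
expert = FlowMatchingActionExpert(context_dim=32, action_dim=7, chunk_size=16)
ctx = torch.randn(2, 32)
demo_loss = expert.training_loss(ctx, torch.randn(2, 16, 7))
demo_actions = expert.sample(ctx)
print(demo_loss.item(), demo_actions.shape)
```

Define this class before constructing `PlanActVLA()`, since the constructor looks up the name when it builds the action head.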
Plan quality analysis: Generate plans for 10 different instructions. Rate plan quality on a 1-5 scale. What types of instructions produce the best/worst plans?
Plan vs no-plan ablation: Compare π₀ (direct action) with π₀.5 (plan then act) on a 3-step manipulation task. Does planning help? When does it hurt?
Re-planning frequency: Experiment with re-planning after every sub-task vs every N steps vs only on failure. What's the optimal re-planning strategy?
Plan annotation: Create a dataset of 20 trajectories. Manually annotate sub-task boundaries. Then train a model to auto-annotate. Compare annotation quality.
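For the re-planning exercise, the three strategies can be written as one small predicate so they can be swapped inside the same control loop. The strategy names, the `every_n` default, and the event flags are all hypothetical choices for this sketch, and the simulated episode is made up purely to exercise the function.

```python
def should_replan(strategy, step, subtask_done=False, failed=False, every_n=50):
    """Decide whether to call the planner again at this control step."""
    if strategy == "per_subtask":
        return subtask_done
    if strategy == "every_n":
        return step > 0 and step % every_n == 0
    if strategy == "on_failure":
        return failed
    raise ValueError(f"unknown strategy: {strategy}")


def count_replans(strategy, n_steps=200, subtask_ends=(60, 140), failures=(90,)):
    """Count planner calls over a simulated episode with fixed events."""
    count = 0
    for step in range(n_steps):
        if should_replan(
            strategy,
            step,
            subtask_done=step in subtask_ends,
            failed=step in failures,
        ):
            count += 1
    return count


for strategy in ("per_subtask", "every_n", "on_failure"):
    print(strategy, count_replans(strategy))
```

Counting planner calls is only half of the comparison; the exercise also asks how each strategy affects task success, which has to be measured on the robot or in simulation.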
We've now seen the full spectrum of VLA designs: tokenized actions (RT-2, OpenVLA), diffusion heads (Octo), flow matching (π₀), and hybrid plan+act (π₀.5). Tomorrow we look at GR-2 and GR00T N1, video generation models repurposed for robot control. The question: can predicting future video frames teach a robot to act?