Day 97: π₀.5 — Hybrid VLA with Language Planning

Phase VII - VLAs: Architecture to Deployment | Week 14 | 2.5 hours

"Think in language, act in flow. π₀.5 generates a natural language plan, then executes each step with flow matching actions." (Physical Intelligence, 2025)


Theory (45 min)

97.1 The Two-System Architecture

π₀.5 separates thinking from acting:

System 2 (slow, deliberate):
  VLM generates a language plan
  "1. Open the drawer  2. Pick up the sponge  3. Place it in the sink"

System 1 (fast, reactive):
  Flow matching generates motor actions for each step
  ΔEE = [0.01, -0.02, 0.03, ...] at 50 Hz
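
To make the rate gap concrete, here is a small back-of-the-envelope calculation. It assumes the action expert emits chunks of 16 actions (the default `chunk_size` in this lesson's implementation) and that each chunk is executed in full before the model is queried again. Both are illustrative assumptions, not figures from the π₀.5 paper.

```python
# Back-of-the-envelope timing for chunked control.
CONTROL_HZ = 50   # control rate quoted in the text
CHUNK_SIZE = 16   # assumption: matches the chunk_size default in this lesson's code

def chunk_duration_s(chunk_size: int, control_hz: float) -> float:
    """Seconds of robot motion covered by one action chunk."""
    return chunk_size / control_hz

def required_query_hz(chunk_size: int, control_hz: float) -> float:
    """Minimum rate at which chunks must be produced if chunks do not overlap."""
    return control_hz / chunk_size

duration = chunk_duration_s(CHUNK_SIZE, CONTROL_HZ)
query_hz = required_query_hz(CHUNK_SIZE, CONTROL_HZ)
print(f"one chunk covers {duration:.2f} s; expert must run at >= {query_hz:.3f} Hz")
```

Under these assumptions the action expert only needs to run a few times per second, and the planner far less often, since a sub-task typically spans many chunks.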

97.2 Architecture

┌─────────────────────────────────────────────────────────┐
│                        π₀.5                              │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  Image + Language instruction                            │
│       │                                                  │
│       ▼                                                  │
│  ┌────────────────────────────┐                          │
│  │  VLM Backbone (PaliGemma)  │                          │
│  │                            │                          │
│  │  Mode 1: PLAN              │                          │
│  │  Output: text plan tokens  │ "open drawer, pick up.." │
│  │                            │                          │
│  │  Mode 2: ACT               │                          │
│  │  Output: flow features     │ → action expert          │
│  └────────────┬───────────────┘                          │
│               │                                          │
│       ┌───────┴──────────┐                               │
│       │                  │                               │
│   PLAN mode          ACT mode                            │
│       │                  │                               │
│       ▼                  ▼                               │
│  Language tokens    Flow Matching Expert                  │
│  (next sub-task)    (motor commands)                     │
│                                                          │
└─────────────────────────────────────────────────────────┘

97.3 Plan → Act Loop

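# Pseudo-code for π₀.5 inference
image = camera.capture()
plan = vlm.generate_plan(image, instruction)
# plan = ["open drawer", "pick up sponge", "place in sink"]

for sub_task in plan:
    # sub_task_complete is a placeholder for a success check on the latest image
    while not sub_task_complete(image, sub_task):
        action_chunk = flow_expert.sample(
            context=vlm.encode(image, sub_task),
            n_steps=10,  # number of integration steps used when sampling
        )
        robot.execute(action_chunk)
        image = camera.capture()  # re-observe after executing the chunk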

97.4 Training Strategy

π₀.5 trains on three types of data:

Data Type	What the model learns	Format
VLM data (web)	Visual reasoning, language	(image, text) pairs
Robot + language plans	Task decomposition	(image, instruction, plan, actions)
Robot actions only	Motor control	(image, sub-task, actions)

The key innovation: the language plans used for training are not human-annotated. Instead, a teacher VLM labels robot trajectories with sub-task descriptions, and the model is trained to produce plans from those labels.
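
One way to picture how the three data types feed a single model is to route each batch to the losses its labels support. The sketch below is an illustration only: the `batch_loss` function, the dictionary keys, and the equal weighting of the two losses are hypothetical choices, not the published training recipe. It uses next-token cross-entropy for text or plan targets and a mean-squared error on a predicted flow velocity for action targets.

```python
import torch
import torch.nn.functional as F

def batch_loss(batch: dict) -> torch.Tensor:
    """Sum whichever losses this batch has labels for (hypothetical routing)."""
    loss = torch.zeros(())
    if "text_logits" in batch:
        # Web VLM data and plan-labeled robot data: next-token cross-entropy.
        logits = batch["text_logits"]      # (B, T, vocab)
        targets = batch["text_targets"]    # (B, T)
        loss = loss + F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    if "pred_velocity" in batch:
        # Robot data with actions: regress the flow-matching velocity target.
        loss = loss + F.mse_loss(batch["pred_velocity"], batch["target_velocity"])
    return loss

torch.manual_seed(0)
B, T, V, H, A = 2, 5, 11, 16, 7
web = {"text_logits": torch.randn(B, T, V),
       "text_targets": torch.randint(0, V, (B, T))}
robot_only = {"pred_velocity": torch.randn(B, H, A),
              "target_velocity": torch.randn(B, H, A)}
robot_plan = {**web, **robot_only}   # has both plan text and actions

for name, batch in [("web", web), ("robot+plan", robot_plan), ("actions only", robot_only)]:
    print(name, float(batch_loss(batch)))
```

A batch with both plan text and actions contributes to both terms, while web data and action-only data each touch a single term.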

97.5 Benefits of Language Planning

Interpretability: humans can read and verify the plan
Compositionality: novel task combinations without retraining
Error recovery: if a sub-task fails, re-plan from current state
Long-horizon: chains of sub-tasks handle multi-minute tasks
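
The error-recovery and long-horizon points can be sketched as a control loop that gives each sub-task a step budget and asks the planner for a fresh plan when the budget runs out. Everything here is a toy stand-in: `ToyWorld`, `toy_planner`, and the `max_chunks` budget are hypothetical names invented for illustration. A real system would replace them with the VLM planner, the action expert, and a learned or scripted success check.

```python
from dataclasses import dataclass, field

@dataclass
class ToyWorld:
    """Toy environment: each sub-task needs a fixed number of action chunks."""
    remaining: dict                               # sub-task -> chunks still needed
    done: list = field(default_factory=list)      # completed sub-tasks, in order

    def execute_chunk(self, sub_task: str) -> None:
        self.remaining[sub_task] -= 1
        if self.remaining[sub_task] <= 0 and sub_task not in self.done:
            self.done.append(sub_task)

    def is_complete(self, sub_task: str) -> bool:
        return sub_task in self.done

def toy_planner(world: ToyWorld, goal: list) -> list:
    """Re-plan from the current state: keep only sub-tasks not yet done."""
    return [s for s in goal if not world.is_complete(s)]

def run(world: ToyWorld, goal: list, max_chunks: int = 3, max_replans: int = 5) -> int:
    """Execute the plan, re-planning whenever a sub-task exceeds its budget."""
    replans = 0
    plan = toy_planner(world, goal)
    while plan and replans <= max_replans:
        sub_task = plan[0]
        for _ in range(max_chunks):
            if world.is_complete(sub_task):
                break
            world.execute_chunk(sub_task)
        if world.is_complete(sub_task):
            plan = plan[1:]                       # move on to the next sub-task
        else:
            replans += 1                          # budget exhausted: re-plan
            plan = toy_planner(world, goal)
    return replans

goal = ["open drawer", "pick up sponge", "place in sink"]
world = ToyWorld(remaining={"open drawer": 1, "pick up sponge": 5, "place in sink": 2})
n_replans = run(world, goal)
print(world.done, n_replans)
```

In this toy run the second sub-task overruns its budget once, which triggers a single re-plan before the remaining steps finish.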

97.6 Results

Task	π₀	π₀.5	Improvement
Clean table (5 items)	70%	88%	+18 pp
Prepare bento box	45%	72%	+27 pp
Laundry fold + sort	60%	82%	+22 pp
Novel instruction combo	20%	65%	+45 pp

Biggest gains on long-horizon and compositional tasks.
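
The improvement column is a difference in success rate, which can hide how large a gain is relative to the baseline. The snippet below recomputes both views from the table above; the function and variable names are illustrative.

```python
# Success rates (%) copied from the results table: (pi0, pi0.5)
results = {
    "Clean table (5 items)": (70, 88),
    "Prepare bento box": (45, 72),
    "Laundry fold + sort": (60, 82),
    "Novel instruction combo": (20, 65),
}

def gains(baseline: float, new: float) -> tuple:
    """Return (absolute gain in percentage points, new rate as a multiple of baseline)."""
    return new - baseline, new / baseline

summary = {task: gains(*rates) for task, rates in results.items()}
for task, (abs_pp, ratio) in summary.items():
    print(f"{task:26s} +{abs_pp:.0f} pp  ({ratio:.2f}x baseline)")

largest = max(summary, key=lambda t: summary[t][0])
print("largest absolute gain:", largest)
```

The novel-instruction row is the largest on both measures: 45 percentage points, or 3.25 times the baseline success rate.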

Implementation (60 min)

Plan-Act Architecture

import torch
import torch.nn as nn

class PlanActVLA(nn.Module):
    """Hybrid VLA with language planning + flow matching actions."""

    def __init__(self, d_model=512, vocab_size=10000, action_dim=7,
                 chunk_size=16, n_plan_tokens=50):
        super().__init__()
        self.d_model = d_model
        self.vocab_size = vocab_size
        self.n_plan_tokens = n_plan_tokens

        # Shared vision + language encoder
        self.vision_enc = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
            nn.Flatten(), nn.Linear(64*16, d_model),
        )
        self.text_embed = nn.Embedding(vocab_size, d_model)

        # Mode embeddings
        self.plan_mode_token = nn.Parameter(torch.randn(d_model))
        self.act_mode_token = nn.Parameter(torch.randn(d_model))

        # Shared transformer backbone
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, dim_feedforward=d_model*4,
            batch_first=True,
        )
        self.backbone = nn.TransformerEncoder(enc_layer, num_layers=6)

        # Plan head (language generation)
        self.plan_head = nn.Linear(d_model, vocab_size)

        # Action head (flow matching)
        self.flow_head = FlowMatchingActionExpert(
            context_dim=d_model, action_dim=action_dim,
            chunk_size=chunk_size,
        )

    def plan(self, image, instruction_tokens):
        """Generate a language plan (autoregressive)."""
        B = image.shape[0]
        vis = self.vision_enc(image).unsqueeze(1)
        text = self.text_embed(instruction_tokens)
        mode = self.plan_mode_token.unsqueeze(0).unsqueeze(0).expand(B, 1, -1)

        generated = []
        context = torch.cat([vis, text, mode], dim=1)

        for _ in range(self.n_plan_tokens):
            T = context.shape[1]
            mask = torch.triu(torch.ones(T, T, device=image.device), diagonal=1).bool()
            out = self.backbone(context, mask=mask)
            logits = self.plan_head(out[:, -1])
            next_token = logits.argmax(dim=-1)
            generated.append(next_token)
            next_emb = self.text_embed(next_token).unsqueeze(1)
            context = torch.cat([context, next_emb], dim=1)

        return torch.stack(generated, dim=1)  # (B, n_plan_tokens)

    def act(self, image, subtask_tokens, actions=None):
        """Generate motor actions for a sub-task."""
        B = image.shape[0]
        vis = self.vision_enc(image).unsqueeze(1)
        text = self.text_embed(subtask_tokens)
        mode = self.act_mode_token.unsqueeze(0).unsqueeze(0).expand(B, 1, -1)

        context = torch.cat([vis, text, mode], dim=1)
        out = self.backbone(context)
        pooled = out.mean(dim=1)

        if actions is not None:
            return self.flow_head.training_loss(pooled, actions)
        else:
            return self.flow_head.sample(pooled)

    def full_pipeline(self, image, instruction_tokens):
        """Plan then act."""
        # Step 1: Generate plan
        plan_tokens = self.plan(image, instruction_tokens)

        # Step 2: Execute first sub-task
        # In practice, you'd parse plan_tokens into sub-tasks
        # Here we use the first chunk of plan tokens as the sub-task
        actions = self.act(image, plan_tokens[:, :10])

        return plan_tokens, actions

# Demo
# Reuse FlowMatchingActionExpert from Day 96
model = PlanActVLA()
img = torch.randn(2, 3, 224, 224)
instr = torch.randint(0, 1000, (2, 15))
actions_gt = torch.randn(2, 16, 7)

# Plan
plan = model.plan(img, instr)
print(f"Plan tokens: {plan.shape}")  # (2, 50)

# Act (training)
loss = model.act(img, instr, actions_gt)
print(f"Action loss: {loss.item():.4f}")

# Act (inference)
pred_actions = model.act(img, instr)
print(f"Predicted actions: {pred_actions.shape}")  # (2, 16, 7)
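
`full_pipeline` takes the first 10 plan tokens as a stand-in for the first sub-task. A slightly more realistic approach, assuming the vocabulary reserves a separator token between sub-tasks and an end-of-plan token, is to split the generated ids on those markers. The ids `SEP_ID` and `EOS_ID` and the helper `split_plan` are hypothetical; the toy model above defines no such special tokens.

```python
import torch

SEP_ID = 1   # hypothetical separator between sub-tasks
EOS_ID = 2   # hypothetical end-of-plan marker

def split_plan(plan_tokens: torch.Tensor) -> list:
    """Split one sequence of plan token ids (shape (T,)) into sub-task id lists."""
    sub_tasks, current = [], []
    for tok in plan_tokens.tolist():
        if tok == EOS_ID:
            break                      # ignore everything after end-of-plan
        if tok == SEP_ID:
            if current:                # skip empty segments
                sub_tasks.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        sub_tasks.append(current)
    return sub_tasks

plan_ids = torch.tensor([10, 11, SEP_ID, 20, 21, 22, SEP_ID, 30, EOS_ID, 99, 99])
sub_tasks = split_plan(plan_ids)
print(sub_tasks)  # [[10, 11], [20, 21, 22], [30]]
```

Each id list can then be turned back into a tensor and passed to `model.act` as `subtask_tokens`, one sub-task at a time.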

Exercise (45 min)

Plan quality analysis: Generate plans for 10 different instructions. Rate plan quality on a 1-5 scale. What types of instructions produce the best/worst plans?
Plan vs no-plan ablation: Compare π₀ (direct action) with π₀.5 (plan then act) on a 3-step manipulation task. Does planning help? When does it hurt?
Re-planning frequency: Experiment with re-planning after every sub-task vs every N steps vs only on failure. What's the optimal re-planning strategy?
Plan annotation: Create a dataset of 20 trajectories. Manually annotate sub-task boundaries. Then train a model to auto-annotate. Compare annotation quality.

Key Takeaways

π₀.5 = VLM planner + flow matching actor — think then do
Language plans provide interpretability, compositionality, and error recovery
Mode switching (plan vs act) uses the same backbone with different heads
Auto-labeled plans remove the need for human annotation of sub-tasks
Biggest gains on long-horizon tasks where decomposition matters most

Connection to the Thread

We've now seen the full spectrum of VLA designs: tokenized actions (RT-2, OpenVLA), diffusion heads (Octo), flow matching (π₀), and hybrid plan+act (π₀.5). Tomorrow we look at GR-2 and GR00T N1, video generation models repurposed for robot control. The question: can predicting future video frames teach a robot to act?