
Day 97: π₀.5 — Hybrid VLA with Language Planning

Phase VII — VLAs: Architecture to Deployment | Week 14 | 2.5 hours

"Think in language, act in flow. π₀.5 generates a natural language plan, then executes each step with flow matching actions." — Physical Intelligence, 2025


Theory (45 min)

97.1 The Two-System Architecture

π₀.5 separates thinking from acting:

System 2 (slow, deliberate):
  VLM generates a language plan
  "1. Open the drawer  2. Pick up the sponge  3. Place it in the sink"

System 1 (fast, reactive):
  Flow matching generates motor actions for each step
  ΔEE = [0.01, -0.02, 0.03, ...] at 50 Hz
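
Before the actor can work through a plan like the one quoted above, the plan text has to be split into individual sub-tasks. A minimal sketch, assuming the planner emits steps numbered "1.", "2.", and so on; `split_plan` is a hypothetical helper, not part of π₀.5:

```python
import re

def split_plan(plan_text: str) -> list[str]:
    """Split a numbered plan string into an ordered list of sub-task strings."""
    # Split on step markers such as "1. " or "  2. ". A decimal like "3.5 cm"
    # is not treated as a marker because no whitespace follows the dot.
    parts = re.split(r"\s*\d+\.\s+", plan_text)
    return [p.strip() for p in parts if p.strip()]

plan_text = "1. Open the drawer  2. Pick up the sponge  3. Place it in the sink"
sub_tasks = split_plan(plan_text)
print(sub_tasks)
# ['Open the drawer', 'Pick up the sponge', 'Place it in the sink']
```

A real system would more likely decode one sub-task at a time or rely on a separator token, so treat the regex as illustration only.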

97.2 Architecture

┌─────────────────────────────────────────────────────────┐
│                        π₀.5                              │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  Image + Language instruction                            │
│       │                                                  │
│       ▼                                                  │
│  ┌────────────────────────────┐                          │
│  │  VLM Backbone (PaliGemma)  │                          │
│  │                            │                          │
│  │  Mode 1: PLAN              │                          │
│  │  Output: text plan tokens  │ "open drawer, pick up.." │
│  │                            │                          │
│  │  Mode 2: ACT               │                          │
│  │  Output: flow features     │ → action expert          │
│  └────────────┬───────────────┘                          │
│               │                                          │
│       ┌───────┴──────────┐                               │
│       │                  │                               │
│   PLAN mode          ACT mode                            │
│       │                  │                               │
│       ▼                  ▼                               │
│  Language tokens    Flow Matching Expert                  │
│  (next sub-task)    (motor commands)                     │
│                                                          │
└─────────────────────────────────────────────────────────┘

97.3 Plan → Act Loop

# Pseudo-code for π₀.5 inference
plan = vlm.generate_plan(image, instruction)
# plan = ["open drawer", "pick up sponge", "place in sink"]

for sub_task in plan:
    while not sub_task_complete(sub_task):
        image = camera.capture()
        action_chunk = flow_expert.sample(
            context=vlm.encode(image, sub_task),
            n_steps=10,
        )
        robot.execute(action_chunk)
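
The pseudo-code above never exits if a sub-task cannot be completed. The sketch below is one way to make the loop runnable and bounded. Every class here (`StubPlanner`, `StubActor`, `StubRobot`) is a hypothetical stand-in, and the per-sub-task chunk budget is an assumption rather than something the source describes:

```python
class StubPlanner:
    """Hypothetical stand-in for the VLM in PLAN mode."""
    def generate_plan(self, image, instruction):
        return ["open drawer", "pick up sponge", "place in sink"]

class StubActor:
    """Hypothetical stand-in for the flow matching expert in ACT mode."""
    def sample(self, image, sub_task, chunk_size=16, action_dim=7):
        # Returns an all-zero action chunk of shape (chunk_size, action_dim).
        return [[0.0] * action_dim for _ in range(chunk_size)]

class StubRobot:
    """Hypothetical robot: every sub-task counts as done after 3 executed chunks."""
    def __init__(self):
        self.chunks_done = 0
    def start_sub_task(self, sub_task):
        self.chunks_done = 0
    def capture(self):
        return "image"
    def execute(self, action_chunk):
        self.chunks_done += 1
    def sub_task_complete(self, sub_task):
        return self.chunks_done >= 3

def run_plan(planner, actor, robot, instruction, max_chunks_per_subtask=20):
    """Plan once, then execute each sub-task closed-loop with a chunk budget."""
    plan = planner.generate_plan(robot.capture(), instruction)
    log = []
    for sub_task in plan:
        robot.start_sub_task(sub_task)
        chunks = 0
        while not robot.sub_task_complete(sub_task):
            if chunks >= max_chunks_per_subtask:
                log.append((sub_task, "timeout", chunks))
                return log  # stop here instead of looping forever
            image = robot.capture()  # fresh observation before every chunk
            robot.execute(actor.sample(image, sub_task))
            chunks += 1
        log.append((sub_task, "done", chunks))
    return log

print(run_plan(StubPlanner(), StubActor(), StubRobot(), "put the sponge in the sink"))
```

Re-capturing the image before each chunk is what keeps the actor reactive: each new chunk is conditioned on what the robot sees now, not on the frame used for planning.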

97.4 Training Strategy

π₀.5 trains on three types of data:

Data Type               | What the model learns       | Format
------------------------+-----------------------------+-------------------------------------
VLM data (web)          | Visual reasoning, language  | (image, text) pairs
Robot + language plans  | Task decomposition          | (image, instruction, plan, actions)
Robot actions only      | Motor control               | (image, sub-task, actions)

The key innovation: sub-task labels are not written by human annotators. A teacher VLM labels robot trajectories with sub-task descriptions, and π₀.5 trains on those automatically generated plans.
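
To make the three-way data mix concrete, here is a sketch of how a training batch could be drawn from the three sources and which loss terms each example can supervise. The mixture weights are invented for illustration (the source gives no proportions), and the mapping from data type to loss term is one reading of the table above:

```python
import random

# Hypothetical mixture weights; the real proportions are not given in the source.
MIXTURE = {
    "vlm_web": 0.4,           # (image, text) pairs
    "robot_with_plans": 0.3,  # (image, instruction, plan, actions)
    "robot_actions": 0.3,     # (image, sub-task, actions)
}

# Assumed mapping from data type to the loss terms it can supervise.
LOSS_TERMS = {
    "vlm_web": ("text",),
    "robot_with_plans": ("text", "action"),
    "robot_actions": ("action",),
}

def sample_sources(batch_size, rng):
    """Draw the data source for each example in a batch according to MIXTURE."""
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return rng.choices(names, weights=weights, k=batch_size)

def count_loss_terms(batch_sources):
    """Count how many examples in the batch contribute to each loss term."""
    counts = {"text": 0, "action": 0}
    for source in batch_sources:
        for term in LOSS_TERMS[source]:
            counts[term] += 1
    return counts

rng = random.Random(0)
batch = sample_sources(8, rng)
print(batch)
print(count_loss_terms(batch))
```

Under this reading, only the robot-with-plans examples supervise both the text and action losses in the same step.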

97.5 Benefits of Language Planning

  1. Interpretability: humans can read and verify the plan
  2. Compositionality: novel task combinations without retraining
  3. Error recovery: if a sub-task fails, re-plan from current state
  4. Long-horizon: chains of sub-tasks handle multi-minute tasks
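
The error-recovery point can be sketched as a loop that re-plans from the current state whenever a sub-task fails. Everything below is a toy model: `plan_from_state` and `FlakyExecutor` are hypothetical stand-ins, and "state" is reduced to the list of completed sub-tasks:

```python
GOAL_STEPS = ["open drawer", "pick up sponge", "place in sink"]

def plan_from_state(completed):
    """Hypothetical planner: return the goal steps that are not yet completed."""
    return [step for step in GOAL_STEPS if step not in completed]

class FlakyExecutor:
    """Hypothetical executor that fails the first attempt at one sub-task."""
    def __init__(self, fail_once_on):
        self.fail_once_on = fail_once_on
        self.already_failed = False

    def run(self, sub_task):
        if sub_task == self.fail_once_on and not self.already_failed:
            self.already_failed = True
            return False
        return True

def execute_with_replanning(executor, max_replans=3):
    """Execute the plan step by step, re-planning after each failed sub-task."""
    completed, trace, replans = [], [], 0
    plan = plan_from_state(completed)
    while plan:
        sub_task = plan[0]
        ok = executor.run(sub_task)
        trace.append((sub_task, "ok" if ok else "failed"))
        if ok:
            completed.append(sub_task)
            plan = plan[1:]
        else:
            if replans >= max_replans:
                break  # give up rather than retry indefinitely
            replans += 1
            plan = plan_from_state(completed)  # re-plan from the current state
    return completed, trace, replans

completed, trace, replans = execute_with_replanning(FlakyExecutor("pick up sponge"))
print(trace)
print(f"completed={len(completed)}/{len(GOAL_STEPS)}, replans={replans}")
```

In this toy the re-plan simply retries the failed step. A real planner conditioned on a fresh image could instead insert a recovery step, such as re-opening a drawer that slid shut.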

97.6 Results

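Task                     | π₀   | π₀.5 | Improvement (percentage points)
-------------------------+------+------+--------------------------------
Clean table (5 items)    | 70%  | 88%  | +18
Prepare bento box        | 45%  | 72%  | +27
Laundry fold + sort      | 60%  | 82%  | +22
Novel instruction combo  | 20%  | 65%  | +45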

Biggest gains on long-horizon and compositional tasks.
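
The improvement figures are absolute differences, which can hide how large a change is relative to the baseline. The short calculation below uses only the numbers from the table and adds two derived views: relative gain over π₀, and the share of π₀ failures that π₀.5 eliminates. The `gains` helper is introduced here for illustration:

```python
# (π₀ %, π₀.5 %) per task, copied from the results table.
results = {
    "Clean table (5 items)": (70, 88),
    "Prepare bento box": (45, 72),
    "Laundry fold + sort": (60, 82),
    "Novel instruction combo": (20, 65),
}

def gains(pi0, pi05):
    """Return (absolute gain in points, relative gain %, failure reduction %)."""
    absolute = pi05 - pi0
    relative = 100 * absolute / pi0
    failure_reduction = 100 * absolute / (100 - pi0)
    return absolute, relative, failure_reduction

for task, (pi0, pi05) in results.items():
    absolute, relative, failure_reduction = gains(pi0, pi05)
    print(f"{task:<24} +{absolute} pts | +{relative:.1f}% relative | "
          f"{failure_reduction:.1f}% fewer failures")
```

For the novel instruction combo, for example, the 45-point gain is a 225% relative increase over the 20% baseline.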


Implementation (60 min)

Plan-Act Architecture

import torch
import torch.nn as nn

class PlanActVLA(nn.Module):
    """Hybrid VLA with language planning + flow matching actions."""

    def __init__(self, d_model=512, vocab_size=10000, action_dim=7,
                 chunk_size=16, n_plan_tokens=50):
        super().__init__()
        self.d_model = d_model
        self.vocab_size = vocab_size
        self.n_plan_tokens = n_plan_tokens

        # Shared vision + language encoder
        self.vision_enc = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
            nn.Flatten(), nn.Linear(64*16, d_model),
        )
        self.text_embed = nn.Embedding(vocab_size, d_model)

        # Mode embeddings
        self.plan_mode_token = nn.Parameter(torch.randn(d_model))
        self.act_mode_token = nn.Parameter(torch.randn(d_model))

        # Shared transformer backbone
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, dim_feedforward=d_model*4,
            batch_first=True,
        )
        self.backbone = nn.TransformerEncoder(enc_layer, num_layers=6)

        # Plan head (language generation)
        self.plan_head = nn.Linear(d_model, vocab_size)

        # Action head (flow matching)
        self.flow_head = FlowMatchingActionExpert(
            context_dim=d_model, action_dim=action_dim,
            chunk_size=chunk_size,
        )

    def plan(self, image, instruction_tokens):
        """Generate a language plan (autoregressive)."""
        B = image.shape[0]
        vis = self.vision_enc(image).unsqueeze(1)
        text = self.text_embed(instruction_tokens)
        mode = self.plan_mode_token.unsqueeze(0).unsqueeze(0).expand(B, 1, -1)

        generated = []
        context = torch.cat([vis, text, mode], dim=1)

        for _ in range(self.n_plan_tokens):
            T = context.shape[1]
            mask = torch.triu(torch.ones(T, T, device=image.device), diagonal=1).bool()
            out = self.backbone(context, mask=mask)
            logits = self.plan_head(out[:, -1])
            next_token = logits.argmax(dim=-1)
            generated.append(next_token)
            next_emb = self.text_embed(next_token).unsqueeze(1)
            context = torch.cat([context, next_emb], dim=1)

        return torch.stack(generated, dim=1)  # (B, n_plan_tokens)

    def act(self, image, subtask_tokens, actions=None):
        """Generate motor actions for a sub-task."""
        B = image.shape[0]
        vis = self.vision_enc(image).unsqueeze(1)
        text = self.text_embed(subtask_tokens)
        mode = self.act_mode_token.unsqueeze(0).unsqueeze(0).expand(B, 1, -1)

        context = torch.cat([vis, text, mode], dim=1)
        out = self.backbone(context)
        pooled = out.mean(dim=1)

        if actions is not None:
            return self.flow_head.training_loss(pooled, actions)
        else:
            return self.flow_head.sample(pooled)

    def full_pipeline(self, image, instruction_tokens):
        """Plan then act."""
        # Step 1: Generate plan
        plan_tokens = self.plan(image, instruction_tokens)

        # Step 2: Execute first sub-task
        # In practice, you'd parse plan_tokens into sub-tasks
        # Here we use the first chunk of plan tokens as the sub-task
        actions = self.act(image, plan_tokens[:, :10])

        return plan_tokens, actions

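# Demo
# FlowMatchingActionExpert (Day 96) must be defined or imported before this point;
# otherwise PlanActVLA() raises NameError.
model = PlanActVLA()
img = torch.randn(2, 3, 224, 224)
instr = torch.randint(0, 1000, (2, 15))
actions_gt = torch.randn(2, 16, 7)

# Plan (inference): eval mode disables dropout, no_grad skips autograd bookkeeping
model.eval()
with torch.no_grad():
    plan = model.plan(img, instr)
print(f"Plan tokens: {plan.shape}")  # (2, 50)

# Act (training): back to train mode so dropout is active
model.train()
loss = model.act(img, instr, actions_gt)
print(f"Action loss: {loss.item():.4f}")

# Act (inference)
model.eval()
with torch.no_grad():
    pred_actions = model.act(img, instr)
print(f"Predicted actions: {pred_actions.shape}")  # (2, 16, 7)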
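
The implementation above depends on FlowMatchingActionExpert from Day 96, which is not reproduced in this lesson. The block below is a minimal stand-in, not the Day 96 class. It assumes only the interface used above: constructor arguments `context_dim`, `action_dim`, `chunk_size`, plus the methods `training_loss(context, actions)` and `sample(context)`. The MLP velocity network, the `hidden` width, and the straight-line interpolation path are illustrative choices. The default of 10 integration steps mirrors `n_steps=10` in the 97.3 pseudo-code.

```python
import torch
import torch.nn as nn

class FlowMatchingActionExpert(nn.Module):
    """Stand-in flow matching head over flattened action chunks (illustrative only)."""

    def __init__(self, context_dim, action_dim, chunk_size, hidden=256, n_steps=10):
        super().__init__()
        self.action_dim = action_dim
        self.chunk_size = chunk_size
        self.n_steps = n_steps
        flat = action_dim * chunk_size
        # Velocity network: input is [noisy actions, context, time], output is a velocity.
        self.net = nn.Sequential(
            nn.Linear(flat + context_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, flat),
        )

    def velocity(self, x_t, t, context):
        # x_t: (B, flat), t: (B, 1), context: (B, context_dim)
        return self.net(torch.cat([x_t, context, t], dim=-1))

    def training_loss(self, context, actions):
        B = actions.shape[0]
        x1 = actions.reshape(B, -1)             # data sample (ground-truth chunk)
        x0 = torch.randn_like(x1)               # noise sample
        t = torch.rand(B, 1, device=x1.device)  # random time in [0, 1)
        x_t = (1 - t) * x0 + t * x1             # point on the straight path
        target = x1 - x0                        # constant velocity along that path
        return ((self.velocity(x_t, t, context) - target) ** 2).mean()

    @torch.no_grad()
    def sample(self, context):
        B = context.shape[0]
        x = torch.randn(B, self.chunk_size * self.action_dim, device=context.device)
        dt = 1.0 / self.n_steps
        for i in range(self.n_steps):           # Euler integration from t=0 to t=1
            t = torch.full((B, 1), i * dt, device=context.device)
            x = x + dt * self.velocity(x, t, context)
        return x.reshape(B, self.chunk_size, self.action_dim)

expert = FlowMatchingActionExpert(context_dim=512, action_dim=7, chunk_size=16)
ctx = torch.randn(2, 512)
print(expert.training_loss(ctx, torch.randn(2, 16, 7)).shape)  # scalar loss
print(expert.sample(ctx).shape)
```

Define this class (or import the Day 96 version) before instantiating PlanActVLA, since the name is resolved when `__init__` runs.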

Exercise (45 min)

  1. Plan quality analysis: Generate plans for 10 different instructions. Rate plan quality on a 1-5 scale. What types of instructions produce the best/worst plans?

  2. Plan vs no-plan ablation: Compare π₀ (direct action) with π₀.5 (plan then act) on a 3-step manipulation task. Does planning help? When does it hurt?

  3. Re-planning frequency: Experiment with re-planning after every sub-task vs every N steps vs only on failure. What's the optimal re-planning strategy?

  4. Plan annotation: Create a dataset of 20 trajectories. Manually annotate sub-task boundaries. Then train a model to auto-annotate. Compare annotation quality.


Key Takeaways

  1. π₀.5 = VLM planner + flow matching actor — think then do
  2. Language plans provide interpretability, compositionality, and error recovery
  3. Mode switching (plan vs act) uses the same backbone with different heads
  4. Auto-labeled plans remove the need for human annotation of sub-tasks
  5. Biggest gains on long-horizon tasks where decomposition matters most

Connection to the Thread

We've now seen the full spectrum of VLA designs: tokenized actions (RT-2, OpenVLA), diffusion heads (Octo), flow matching (π₀), and hybrid plan+act (π₀.5). Tomorrow we look at GR-2 and GR00T N1, video generation models repurposed for robot control. The question: can predicting future video frames teach a robot to act?


Further Reading

  • Physical Intelligence (2025), "π₀.5: a Vision-Language-Action Model with Open-World Generalization"
  • Ahn et al. (2022), "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances" (SayCan — precursor to plan+act)