Phase VII - VLAs: Architecture to Deployment | Week 14 | 2.5 hours

"7B parameters. Open weights. Fine-tune on 1 GPU. The first VLA the community can actually use." (Kim et al., 2024)
Closed/Large: RT-2 (55B, closed)
│
Closed/Large: PaLM-E (562B, closed)
│
Open/Medium: OpenVLA (7B, fully open) ← sweet spot
│
Open/Small: Octo (93M, open)
OpenVLA = Prismatic VLM + action tokenization
Image (224×224) ──→ SigLIP + DINOv2 (fused vision)
                            │
                   visual tokens (256)
                            │
Language ──→ Llama 2 (7B) tokenizer + embedding
                            │
               ┌────────────▼────────────┐
               │   Llama 2 7B backbone   │
               │   (autoregressive LM)   │
               └────────────┬────────────┘
                            │
                     [action tokens]
                    256 bins × 7 dims
| Component | OpenVLA Choice | Why |
|---|---|---|
| Vision | SigLIP + DINOv2 (fused) | Combines semantic + spatial features |
| Language | Llama 2 7B | Open-source, efficient |
| Action | 256-bin tokenization | Compatible with LM head |
| Training | Open X-Embodiment subset | 970K episodes |
| Fine-tuning | LoRA (rank 32) | Single GPU, 1-4 hours |
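The 256-bin action tokenization in the table can be sketched in a few lines. This is a minimal illustration, not OpenVLA's code: the helper names `discretize` and `undiscretize` and the per-dimension bounds `low` and `high` are assumptions made for the example (in practice the bounds would come from dataset statistics).

```python
def discretize(action, low, high, n_bins=256):
    """Map each continuous action dimension to an integer bin in [0, n_bins - 1]."""
    bins = []
    for a, lo, hi in zip(action, low, high):
        a = min(max(a, lo), hi)          # clip to the assumed range
        frac = (a - lo) / (hi - lo)      # normalize to [0, 1]
        bins.append(min(int(frac * n_bins), n_bins - 1))
    return bins


def undiscretize(bins, low, high, n_bins=256):
    """Map bin indices back to continuous values at the bin centers."""
    return [lo + (b + 0.5) * (hi - lo) / n_bins
            for b, lo, hi in zip(bins, low, high)]


# Hypothetical 7-dimensional action (e.g. position delta, rotation delta, gripper)
low = [-1.0] * 7
high = [1.0] * 7
action = [0.10, -0.25, 0.00, 0.50, -1.00, 1.00, 0.33]

tokens = discretize(action, low, high)
recovered = undiscretize(tokens, low, high)
max_error = max(abs(a - r) for a, r in zip(action, recovered))

print(tokens)
print(f"max round-trip error: {max_error:.4f}")
```

With 256 bins over a range of width 2, each bin is 2/256 wide, so decoding at bin centers loses at most 1/256 (about 0.0039) per dimension. More bins shrink that error but enlarge the set of output classes the LM head has to distinguish.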
OpenVLA fuses two vision encoders for complementary features:
Image ─┬──→ SigLIP-SO400M (semantic features)
       │         │
       │         ▼
       │    [256 tokens × 1152 dims]
       │
       └──→ DINOv2-L (spatial features)
                 │
                 ▼
            [256 tokens × 1024 dims]
                 │
       concatenate per token → [256 tokens × 2176 dims]
                 │
       MLP fusion → [256 tokens × 4096 dims]
                 │
                 ▼
            Llama 2 input
SigLIP: trained on image-text pairs → knows what objects are
DINOv2: self-supervised → knows where objects are and their geometry
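The shape bookkeeping behind the fusion can be checked with simple arithmetic. The sketch below assumes both encoders use a patch size of 14 on a 224×224 input and that their features are concatenated per token before the MLP projector; `num_patch_tokens` is a hypothetical helper, not part of any library.

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches a ViT extracts from a square image."""
    assert image_size % patch_size == 0, "image size must be divisible by patch size"
    side = image_size // patch_size
    return side * side


# Assumed settings: 224x224 input, patch size 14 for both encoders
siglip_tokens = num_patch_tokens(224, 14)  # 16 * 16
dino_tokens = num_patch_tokens(224, 14)

siglip_dim, dino_dim, llm_dim = 1152, 1024, 4096

# Per-token (channel-wise) concatenation requires equal token counts
assert siglip_tokens == dino_tokens
fused_in = (siglip_tokens, siglip_dim + dino_dim)  # input to the MLP projector
fused_out = (siglip_tokens, llm_dim)               # what the language model receives

print(fused_in)   # (256, 2176)
print(fused_out)  # (256, 4096)
```

Because the concatenation is per token, the two encoders must emit the same number of tokens; only the feature dimension grows before the projector maps it to the language model's width.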
Phase 1: VLM pre-training (from Prismatic)
- SigLIP + DINOv2 + Llama 2
- Standard VLM training on image-text data
- Produces a strong visual reasoner
Phase 2: Action fine-tuning
- Add action tokens to vocabulary
- Fine-tune on Open X-Embodiment (970K episodes)
- Robot data only (unlike RT-2, no co-training on VLM data)
- 210K gradient steps, 64× A100s, 14 days
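One way to picture "adding action tokens to the vocabulary" is as an ID offset: action bin `b` gets token ID `text_vocab + b`, which is the scheme the simplified model later in this lesson uses. The sketch below assumes a 32,000-token text vocabulary and hypothetical helper names; it is not OpenVLA's tokenizer code.

```python
TEXT_VOCAB = 32_000  # assumed text vocabulary size
ACTION_BINS = 256    # one token per action bin


def bin_to_token_id(bin_idx: int) -> int:
    """Action bin index -> token ID placed after the text vocabulary."""
    if not 0 <= bin_idx < ACTION_BINS:
        raise ValueError(f"bin index {bin_idx} out of range")
    return TEXT_VOCAB + bin_idx


def is_action_token(token_id: int) -> bool:
    """True if the token ID falls in the action range."""
    return TEXT_VOCAB <= token_id < TEXT_VOCAB + ACTION_BINS


def token_id_to_bin(token_id: int) -> int:
    """Inverse of bin_to_token_id."""
    if not is_action_token(token_id):
        raise ValueError(f"token {token_id} is not an action token")
    return token_id - TEXT_VOCAB


action_bins = [140, 96, 128, 192, 0, 255, 170]  # one bin per action dimension
token_ids = [bin_to_token_id(b) for b in action_bins]

print(token_ids)
print([token_id_to_bin(t) for t in token_ids] == action_bins)  # True
```

Keeping actions in a contiguous ID range makes it cheap to slice the LM head's logits down to action bins only, which is what the loss in the simplified model does.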
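# Fine-tuning recipe (conceptual)
# Assumes `openvla_model` is an already-loaded OpenVLA model (loading omitted).
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=32,                # LoRA rank
    lora_alpha=32,       # scaling factor (alpha / r = 1)
    target_modules=["q_proj", "k_proj", "v_proj"],  # attention projections
    lora_dropout=0.05,
)
model = get_peft_model(openvla_model, config)
model.print_trainable_parameters()  # reports trainable vs. total parameter counts
# Only a small fraction of the parameters is trainable
# (~2% or less, depending on target_modules)
# Fine-tune on 50-1000 demos in 1-4 GPU-hours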
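To see why LoRA trains so few parameters, it helps to count them. For a linear layer with input size `d_in` and output size `d_out`, rank-`r` LoRA adds two small matrices with `r * (d_in + d_out)` parameters in total. The sketch below assumes Llama-2-7B-like shapes (32 layers, hidden size 4096, square q/k/v projections) and a nominal 7B base model; `lora_params` is a hypothetical helper.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters LoRA adds to one linear layer: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumed shapes: 32 transformer layers, hidden size 4096,
# q/k/v projections each mapping 4096 -> 4096.
n_layers, hidden, rank = 32, 4096, 32
target_modules = ["q_proj", "k_proj", "v_proj"]

per_layer = len(target_modules) * lora_params(hidden, hidden, rank)
total_lora = n_layers * per_layer
base_params = 7_000_000_000  # nominal "7B"; the exact count differs slightly

print(f"LoRA parameters: {total_lora:,}")
print(f"Fraction of base model: {100 * total_lora / base_params:.2f}%")
```

Under these assumptions the adapters hold 25,165,824 parameters, about 0.36% of the base model. Targeting more modules (for example the MLP projections) or raising the rank increases that fraction.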
import torch
import torch.nn as nn


class PrismaticVisionEncoder(nn.Module):
    """Fused dual-encoder vision backbone."""

    def __init__(self, out_dim=4096):
        super().__init__()
        # Simplified: in practice these are SigLIP + DINOv2
        self.semantic_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(16),
            nn.Flatten(2),  # (B, 128, 256)
        )
        self.spatial_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(16),
            nn.Flatten(2),  # (B, 128, 256)
        )
        self.fusion = nn.Linear(256, out_dim)  # 128 + 128 input channels

    def forward(self, image):
        sem = self.semantic_encoder(image).permute(0, 2, 1)  # (B, 256, 128)
        spa = self.spatial_encoder(image).permute(0, 2, 1)   # (B, 256, 128)
        fused = torch.cat([sem, spa], dim=-1)                # (B, 256, 256)
        return self.fusion(fused)                            # (B, 256, out_dim)
class OpenVLAStyle(nn.Module):
    """Simplified OpenVLA architecture."""

    def __init__(self, d_model=512, n_heads=8, n_layers=8,
                 text_vocab=32000, action_bins=256, action_dims=7):
        super().__init__()
        self.action_bins = action_bins
        self.action_dims = action_dims
        self.text_vocab = text_vocab
        total_vocab = text_vocab + action_bins

        # Vision
        self.vision = PrismaticVisionEncoder(out_dim=d_model)
        # Text embedding (shared by text and action tokens)
        self.text_embed = nn.Embedding(total_vocab, d_model)
        # Transformer (simplified Llama)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads,
            dim_feedforward=d_model * 4, batch_first=True,
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, n_layers)
        # LM head
        self.lm_head = nn.Linear(d_model, total_vocab)

    def forward(self, image, text_tokens, action_tokens):
        """
        image: (B, 3, 224, 224)
        text_tokens: (B, T_text), text token ids
        action_tokens: (B, action_dims), ground truth action bin ids
        """
        B = image.shape[0]
        # Encode vision
        vis_tokens = self.vision(image)  # (B, 256, D)
        # Encode text
        text_emb = self.text_embed(text_tokens)  # (B, T, D)
        # Encode action tokens (teacher forcing, shift right)
        action_input = action_tokens[:, :-1] + self.text_vocab  # Offset to action vocab
        action_emb = self.text_embed(action_input)  # (B, action_dims-1, D)
        # Concatenate: [vision | text | actions]
        sequence = torch.cat([vis_tokens, text_emb, action_emb], dim=1)
        # Causal mask (True = position may not be attended to)
        T = sequence.shape[1]
        mask = torch.triu(torch.ones(T, T, device=image.device), diagonal=1).bool()
        # Transform
        out = self.transformer(sequence, mask=mask)
        # Get predictions at action positions
        n_vis = vis_tokens.shape[1]
        n_text = text_tokens.shape[1]
        action_start = n_vis + n_text - 1  # One before first action
        action_end = action_start + self.action_dims
        action_logits = self.lm_head(out[:, action_start:action_end])
        action_logits = action_logits[:, :, self.text_vocab:]  # Action bins only
        loss = nn.functional.cross_entropy(
            action_logits.reshape(-1, self.action_bins),
            action_tokens.reshape(-1),
        )
        return loss

    @torch.no_grad()
    def predict(self, image, text_tokens):
        """Autoregressive action prediction."""
        vis = self.vision(image)
        text_emb = self.text_embed(text_tokens)
        sequence = torch.cat([vis, text_emb], dim=1)
        actions = []
        for _ in range(self.action_dims):
            T = sequence.shape[1]
            mask = torch.triu(torch.ones(T, T, device=image.device), diagonal=1).bool()
            out = self.transformer(sequence, mask=mask)
            logits = self.lm_head(out[:, -1:])
            action_logits = logits[:, :, self.text_vocab:]
            bin_idx = action_logits.argmax(dim=-1)  # (B, 1)
            actions.append(bin_idx.squeeze(-1))
            token_emb = self.text_embed(bin_idx.squeeze(-1) + self.text_vocab)
            sequence = torch.cat([sequence, token_emb.unsqueeze(1)], dim=1)
        return torch.stack(actions, dim=-1)  # (B, action_dims)
# Demo
model = OpenVLAStyle()
img = torch.randn(2, 3, 224, 224)
txt = torch.randint(0, 1000, (2, 20))
act = torch.randint(0, 256, (2, 7))

loss = model(img, txt, act)
print(f"Loss: {loss.item():.4f}")

model.eval()  # disable dropout so greedy decoding is deterministic
pred = model.predict(img, txt)
print(f"Predicted: {pred.shape}")  # (2, 7): one bin id per action dimension
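The slice `out[:, action_start:action_end]` in `forward` is easy to get wrong by one. Under a causal mask, the output at position `i` predicts the token at position `i + 1`, so the last text position predicts the first action token. The sketch below reproduces that index arithmetic in plain Python with the demo's sizes (256 visual tokens, 20 text tokens, 7 action dimensions); the helper names are hypothetical.

```python
def action_prediction_positions(n_vis: int, n_text: int, action_dims: int):
    """Output positions whose logits predict each action token."""
    start = n_vis + n_text - 1  # last text position predicts action 0
    return list(range(start, start + action_dims))


def causal_mask(T: int):
    """Boolean mask where True means 'may not attend' (same layout as triu with diagonal=1)."""
    return [[j > i for j in range(T)] for i in range(T)]


n_vis, n_text, action_dims = 256, 20, 7
positions = action_prediction_positions(n_vis, n_text, action_dims)

# With teacher forcing, the input holds only the first action_dims - 1 action tokens
seq_len = n_vis + n_text + (action_dims - 1)

print(positions)  # [275, 276, 277, 278, 279, 280, 281]
print(seq_len)    # 282

for row in causal_mask(4):
    print(row)
```

The last prediction position (281) is also the last position of the 282-token input, which confirms that the slice covers exactly the seven action predictions and nothing else.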
Dual encoder analysis: Compare SigLIP-only vs DINOv2-only vs fused. Which features help more for grasping (spatial) vs object identification (semantic)?
LoRA fine-tuning: Implement LoRA for the transformer layers. Compare trainable parameter count vs full fine-tuning. Measure performance with 100 demos.
Vocabulary efficiency: The action vocabulary (256 tokens) is <1% of the text vocabulary (32K). Analyze: does this imbalance affect training? What if you use 1024 action bins?
Scaling comparison: Plot parameter count vs success rate for RT-1 (35M), Octo (93M), OpenVLA (7B), RT-2 (55B). Is bigger always better?
RT-2 showed VLMs can be VLAs. Octo showed open-source works. OpenVLA made it practical with LoRA fine-tuning. Tomorrow: π₀ from Physical Intelligence takes a radically different approach — a VLM backbone with a flow matching action head, achieving state-of-the-art on dexterous manipulation. The question: tokenize actions or generate them continuously?