Phase V: Vision-Language Models | Week 10 | 1.5 hours
"VLMs can see and describe. VLAs will see and ACT. You're one phase away." (Phase V reflection)
Phase I (Days 1-14): Neural Network Foundations
Phase II (Days 15-30): Attention + Transformers
Phase III (Days 31-44): LLM Training + Alignment
Phase IV (Days 45-58): Vision Transformers + 3D + Video
Phase V (Days 59-66): Vision-Language Models ← YOU ARE HERE
Phase VI (Days 71+): Vision-Language-Action Models (VLAs)
Phase V is the bridge phase — it connected everything you built before:
 Phase II: Transformer            Phase IV: ViT
   (text attention)             (image attention)
           │                            │
           └─────────────┬──────────────┘
                         │
                   Phase V: VLMs
             (align vision + language)
                         │
                ┌────────┴────────┐
                │                 │
           Contrastive       Generative
         (CLIP, SigLIP)    (LLaVA, BLIP-2)
                │                 │
                └────────┬────────┘
                         │
                   Grounded VLMs
               (Florence-2, Ferret)
                         │
                  Phase VI: VLAs
             (add action to grounding)
Write 3-5 sentences for each. No notes allowed.
Why does contrastive learning on image-text pairs create a useful shared embedding space? Why is zero-shot classification an emergent property, not an explicit training objective?
You've seen three bridges: MLP (LLaVA), Q-Former (BLIP-2), and Perceiver (Flamingo). When would you choose each? What determines the right bridge?
Higher image resolution improves visual understanding but costs more tokens. How do modern VLMs (Qwen2-VL, LLaVA-NeXT) handle this? What's the right resolution for robotics?
Explain the pipeline: natural language query → VLM grounding → pixel coordinates → depth → 3D position → robot action. Where can each step fail?
VLMs can see, describe, and point. What can't they do? What additional capability does a VLA need that a VLM lacks?
Answer: Action prediction. A VLM can say "the cup is at (0.3, 0.5)" but can't predict the motor commands (joint angles, gripper width, trajectory) to actually pick it up. That requires action tokenization and policy learning — Phase VI.
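For the grounding pipeline prompt above, the "pixel coordinates → depth → 3D position" step can be sketched as a pinhole back-projection. This is a minimal sketch under stated assumptions, not a method taken from this course: the function name `pixel_to_camera_xyz`, the 640x480 image size, and the intrinsics `fx`, `fy`, `cx`, `cy` are made-up values for illustration, and the camera is assumed to be an undistorted pinhole with metric depth measured along the optical axis.

```python
import numpy as np

def pixel_to_camera_xyz(u_norm, v_norm, depth_m, width, height, fx, fy, cx, cy):
    """Back-project a normalized image point plus metric depth to camera-frame XYZ.

    Assumes a pinhole camera with no lens distortion. depth_m is the distance
    along the optical axis (Z), in meters.
    """
    # Depth sensors often return 0, NaN, or inf on shiny or transparent objects.
    if not np.isfinite(depth_m) or depth_m <= 0:
        raise ValueError("invalid depth at the grounded pixel")
    u = u_norm * width    # normalized [0, 1] -> pixel column
    v = v_norm * height   # normalized [0, 1] -> pixel row
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

# Hypothetical example: the VLM grounds "the cup" at (0.3, 0.5),
# and the depth map reads 0.8 m at that pixel.
point = pixel_to_camera_xyz(0.3, 0.5, depth_m=0.8, width=640, height=480,
                            fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(point)
```

The result is expressed in the camera frame. A real robot would still need a camera-to-base transform before planning a motion, which is one more place the pipeline can fail.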
Draw or sketch this on paper:
            CLIP
           /    \
      SigLIP    OpenCLIP
         |         |
     Idefics2   LLaVA-NeXT
         |         |
     Perceiver  MLP bridge
         |         |
     Flamingo   Q-Former ──── BLIP-2
                                |
                           InstructBLIP

     CoCa ──── dual objective ──── PaLI
                                    |
                             PaLI-3 (SigLIP)

     Grounding: Florence-2, Ferret, Kosmos-2
         |
     Robot perception pipeline
         |
     → Phase VI: VLAs
| Front | Back |
|---|---|
| CLIP loss formula | Symmetric cross-entropy on similarity matrix with learned temperature |
| SigLIP improvement over CLIP | Pairwise sigmoid loss — no softmax, scales better across GPUs |
| LLaVA bridge | 2-layer MLP (LLaVA-1.5; the original LLaVA used a single linear layer) projecting CLIP ViT features into the LLM embedding space |
| BLIP-2 Q-Former | 32 learnable queries cross-attend to ViT features, 188M params |
| Flamingo gating | x + tanh(α) · CrossAttn(x, v) — gate α initialized to 0 |
| CoCa design | Unimodal text layers (contrastive) + multimodal layers (captioning) |
| Coordinate tokens | Discretize [0,1] → <loc_XXX> tokens, model generates as text |
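
The first, second, and fifth flashcards can be checked against a few lines of NumPy. This is a rough sketch for self-testing, not the reference implementation of any of these models. The function names are hypothetical, and the temperature, scale, and bias are fixed constants here, whereas the real models learn them. `attn_out` stands in for the output of `CrossAttn(x, v)`.

```python
import numpy as np

def _log_softmax(logits, axis):
    # Subtract the max first so exp() cannot overflow.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def _l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def clip_loss(img, txt, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix."""
    logits = _l2_normalize(img) @ _l2_normalize(txt).T / temperature
    diag = np.arange(len(img))  # matching pairs sit on the diagonal
    loss_i2t = -_log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -_log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

def siglip_loss(img, txt, scale=10.0, bias=-10.0):
    """Pairwise sigmoid loss: each (image, text) pair is its own binary decision."""
    logits = scale * (_l2_normalize(img) @ _l2_normalize(txt).T) + bias
    labels = 2.0 * np.eye(len(img)) - 1.0  # +1 on the diagonal, -1 elsewhere
    # -log(sigmoid(z)) == log(1 + exp(-z)), computed stably with logaddexp.
    return np.logaddexp(0.0, -labels * logits).sum() / len(img)

def gated_cross_attention(x, attn_out, alpha):
    """Flamingo-style tanh gate: x + tanh(alpha) * CrossAttn(x, v)."""
    return x + np.tanh(alpha) * attn_out

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
# Using the image embeddings as "text" gives perfectly aligned pairs;
# reversing the batch order misaligns every pair.
clip_aligned, clip_shuffled = clip_loss(img, img), clip_loss(img, img[::-1])
sig_aligned, sig_shuffled = siglip_loss(img, img), siglip_loss(img, img[::-1])

x, attn_out = rng.normal(size=(2, 8)), rng.normal(size=(2, 8))
gated_at_init = gated_cross_attention(x, attn_out, alpha=0.0)
```

Two things to notice: both losses drop when the pairs are aligned, and with `alpha = 0` the gate passes `x` through unchanged, so a frozen LLM starts out behaving exactly as it did before the visual layers were added.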
After Phase V, you understand:
- How to encode images (ViT, DINO, Depth Anything)
- How to align vision and language (CLIP, SigLIP)
- How to bridge vision encoders to LLMs (MLP, Q-Former, Perceiver)
- How to ground language to spatial locations (coordinate tokens)
Phase VI adds the final piece: action tokens. Instead of generating text like "the cup is at (0.3, 0.5)", a VLA generates actions like "move_to(0.2, -0.1, 0.3), grasp(0.04)".
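
Coordinate tokens and action tokens can both be read as the same trick: bin a continuous value and emit the bin index as a token. The sketch below illustrates that idea under loud assumptions. The helper names, the 1000-bin choice for `<loc_XXX>`, the 256 bins for actions, and the 0 to 0.08 m gripper range are all hypothetical, and Phase VI may tokenize actions differently.

```python
def quantize(value, low, high, num_bins):
    """Map a continuous value in [low, high] to an integer bin index."""
    if not low <= value <= high:
        raise ValueError("value outside the tokenizable range")
    fraction = (value - low) / (high - low)
    # Clamp so that value == high lands in the last bin instead of overflowing.
    return min(int(fraction * num_bins), num_bins - 1)

def dequantize(index, low, high, num_bins):
    """Map a bin index back to the center of its bin."""
    return low + (index + 0.5) * (high - low) / num_bins

def to_loc_token(value, num_bins=1000):
    """Normalized coordinate in [0, 1] -> '<loc_XXX>' token string."""
    return f"<loc_{quantize(value, 0.0, 1.0, num_bins):03d}>"

def from_loc_token(token, num_bins=1000):
    """'<loc_XXX>' token string -> approximate normalized coordinate."""
    index = int(token[len("<loc_"):-1])
    return dequantize(index, 0.0, 1.0, num_bins)

# Grounding: the cup at (0.3, 0.5) becomes two text tokens.
cup_tokens = [to_loc_token(0.3), to_loc_token(0.5)]
cup_decoded = [from_loc_token(t) for t in cup_tokens]

# Action: a 0.04 m gripper opening in an assumed 0 to 0.08 m range, 256 bins.
grip_bin = quantize(0.04, 0.0, 0.08, 256)
grip_decoded = dequantize(grip_bin, 0.0, 0.08, 256)
```

Decoding returns the bin center, so the round trip loses at most half a bin width. That quantization error is the price of letting a language model emit positions and actions as ordinary tokens.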
VLMs see and describe. VLAs see and ACT. You're ready for the final phase.
Phase V is complete in spirit. The capstone (Days 67-68) will verify your understanding. Then Days 69-70 add practical VLM fine-tuning skills. After that: Phase VI — where models learn to act.