Day 66: Stop & Reflect 4

Phase V: Vision-Language Models | Week 10 | 1.5 hours

"VLMs can see and describe. VLAs will see and ACT. You're one phase away." (Phase V reflection)

The Journey So Far

Phase I   (Days 1-14):   Neural Network Foundations
Phase II  (Days 15-30):  Attention + Transformers
Phase III (Days 31-44):  LLM Training + Alignment
Phase IV  (Days 45-58):  Vision Transformers + 3D + Video
Phase V   (Days 59-70):  Vision-Language Models  ← YOU ARE HERE
Phase VI  (Days 71+):    Vision-Language-Action Models (VLAs)

What Phase V Connected

Phase V is the bridge phase — it connected everything you built before:

Phase II: Transformer          Phase IV: ViT
(text attention)               (image attention)
        │                              │
        └──────────────────────────────┘
                       │
                 Phase V: VLMs
          (align vision + language)
                       │
              ┌────────┴─────────┐
              │                  │
        Contrastive           Generative
        (CLIP, SigLIP)        (LLaVA, BLIP-2)
              │                  │
              └────────┬─────────┘
                       │
              Grounded VLMs
         (Florence-2, Ferret)
                       │
                  Phase VI: VLAs
           (add action to grounding)

Reflection Questions (60 min)

Write 3-5 sentences for each. No notes allowed.

1. CLIP's Core Insight

Why does contrastive learning on image-text pairs create a useful shared embedding space? Why is zero-shot classification an emergent property, not an explicit training objective?
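Once you have written your answer, a toy example can help check it. The sketch below is a minimal NumPy version of the symmetric contrastive loss, assuming random vectors in place of real encoder outputs and a fixed temperature of 0.07 (CLIP learns its temperature during training). The helper names `clip_loss` and `zero_shot_classify` are illustrative, not taken from any library.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(logits, axis):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix."""
    logits = l2_normalize(image_emb) @ l2_normalize(text_emb).T / temperature
    idx = np.arange(len(logits))  # matched pairs sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image picks its caption
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # caption picks its image
    return 0.5 * (loss_i2t + loss_t2i)

def zero_shot_classify(image_emb, class_text_emb):
    """Assign each image to the class prompt with the highest cosine similarity."""
    sims = l2_normalize(image_emb) @ l2_normalize(class_text_emb).T
    return sims.argmax(axis=1)

# Toy data (assumption): "aligned" images are noisy copies of their captions.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 16))
aligned_images = text_emb + 0.05 * rng.normal(size=(4, 16))
random_images = rng.normal(size=(4, 16))

loss_aligned = clip_loss(aligned_images, text_emb)
loss_random = clip_loss(random_images, text_emb)
predictions = zero_shot_classify(aligned_images, text_emb)
print(loss_aligned, loss_random, predictions)
```

Note that `zero_shot_classify` introduces no new parameters. It reuses the same similarity computation the loss was trained on, with class prompts standing in for captions, which is one way to see why zero-shot classification falls out of the objective rather than being trained for directly.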

2. Bridge Architecture Tradeoffs

You've seen three bridges: MLP (LLaVA), Q-Former (BLIP-2), and Perceiver (Flamingo). When would you choose each? What determines the right bridge?
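One concrete axis for comparing bridges is how many visual tokens reach the LLM. The sketch below contrasts a per-patch MLP projector with a single cross-attention layer driven by learnable queries, which is the shared idea behind the Q-Former and the Perceiver Resampler. It is heavily simplified: the widths `D_VIT` and `D_LLM` are toy values, the weights are random, the real Q-Former is a multi-layer transformer, and Flamingo feeds its resampled tokens into gated cross-attention layers rather than into the input sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
D_VIT, D_LLM = 64, 128  # toy widths (assumption), not real model sizes

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mlp_bridge(patches, w1, w2):
    """Per-patch projector: every patch token becomes one LLM token."""
    hidden = np.maximum(patches @ w1, 0.0)  # ReLU stands in for GELU
    return hidden @ w2                       # (num_patches, D_LLM)

def query_bridge(patches, queries, w_k, w_v, w_out):
    """Learnable queries cross-attend to patch features (single head)."""
    keys, values = patches @ w_k, patches @ w_v
    attn = softmax(queries @ keys.T / np.sqrt(queries.shape[-1]))
    return (attn @ values) @ w_out           # (num_queries, D_LLM)

w1 = rng.normal(size=(D_VIT, D_LLM)) * 0.1
w2 = rng.normal(size=(D_LLM, D_LLM)) * 0.1
w_k = rng.normal(size=(D_VIT, D_VIT)) * 0.1
w_v = rng.normal(size=(D_VIT, D_VIT)) * 0.1
w_out = rng.normal(size=(D_VIT, D_LLM)) * 0.1
queries = rng.normal(size=(32, D_VIT))  # 32 queries, as on the BLIP-2 card

shapes = {}
for num_patches in (576, 2304):
    patches = rng.normal(size=(num_patches, D_VIT))
    shapes[num_patches] = (
        mlp_bridge(patches, w1, w2).shape,
        query_bridge(patches, queries, w_k, w_v, w_out).shape,
    )
    print(num_patches, shapes[num_patches])
```

The MLP output grows with the number of patches, while the query-based output stays at 32 tokens. That fixed budget is the main thing the extra complexity buys, at the cost of compressing spatial detail.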

3. The Resolution-Compute Tradeoff

Higher image resolution improves visual understanding but costs more tokens. How do modern VLMs (Qwen2-VL, LLaVA-NeXT) handle this? What's the right resolution for robotics?
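The token cost is easy to work out by hand. The sketch below assumes a plain ViT with 14-pixel patches and models two strategies only roughly: fixed-size tiling plus a downscaled overview image (in the spirit of LLaVA-NeXT) and merging neighbouring tokens before the LLM (in the spirit of Qwen2-VL). The function names and default values are illustrative, and real models add special tokens and other details not counted here.

```python
import math

def visual_tokens(height, width, patch=14, merge=1):
    """Tokens for one image: patch grid, optionally merging merge x merge neighbours."""
    cell = patch * merge
    return (height // cell) * (width // cell)

def tiled_tokens(height, width, tile=336, patch=14):
    """Fixed-size tiles plus one downscaled overview image, each encoded separately."""
    tiles = math.ceil(height / tile) * math.ceil(width / tile)
    return (tiles + 1) * visual_tokens(tile, tile, patch)

for side in (224, 336, 448, 672):
    n = visual_tokens(side, side)
    # Self-attention cost grows with the square of the sequence length.
    print(f"{side}x{side}: {n} tokens, {n * n} attention pairs")

print("tiled 672x672:", tiled_tokens(672, 672))
print("448x448 with 2x2 merging:", visual_tokens(448, 448, merge=2))
```

Doubling the side length from 224 to 448 pixels quadruples the token count (256 to 1024) and multiplies the number of attention pairs by 16. For a robot control loop, that is the cost you are trading against finer visual detail.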

4. From Understanding to Grounding

Explain the pipeline: natural language query → VLM grounding → pixel coordinates → depth → 3D position → robot action. Where can each step fail?
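The geometric middle of that pipeline (pixel coordinates → depth → 3D position) can be sketched with the standard pinhole camera model. Everything numeric below is an assumption for illustration: the 640x480 image size, the intrinsics `fx`, `fy`, `cx`, `cy`, the 0.8 m depth reading, and the camera-to-base transform `T_base_cam`. A real system would take these from camera calibration and a depth sensor or depth model.

```python
import numpy as np

def normalized_to_pixel(x_norm, y_norm, width, height):
    """Map a VLM's normalized (x, y) in [0, 1] to the nearest pixel (u, v)."""
    u = min(int(round(x_norm * width)), width - 1)
    v = min(int(round(y_norm * height)), height - 1)
    return u, v

def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Pinhole model: pixel plus metric depth -> 3D point in the camera frame."""
    if not np.isfinite(depth_m) or depth_m <= 0:
        raise ValueError("invalid depth at grounded pixel")  # common failure point
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

def camera_to_base(point_cam, T_base_cam):
    """Apply a 4x4 homogeneous transform (camera frame -> robot base frame)."""
    return (T_base_cam @ np.append(point_cam, 1.0))[:3]

# Hypothetical values: the VLM grounds "the cup" at (0.3, 0.5).
u, v = normalized_to_pixel(0.3, 0.5, width=640, height=480)
point_cam = backproject(u, v, depth_m=0.8, fx=600.0, fy=600.0, cx=320.0, cy=240.0)

T_base_cam = np.eye(4)
T_base_cam[:3, 3] = [0.5, 0.0, 0.4]  # assumed camera offset from the robot base
point_base = camera_to_base(point_cam, T_base_cam)
print(u, v, point_cam, point_base)
```

Each function marks a place where the pipeline can fail: a wrong grounding shifts `(u, v)`, a hole or reflection in the depth map trips the depth check, and a stale calibration silently corrupts `T_base_cam`.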

5. What's Still Missing?

VLMs can see, describe, and point. What can't they do? What additional capability does a VLA need that a VLM lacks?

Answer: Action prediction. A VLM can say "the cup is at (0.3, 0.5)" but can't predict the motor commands (joint angles, gripper width, trajectory) to actually pick it up. That requires action tokenization and policy learning — Phase VI.

Concept Map

Draw or sketch this on paper:

                    CLIP
                   /    \
          SigLIP     OpenCLIP
             |           |
         Idefics2    LLaVA-NeXT
             |           |
         Perceiver    MLP bridge
             |           |
          Flamingo    Q-Former ──── BLIP-2
                         |
                    InstructBLIP

         CoCa ──── dual objective ──── PaLI
                                        |
                              PaLI-3 (SigLIP)

    Grounding:  Florence-2, Ferret, Kosmos-2
         |
    Robot perception pipeline
         |
    → Phase VI: VLAs

Spaced Repetition Cards

| Front | Back |
| --- | --- |
| CLIP loss formula | Symmetric cross-entropy on similarity matrix with learned temperature |
| SigLIP improvement over CLIP | Pairwise sigmoid loss: no softmax over the batch, so it scales better across GPUs |
| LLaVA bridge | 2-layer MLP from CLIP ViT to LLM embedding space (LLaVA-1.5; the original LLaVA used a single linear layer) |
| BLIP-2 Q-Former | 32 learnable queries cross-attend to ViT features, 188M params |
| Flamingo gating | `x + tanh(α) · CrossAttn(x, v)`, with gate α initialized to 0 |
| CoCa design | Unimodal text layers (contrastive) + multimodal layers (captioning) |
| Coordinate tokens | Discretize [0,1] → `<loc_XXX>` tokens, model generates as text |
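The Flamingo gating card is worth verifying numerically. The sketch below implements the formula from the card with a bare single-head cross-attention and no learned projections, which is a simplification. It shows that with α = 0 the layer returns its input unchanged, so the pretrained language model's behaviour is preserved at the start of training.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attn(x, v):
    """Single-head cross-attention: text states x attend to visual tokens v."""
    scores = x @ v.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ v

def gated_cross_attn(x, v, alpha):
    """x + tanh(alpha) * CrossAttn(x, v), as on the card above."""
    return x + np.tanh(alpha) * cross_attn(x, v)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))  # 5 text positions (toy sizes)
v = rng.normal(size=(3, 8))  # 3 visual tokens

out_init = gated_cross_attn(x, v, alpha=0.0)  # gate closed: visual branch is silent
out_open = gated_cross_attn(x, v, alpha=1.0)  # gate partly open: visual info mixes in
print(np.allclose(out_init, x), np.allclose(out_open, x))
```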

The Bridge to Phase VI

After Phase V, you understand:

- How to encode images (ViT, DINO, Depth Anything)
- How to align vision and language (CLIP, SigLIP)
- How to bridge vision encoders to LLMs (MLP, Q-Former, Perceiver)
- How to ground language to spatial locations (coordinate tokens)

Phase VI adds the final piece: action tokens. Instead of generating text like "the cup is at (0.3, 0.5)", a VLA generates actions like "move_to(0.2, -0.1, 0.3), grasp(0.04)".
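The step from coordinate tokens to action tokens can be seen as the same uniform binning applied to a different range. The sketch below is a hedged illustration, not any specific model's tokenizer: the 1000-bin count is inferred from the three-digit `<loc_XXX>` pattern, while the 256 action bins, the `<act_XXX>` token name, and the action ranges are all hypothetical.

```python
def to_bin(value, low, high, num_bins):
    """Uniformly quantize a value in [low, high] to an integer bin index."""
    clipped = min(max(value, low), high)
    return min(int((clipped - low) / (high - low) * num_bins), num_bins - 1)

def from_bin(index, low, high, num_bins):
    """Decode a bin index back to the centre of its bin."""
    return low + (index + 0.5) / num_bins * (high - low)

def coord_token(value, num_bins=1000):
    """Normalized image coordinate in [0, 1] -> text token."""
    return f"<loc_{to_bin(value, 0.0, 1.0, num_bins):03d}>"

def action_token(value, low, high, num_bins=256):
    """Continuous action dimension -> text token (hypothetical naming)."""
    return f"<act_{to_bin(value, low, high, num_bins):03d}>"

# VLM output: "the cup is at (0.3, 0.5)"
print([coord_token(c) for c in (0.3, 0.5)])

# VLA output: move_to(0.2, -0.1, 0.3), grasp(0.04), with assumed ranges in metres
position_range = (-0.5, 0.5)
gripper_range = (0.0, 0.08)
tokens = [action_token(p, *position_range) for p in (0.2, -0.1, 0.3)]
tokens.append(action_token(0.04, *gripper_range))
print(tokens)
```

Decoding with `from_bin` recovers each value to within half a bin width, so the bin count sets the precision of both pointing and acting.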

VLMs see and describe. VLAs see and ACT. You're ready for the final phase.

Key Takeaways

- Phase V is the bridge. It connected transformers (Phase II), LLMs (Phase III), and vision (Phase IV).
- Simple bridges work. A LLaVA-style MLP projection has matched or beaten Q-Former-based models on many benchmarks, although the bridge is not the only thing that differs between those systems.
- Grounding is essential. Without spatial coordinates, VLMs can't drive robot actions.
- The pattern repeats. Same transformer, new tokenization: from words to patches to coordinates to actions.
- One phase left. VLAs = VLMs + action prediction.

Connection to the Thread

Phase V is complete in spirit. The capstone (Days 67-68) will verify your understanding. Then Days 69-70 add practical VLM fine-tuning skills. After that: Phase VI — where models learn to act.