Week 9: Phase IV Capstone + VLMs

Days 57–63 · 17.5 hours

This week closes Phase IV with a capstone project, then launches Phase V: Vision-Language Models. CLIP, Flamingo, BLIP-2, and LLaVA each connect visual features to language, from contrastive alignment in a shared embedding space to multi-turn visual chat.

Daily Lessons

Day | Topic | Phase | Focus
57 | Phase IV Capstone Day 1 | IV | Multi-modal feature extractor
58 | Phase IV Capstone Day 2 | IV | Evaluation + checkpoint
59 | CLIP — Contrastive VL Learning | V | Dual encoder, zero-shot transfer
60 | CLIP Internals + SigLIP | V | Temperature, SigLIP loss
61 | Flamingo & BLIP-2 | V | Perceiver resampler, Q-Former
62 | LLaVA — Visual Instruction Tuning | V | MLP projection, visual chat
63 | PaLI & CoCa | V | Contrastive + captioning

Key Concepts

  • Phase IV capstone: combine ViT + depth + detection into a unified perception pipeline
  • CLIP: contrastive learning aligns vision and language in a shared embedding space
  • Frozen LLMs + visual adapters: Flamingo, BLIP-2, and LLaVA each solve the vision→language bridge differently
  • From captioning to conversation: visual instruction tuning enables multi-turn visual dialogue
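
The contrastive alignment behind CLIP can be sketched in a few lines. The example below is a minimal, dependency-free illustration rather than CLIP's actual implementation. It assumes a batch of already-computed image and text embeddings, where row i of each list belongs to the same image-caption pair. The helper names (l2_normalize, cross_entropy_diag, clip_loss) are hypothetical. The temperature is fixed at 0.07 here, whereas CLIP treats it as a learned parameter.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings (a sketch)."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = cosine similarity of image i and text j, scaled by 1/temperature
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
              for img in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    # Average the image-to-text and text-to-image cross-entropies
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy batch: matched pairs give a low loss, mismatched pairs a high one.
aligned = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

Each image is pulled toward its own caption and pushed away from every other caption in the batch, and the same happens in the text-to-image direction. A lower temperature sharpens the softmax, so near-misses are penalized more heavily.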

Study Notes References