Days 57–63 · 17.5 hours
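This week closes Phase IV with a two-day capstone project, then launches Phase V: Vision-Language Models. CLIP, SigLIP, Flamingo, BLIP-2, LLaVA, PaLI, and CoCa each show a different way to connect a vision encoder to language, from contrastive dual encoders to visual instruction tuning.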
| Day | Topic | Phase | Focus |
|---|---|---|---|
| 57 | Phase IV Capstone Day 1 | IV | Multi-modal feature extractor |
| 58 | Phase IV Capstone Day 2 | IV | Evaluation + checkpoint |
| 59 | CLIP — Contrastive VL Learning | V | Dual encoder, zero-shot transfer |
| 60 | CLIP Internals + SigLIP | V | Temperature, SigLIP loss |
| 61 | Flamingo & BLIP-2 | V | Perceiver resampler, Q-Former |
| 62 | LLaVA — Visual Instruction Tuning | V | MLP projection, visual chat |
| 63 | PaLI & CoCa | V | Contrastive + captioning |
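
As a preview of Days 59 and 60, here is a minimal NumPy sketch contrasting the softmax-based CLIP loss with the sigmoid-based SigLIP loss on a batch of paired embeddings. The function names are hypothetical, and the default `temperature`, `t`, and `b` are fixed here only for illustration; in the published models they are learnable parameters, and the embeddings would come from trained image and text encoders rather than random vectors.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_loss(img, txt, temperature=0.07):
    # Symmetric cross-entropy over the similarity matrix: row i should
    # pick column i (image to text), and column i should pick row i.
    img, txt = l2_normalize(img), l2_normalize(txt)
    logits = img @ txt.T / temperature
    idx = np.arange(len(img))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_i2t + loss_t2i) / 2

def siglip_loss(img, txt, t=10.0, b=-10.0):
    # Every image-text pair is an independent binary decision:
    # label +1 on the diagonal (matched), -1 elsewhere (unmatched).
    img, txt = l2_normalize(img), l2_normalize(txt)
    logits = img @ txt.T * t + b
    labels = 2 * np.eye(len(img)) - 1
    # -log(sigmoid(z)) equals logaddexp(0, -z), which avoids overflow.
    return np.logaddexp(0.0, -labels * logits).sum() / len(img)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.normal(size=(4, 16))               # stand-in image embeddings
    txt = img + 0.01 * rng.normal(size=(4, 16))  # nearly matching text embeddings
    print("CLIP loss:  ", clip_loss(img, txt))
    print("SigLIP loss:", siglip_loss(img, txt))
```

The design difference to watch for on Day 60: the CLIP loss normalizes across the whole batch through the softmax, while the SigLIP loss scores each pair independently, so it needs no batch-wide normalization.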