Week 9: Phase IV Capstone + VLMs

Days 57–63 · 17.5 hours

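This week closes Phase IV with a two-day capstone project, then opens Phase V: Vision-Language Models. CLIP and SigLIP align images and text in a shared embedding space; Flamingo, BLIP-2, and LLaVA connect a vision encoder to a language model; PaLI and CoCa combine contrastive and captioning objectives.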

Daily Lessons

Day	Topic	Phase	Focus
57	Phase IV Capstone Day 1	IV	Multi-modal feature extractor
58	Phase IV Capstone Day 2	IV	Evaluation + checkpoint
59	CLIP — Contrastive VL Learning	V	Dual encoder, zero-shot transfer
60	CLIP Internals + SigLIP	V	Temperature, SigLIP loss
61	Flamingo & BLIP-2	V	Perceiver resampler, Q-Former
62	LLaVA — Visual Instruction Tuning	V	MLP projection, visual chat
63	PaLI & CoCa	V	Contrastive + captioning

Key Concepts

Phase IV capstone: combine ViT + depth + detection into a unified perception pipeline
CLIP: contrastive learning aligns vision and language in a shared embedding space
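LLMs + visual adapters: Flamingo (Perceiver resampler), BLIP-2 (Q-Former), and LLaVA (MLP projection) each solve the vision→language bridge differently. Flamingo and BLIP-2 keep the LLM frozen; LLaVA trains the projection first, then also fine-tunes the LLM during visual instruction tuning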
From captioning to conversation: visual instruction tuning enables multi-turn visual dialogue
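
To make the CLIP concept above concrete, here is a minimal sketch of a symmetric contrastive loss over a batch of paired image and text embeddings. It is plain Python on toy vectors, not the reference implementation. The function names are hypothetical, and the fixed temperature of 0.07 is an assumption for illustration (in practice the temperature is usually a learned parameter).

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix.

    Row k of image_embs is assumed to be paired with row k of text_embs,
    so the correct "class" for row k is column k.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # logits[i][j] = cosine similarity of image i and text j, scaled by 1/temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    # Image-to-text direction: each image should pick out its own caption.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n

    # Text-to-image direction: same matrix, transposed.
    logits_t = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(logits_t[k])[k] for k in range(n)) / n

    return 0.5 * (loss_i2t + loss_t2i)


# Toy batch: two pairs whose embeddings already agree, and the same batch
# with the captions swapped so every pair is mismatched.
aligned_loss = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each image is most similar to its own caption and grows large when the pairs are swapped, which is the signal that pulls matching pairs together in the shared embedding space.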

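The MLP projection listed for Day 62 can be sketched in a similar spirit. The example below maps patch features from a vision encoder into the LLM's embedding size and prepends them to the text token embeddings. Everything here is an assumption for illustration: the two-layer MLP with GELU, the toy dimensions, the random weights, and all function names. Real models use learned weights and far larger sizes, and the exact projector design varies between model versions.

```python
import math
import random


def gelu(x):
    """Exact GELU activation."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def init_linear(in_dim, out_dim, rng):
    """Random weights (out_dim rows, in_dim columns) and a zero bias."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def linear(x, weight, bias):
    """Apply y = W x + b to a single vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def project_patches(patch_feats, w1, b1, w2, b2):
    """Map each visual patch feature to the LLM embedding size with a 2-layer MLP."""
    projected = []
    for feat in patch_feats:
        hidden = [gelu(h) for h in linear(feat, w1, b1)]
        projected.append(linear(hidden, w2, b2))
    return projected


rng = random.Random(0)
VISION_DIM, LLM_DIM = 4, 6  # toy sizes, chosen only for this sketch

w1, b1 = init_linear(VISION_DIM, LLM_DIM, rng)
w2, b2 = init_linear(LLM_DIM, LLM_DIM, rng)

# Stand-ins for vision-encoder patch features and LLM text token embeddings.
patch_feats = [[rng.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(3)]
text_embs = [[rng.gauss(0, 1) for _ in range(LLM_DIM)] for _ in range(2)]

# Projected patches become "visual tokens" placed ahead of the text tokens.
visual_tokens = project_patches(patch_feats, w1, b1, w2, b2)
llm_input = visual_tokens + text_embs
```

After projection the visual tokens have the same width as the text embeddings, so the language model can consume both as one sequence.
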
Study Notes References