Week 10: VLM Practice

Days 64–70 · 17.5 hours

This week surveys the open VLM landscape, tackles spatial reasoning, reflects on the gap between VLMs and VLAs, then closes Phase V with capstone projects and hands-on fine-tuning.

Daily Lessons

Day	Topic	Phase	Focus
64	Open VLM Landscape	V	InternVL, Qwen-VL, Phi-3-Vision
65	Spatial Reasoning & Grounding	V	Visual grounding, referring expressions
66	Stop & Reflect #4	V	From seeing to acting
67	Phase V Capstone Day 1	V	VLM inference pipeline
68	Phase V Capstone Day 2	V	Evaluation + checkpoint
69	VLM Fine-Tuning Day 1	V	LoRA fine-tuning on custom data
70	VLM Fine-Tuning Day 2	V	Evaluation vs base model

Key Concepts

Open VLM ecosystem: InternVL, Qwen-VL, Phi-3-Vision, Idefics2 — strengths, trade-offs, and benchmarks
Spatial reasoning: grounding language to image regions — the capability VLAs need most
VLM fine-tuning: LoRA on visual instruction data, custom domain adaptation
The VLM→VLA gap: VLMs describe actions in words; VLAs output executable motor commands
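
To make the LoRA idea concrete ahead of Days 69–70, here is a minimal NumPy sketch of the low-rank update applied to a single linear layer. It is an illustration under stated assumptions, not the fine-tuning code used in the lessons: the layer sizes, rank r, scaling alpha, and the helper name lora_forward are all hypothetical, and a real VLM fine-tune would normally go through a library rather than hand-written matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen for illustration only.
d_in, d_out, r, alpha = 64, 32, 4, 8

W = rng.normal(size=(d_out, d_in))            # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))    # trainable, small random init
B = np.zeros((d_out, r))                      # trainable, zero init


def lora_forward(x, W, A, B, alpha, r):
    """Compute y = x W^T + (alpha / r) * x A^T B^T (hypothetical helper)."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T


x = rng.normal(size=(5, d_in))

# Because B starts at zero, the adapted layer initially matches the base layer.
y_base = x @ W.T
y_lora = lora_forward(x, W, A, B, alpha, r)

# Only A and B would be trained, so the trainable parameter count is far smaller.
full_params = W.size
lora_params = A.size + B.size

# After training, the update can be folded into W so inference costs nothing extra.
B_trained = rng.normal(size=(d_out, r))       # stand-in for a trained B
W_merged = W + (alpha / r) * (B_trained @ A)
y_merged = x @ W_merged.T
y_adapter = lora_forward(x, W, A, B_trained, alpha, r)

print(f"full: {full_params} params, LoRA: {lora_params} params")
```

With these sizes the adapter trains 384 parameters instead of 2,048, and the merged weight reproduces the adapter's output, which is why a LoRA-tuned model can be compared against the base model at the same inference cost.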

Study Notes References

10 — Vision-Language Models