
Week 10: VLM Practice

Days 64–70 · 17.5 hours

This week surveys the open VLM landscape, tackles spatial reasoning, reflects on the gap between VLMs and VLAs, then closes Phase V with capstone projects and hands-on fine-tuning.

Daily Lessons

Day | Topic | Phase | Focus
64 | Open VLM Landscape | V | InternVL, Qwen-VL, Phi-3-Vision
65 | Spatial Reasoning & Grounding | V | Visual grounding, referring expressions
66 | Stop & Reflect #4 | V | From seeing to acting
67 | Phase V Capstone Day 1 | V | VLM inference pipeline
68 | Phase V Capstone Day 2 | V | Evaluation + checkpoint
69 | VLM Fine-Tuning Day 1 | V | LoRA fine-tuning on custom data
70 | VLM Fine-Tuning Day 2 | V | Evaluation vs base model

Key Concepts

  • Open VLM ecosystem: InternVL, Qwen-VL, Phi-3-Vision, Idefics2 — strengths, trade-offs, and benchmarks
  • Spatial reasoning: grounding language to image regions — the capability VLAs need most
  • VLM fine-tuning: LoRA on visual instruction data, custom domain adaptation
  • The VLM→VLA gap: VLMs describe actions in words; VLAs output executable motor commands
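
To make the grounding concept above concrete, here is a minimal sketch of how a predicted image region might be scored against a reference region. It assumes boxes are given as (x_min, y_min, x_max, y_max) in pixels and uses intersection-over-union (IoU) with a 0.5 threshold, a common convention for referring-expression benchmarks. The function names `box_iou` and `grounding_accuracy` are hypothetical and not taken from the course materials.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix_min = max(a[0], b[0])
    iy_min = max(a[1], b[1])
    ix_max = min(a[2], b[2])
    iy_max = min(a[3], b[3])

    # Clamp to zero so disjoint boxes give no intersection area.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: List[Box], targets: List[Box], threshold: float = 0.5
) -> float:
    """Fraction of predictions whose IoU with the target meets the threshold."""
    if not preds:
        return 0.0
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)


if __name__ == "__main__":
    # Hypothetical example: one exact match, one complete miss.
    predicted = [(0, 0, 10, 10), (0, 0, 10, 10)]
    reference = [(0, 0, 10, 10), (20, 20, 30, 30)]
    print(grounding_accuracy(predicted, reference))
```

The same metric could serve on Days 68 and 70 to compare a fine-tuned model against its base model, provided both are asked to output boxes in the same coordinate format.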

Study Notes References