Week 8: 3D Vision + Video

Days 50–56 · 17.5 hours

This week extends vision transformers beyond 2D images into depth, point clouds, video, and detection. You'll build toward the multi-modal perception stack that vision-language-action models (VLAs) need.

Daily Lessons

Day  Topic                       Phase  Focus
50   Stop & Reflect #3           IV     Universal tokenization insight
51   3D Vision & Depth           IV     Monocular depth, MiDaS
52   Point Clouds & 3D Scenes    IV     PointNet, 3D for robotics
53   Video Understanding Day 1   IV     Temporal attention, TimeSformer
54   Video Understanding Day 2   IV     VideoMAE, video-text pretraining
55   DETR + Florence-2 + SAM 2   IV     Detection as set prediction
56   Vision-Language Bridge      IV     Connecting vision to LLMs

Key Concepts

  • Universal tokenization: images, depth maps, point clouds, and video frames all become token sequences
  • 3D perception for robotics: depth estimation and point cloud understanding for manipulation and navigation
  • Video as temporal token sequences: extending spatial attention to temporal coherence
  • Vision-language bridge: projection layers, cross-attention, and Q-Former as the gateway to vision-language models (VLMs)
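
The universal tokenization idea above can be made concrete with a small sketch. It is illustrative only, not the implementation used in any of the models named this week: the helper names (`patchify_image`, `patchify_video`, `project`), the 8-pixel patch size, the 64-dimensional token width, and the random stand-in weights are all assumptions chosen for readability. It shows an image, a depth map, a short video clip, and a point cloud each being turned into a token sequence of the same width, so that a single transformer could attend over all of them.

```python
import numpy as np

def patchify_image(image, patch):
    """Split an (H, W, C) array into (num_patches, patch*patch*C) flat tokens."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must divide by patch"
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # -> (rows, cols, patch, patch, C)
    return x.reshape(-1, patch * patch * c)

def patchify_video(video, patch):
    """Tokenize a (T, H, W, C) clip frame by frame and stack along the sequence."""
    return np.concatenate([patchify_image(f, patch) for f in video], axis=0)

def project(tokens, d_model, rng):
    """Hypothetical linear projection; random weights stand in for learned ones."""
    w = rng.standard_normal((tokens.shape[1], d_model)) / np.sqrt(tokens.shape[1])
    return tokens @ w

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))     # RGB image
depth = rng.random((32, 32, 1))     # single-channel depth map
video = rng.random((4, 32, 32, 3))  # 4-frame clip
points = rng.random((128, 3))       # point cloud, one xyz point per token

d_model = 64  # assumed shared token width
image_tokens = project(patchify_image(image, 8), d_model, rng)
depth_tokens = project(patchify_image(depth, 8), d_model, rng)
video_tokens = project(patchify_video(video, 8), d_model, rng)
point_tokens = project(points, d_model, rng)

# Every modality is now a (sequence_length, d_model) array and can be concatenated.
sequence = np.concatenate(
    [image_tokens, depth_tokens, video_tokens, point_tokens], axis=0
)
print(sequence.shape)
```

Each modality uses its own projection because the raw token sizes differ (192 values per RGB patch, 64 per depth patch, 3 per point). A real model would also add positional and temporal embeddings, which this sketch leaves out. A learned linear projection into a shared width is also the simplest form of the projection-layer bridge listed above.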

Study Notes References