Week 8: 3D Vision + Video

Days 50–56 · 17.5 hours

This week extends vision transformers beyond 2D images to depth maps, point clouds, video, and object detection. You'll build toward the multi-modal perception stack that vision-language-action models (VLAs) need.

Daily Lessons

Day	Topic	Phase	Focus
50	Stop & Reflect #3	IV	Universal tokenization insight
51	3D Vision & Depth	IV	Monocular depth, MiDaS
52	Point Clouds & 3D Scenes	IV	PointNet, 3D for robotics
53	Video Understanding Day 1	IV	Temporal attention, TimeSformer
54	Video Understanding Day 2	IV	VideoMAE, video-text pretraining
55	DETR + Florence-2 + SAM 2	IV	Detection as set prediction
56	Vision-Language Bridge	IV	Connecting vision to LLMs

Key Concepts

Universal tokenization: images, depth maps, point clouds, and video frames all become token sequences
3D perception for robotics: depth estimation and point cloud understanding for manipulation and navigation
Video as temporal token sequences: extending spatial attention across frames to capture temporal coherence
Vision-language bridge: projection layers, cross-attention, and Q-Former as the gateway to vision-language models (VLMs)
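
The universal tokenization idea can be made concrete with a small sketch. The NumPy code below is illustrative only and assumes a few things not specified in this curriculum: a 16-pixel patch size, 224×224 inputs, a shared width of 64, and random matrices standing in for learned projection weights. The helper names (`patchify_image`, `patchify_video`, `project`) are hypothetical. Treating each raw xyz point as one token is a simplification; models such as PointNet apply learned per-point networks rather than a single linear map.

```python
import numpy as np

def patchify_image(image, patch):
    """Split an (H, W, C) array into flattened non-overlapping patches: (N, patch*patch*C)."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)

def patchify_video(video, patch):
    """Tokenize a (T, H, W, C) clip frame by frame: (T * N, patch*patch*C)."""
    return np.concatenate([patchify_image(frame, patch) for frame in video], axis=0)

def project(tokens, d_model, rng):
    """Map raw tokens to a shared width (random weights stand in for learned ones)."""
    in_dim = tokens.shape[1]
    weights = rng.standard_normal((in_dim, d_model)) / np.sqrt(in_dim)
    return tokens @ weights

rng = np.random.default_rng(0)
rgb = rng.random((224, 224, 3))      # RGB image
depth = rng.random((224, 224, 1))    # single-channel depth map
clip = rng.random((8, 224, 224, 3))  # 8-frame video clip
points = rng.random((1024, 3))       # point cloud: one xyz point per raw token (simplification)

d_model = 64  # assumed shared token width
tokens = {
    "rgb": project(patchify_image(rgb, 16), d_model, rng),
    "depth": project(patchify_image(depth, 16), d_model, rng),
    "video": project(patchify_video(clip, 16), d_model, rng),
    "points": project(points, d_model, rng),
}
shapes = {name: t.shape for name, t in tokens.items()}
for name, shape in shapes.items():
    print(name, shape)
```

Every modality ends up as a `(num_tokens, d_model)` array, so the same transformer layers can consume any of them. The final linear map is also the simplest form of the projection-layer bridge: in a VLM, vision tokens are projected to the language model's embedding width before being passed to it.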

Study Notes References