Days 50–56 · 17.5 hours
This week extends vision transformers beyond 2D image classification into depth estimation, point clouds, video, and object detection. You'll build toward the multi-modal perception stack that vision-language-action models (VLAs) need.
| Day | Topic | Phase | Focus |
|---|---|---|---|
| 50 | Stop & Reflect #3 | IV | Universal tokenization insight |
| 51 | 3D Vision & Depth | IV | Monocular depth, MiDaS |
| 52 | Point Clouds & 3D Scenes | IV | PointNet, 3D for robotics |
| 53 | Video Understanding Day 1 | IV | Temporal attention, TimeSformer |
| 54 | Video Understanding Day 2 | IV | VideoMAE, video-text pretraining |
| 55 | DETR + Florence-2 + SAM 2 | IV | Detection as set prediction |
| 56 | Vision-Language Bridge | IV | Connecting vision to LLMs |
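
As a rough illustration of the "universal tokenization" idea listed for Day 50, the sketch below turns an RGB image, a depth map, a video clip, and a point cloud into the same kind of object: a `(num_tokens, d_model)` array a transformer could consume. It is a minimal NumPy sketch under stated assumptions, not the course's implementation. The function names (`patchify_image`, `tubelets_video`, `group_points`, `embed`), the patch and tubelet sizes, and `d_model = 64` are all hypothetical, and the point grouping (sort by x, then chunk) is a crude stand-in for the sampling and neighbor grouping that real point-cloud models use.

```python
import numpy as np

def patchify_image(img, p):
    """(H, W, C) image -> (N, p*p*C) tokens from non-overlapping p x p patches."""
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0, "image size must be divisible by patch size"
    x = img.reshape(H // p, p, W // p, p, C)
    x = x.transpose(0, 2, 1, 3, 4)          # (H/p, W/p, p, p, C)
    return x.reshape(-1, p * p * C)

def tubelets_video(vid, t, p):
    """(T, H, W, C) clip -> (N, t*p*p*C) tokens from t x p x p space-time tubelets."""
    T, H, W, C = vid.shape
    assert T % t == 0 and H % p == 0 and W % p == 0
    x = vid.reshape(T // t, t, H // p, p, W // p, p, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)    # (T/t, H/p, W/p, t, p, p, C)
    return x.reshape(-1, t * p * p * C)

def group_points(pts, k):
    """(P, 3) points -> (P//k, k*3) tokens.

    Assumption: sorting by x and chunking is only a placeholder for a
    proper grouping scheme; leftover points are dropped.
    """
    n = pts.shape[0] // k
    order = np.argsort(pts[:, 0])
    return pts[order][: n * k].reshape(n, k * 3)

def embed(tokens, d_model, rng):
    """Linear projection of raw tokens to a shared width (random weights here)."""
    W = rng.standard_normal((tokens.shape[1], d_model)) * 0.02
    return tokens @ W

rng = np.random.default_rng(0)
d_model = 64  # hypothetical shared embedding width

rgb    = patchify_image(rng.random((224, 224, 3)), p=16)        # (196, 768)
depth  = patchify_image(rng.random((224, 224, 1)), p=16)        # (196, 256)
video  = tubelets_video(rng.random((8, 224, 224, 3)), t=2, p=16)  # (784, 1536)
points = group_points(rng.random((1024, 3)), k=32)              # (32, 96)

# Every modality ends up as a (num_tokens, d_model) sequence.
sequences = {name: embed(tok, d_model, rng)
             for name, tok in [("rgb", rgb), ("depth", depth),
                               ("video", video), ("points", points)]}
for name, seq in sequences.items():
    print(name, seq.shape)
```

The raw token widths differ per modality, but after the per-modality projection every sequence shares the same width, which is what lets one transformer backbone attend over any of them.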