Days 45–49 · 12.5 hours
This week opens Phase IV by applying the transformer, the same architecture you mastered in Phase II, to images. The key idea: split an image into fixed-size patches and embed each patch as a token, and the image becomes a sequence that a standard transformer can process.
| Day | Topic | Phase | Focus |
|---|---|---|---|
| 45 | ViT — An Image Is Worth 16×16 Words | IV | Patch embedding, [CLS] token |
| 46 | Training ViT + DeiT | IV | Data augmentation, distillation |
| 47 | Swin Transformer | IV | Shifted windows, hierarchical features |
| 48 | DINO & Self-Supervised Vision | IV | Self-distillation, attention maps |
| 49 | MAE — Masked Autoencoders | IV | 75% masking, reconstruction |
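
To make the "image as a sequence of patch tokens" idea concrete before Day 45, here is a minimal sketch in plain Python. The function name `patchify` and the 224×224 RGB toy input are assumptions chosen for illustration, not part of the course material. The sketch covers only the splitting and flattening step; a real ViT would follow it with a learned linear projection, a [CLS] token, and position embeddings.

```python
def patchify(image, patch_size):
    """Split an H x W x C nested-list image into flattened patch tokens.

    Patches are taken in row-major order (left to right, top to bottom).
    Each token is a flat list of patch_size * patch_size * C values.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")

    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch: rows, then pixels within a row, then channels.
            tokens.append([
                channel
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for channel in pixel
            ])
    return tokens


# Toy all-zero RGB image; 224 x 224 is an assumed, commonly used input size.
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
tokens = patchify(image, 16)
print(len(tokens), len(tokens[0]))  # 196 768
```

With 16×16 patches, a 224×224 image yields 14 × 14 = 196 tokens, each holding 16 × 16 × 3 = 768 raw values. From the transformer's point of view this is simply a sequence of length 196.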