
Week 7: Vision Transformers

Days 45–49 · 12.5 hours

This week opens Phase IV by applying the transformer — the same architecture you mastered in Phase II — to images. You'll see that an image is just a sequence of patch tokens.
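
To make the "sequence of patch tokens" idea concrete, here is a minimal NumPy sketch of the front end of a ViT. It is an illustration under stated assumptions, not a reference implementation: the 224×224 RGB input, the embedding width `d_model = 192`, and the random matrices standing in for learned weights are all chosen for the example.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    # One row per patch, in row-major order over the patch grid.
    return x.reshape(gh * gw, patch * patch * c)


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))   # assumed input size
tokens = patchify(image)            # 14 x 14 = 196 patches, 16*16*3 = 768 values each

d_model = 192                       # hypothetical embedding width
# Random stand-ins for parameters that would be learned in a real model.
w_proj = rng.normal(scale=0.02, size=(tokens.shape[1], d_model))
cls_token = np.zeros((1, d_model))
pos_embed = rng.normal(scale=0.02, size=(tokens.shape[0] + 1, d_model))

# Linearly project each patch, prepend [CLS], add position embeddings.
sequence = np.concatenate([cls_token, tokens @ w_proj], axis=0) + pos_embed
print(tokens.shape, sequence.shape)  # (196, 768) (197, 192)
```

The resulting `sequence` has the same shape a text transformer expects (tokens × width), which is why the encoder from Phase II can be reused unchanged.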

Daily Lessons

| Day | Topic | Phase | Focus |
|-----|-------|-------|-------|
| 45 | ViT: An Image Is Worth 16×16 Words | IV | Patch embedding, [CLS] token |
| 46 | Training ViT + DeiT | IV | Data augmentation, distillation |
| 47 | Swin Transformer | IV | Shifted windows, hierarchical features |
| 48 | DINO & Self-Supervised Vision | IV | Self-distillation, attention maps |
| 49 | MAE: Masked Autoencoders | IV | 75% masking, reconstruction |

Key Concepts

  • Images as sequences: split into 16×16 patches, linearly project, add position embeddings → transformer encoder
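  • Efficient attention: Swin computes self-attention inside fixed-size local windows, so cost grows as $O(n)$ in the number of patches $n$ instead of the $O(n^2)$ of global attention; shifting the windows between consecutive layers lets information cross window boundaries
  • Self-supervised vision: DINO trains by self-distillation without labels, and its attention maps turn out to segment objects; MAE learns by reconstructing masked patches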
  • Connection to BERT: MAE is the visual analog of masked language modeling
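
The MAE masking step in the last two bullets can be sketched in a few lines of NumPy. This is a simplified illustration, not the authors' code: the function name `random_masking`, the 196×768 token array, and the choice to return the kept indices in sorted order are assumptions made for readability.

```python
import numpy as np


def random_masking(tokens: np.ndarray, mask_ratio: float = 0.75, rng=None):
    """Keep a random subset of patch tokens; report which ones were masked."""
    rng = np.random.default_rng() if rng is None else rng
    n = tokens.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    # Shuffle patch indices and keep the first n_keep of them.
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    mask = np.ones(n, dtype=bool)   # True = masked (to be reconstructed)
    mask[keep_idx] = False
    return tokens[keep_idx], mask, keep_idx


tokens = np.random.default_rng(0).random((196, 768))  # assumed patch sequence
visible, mask, keep_idx = random_masking(tokens, rng=np.random.default_rng(1))
print(visible.shape, int(mask.sum()))  # (49, 768) 147
```

Only the `visible` tokens would be fed to the encoder; a decoder then predicts the masked patches, and the reconstruction loss is computed on positions where `mask` is true. Compared with BERT, which typically masks about 15% of tokens, the 75% ratio is much higher.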

Study Notes References