Week 7: Vision Transformers

Days 45–49 · 12.5 hours

This week opens Phase IV by applying the transformer, the same architecture you mastered in Phase II, to images. You'll see that, from the model's point of view, an image is just a sequence of patch tokens.

Daily Lessons

Day	Topic	Phase	Focus
45	ViT — An Image Is Worth 16×16 Words	IV	Patch embedding, [CLS] token
46	Training ViT + DeiT	IV	Data augmentation, distillation
47	Swin Transformer	IV	Shifted windows, hierarchical features
48	DINO & Self-Supervised Vision	IV	Self-distillation, attention maps
49	MAE — Masked Autoencoders	IV	75% masking, reconstruction

Key Concepts

Images as sequences: split the image into 16×16-pixel patches, flatten and linearly project each patch, prepend a [CLS] token, add position embeddings → standard transformer encoder
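Efficient attention: Swin restricts self-attention to fixed-size local windows, reducing the cost from $O(n^2)$ to $O(n)$ in the number of patches $n$; shifting the windows between consecutive layers lets information flow across window boundaries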
Self-supervised vision: DINO trains by self-distillation without labels, and its attention maps highlight object boundaries even though it never sees segmentation labels; MAE learns by reconstructing masked patches
Connection to BERT: MAE is the visual analog of masked language modeling
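
The concepts above can be made concrete with a small NumPy sketch. It is illustrative only, not the reference implementation of ViT, Swin, or MAE: the projection and position embeddings are random rather than learned, the [CLS] token is omitted, and the names `patchify`, `embed`, `random_mask`, `attention_pairs`, and the width `d_model = 64` are hypothetical choices made for this example.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C).
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be divisible by patch"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)


def embed(patches: np.ndarray, w_proj: np.ndarray, pos_emb: np.ndarray) -> np.ndarray:
    """Linearly project each patch and add one position embedding per patch."""
    return patches @ w_proj + pos_emb


def random_mask(num_patches: int, mask_ratio: float = 0.75, rng=None):
    """MAE-style random masking: return (visible, masked) patch indices."""
    if rng is None:
        rng = np.random.default_rng()
    num_keep = int(num_patches * (1 - mask_ratio))
    perm = rng.permutation(num_patches)
    return np.sort(perm[:num_keep]), np.sort(perm[num_keep:])


def attention_pairs(num_tokens: int, window_tokens=None) -> int:
    """Count query-key pairs for global attention or for windowed attention."""
    if window_tokens is None:
        return num_tokens ** 2  # every token attends to every token
    assert num_tokens % window_tokens == 0
    num_windows = num_tokens // window_tokens
    return num_windows * window_tokens ** 2  # linear in num_tokens


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))            # stand-in for a real RGB image
patches = patchify(image)                    # shape (196, 768)

d_model = 64                                 # hypothetical embedding width
w_proj = rng.normal(size=(patches.shape[1], d_model)) * 0.02   # random, not learned
pos_emb = rng.normal(size=(patches.shape[0], d_model)) * 0.02  # random, not learned
tokens = embed(patches, w_proj, pos_emb)     # shape (196, 64)

keep, masked = random_mask(len(tokens), 0.75, rng)
visible_tokens = tokens[keep]                # only these would reach an MAE encoder

print(patches.shape, tokens.shape, visible_tokens.shape)
print(attention_pairs(196), attention_pairs(196, window_tokens=49))
```

With a 224×224 input and 16×16 patches the sequence has 14 × 14 = 196 tokens, and 75% masking leaves 49 visible. The last line compares 196² = 38,416 attention pairs for global attention with 4 × 49² = 9,604 when the same tokens are grouped into four 49-token windows; the window grouping here is only meant to show how the count scales, not to reproduce Swin's actual patch size or stage layout.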

Study Notes References