Phase IV: Vision (ViT, 3D, Video) | Week 8 | 2.5 hours
"The transformer doesn't care what tokens are: text, patches, audio frames, point clouds. It just processes sequences." (The universal lesson)
The most important insight from Phase IV so far is that the transformer architecture is modality-agnostic. The same self-attention mechanism that processes words also processes image patches:
Text (Phase II): "The cat sat" → [tok₁, tok₂, tok₃] → Transformer → prediction
↓
Images (Phase IV): [photo] → [patch₁, ..., patch₁₉₆] → Transformer → classification
↓
Audio: [waveform] → [frame₁, ..., frame_N] → Transformer → transcription
↓
Video: [clip] → [frame₁×patch₁, ..., frame_T×patch_N] → Transformer → understanding
↓
3D: [point cloud] → [point₁, ..., point_K] → Transformer → segmentation
↓
Actions (Phase VI): [trajectory] → [action₁, ..., action_T] → Transformer → control
The transformer is a universal sequence processor. Across these modalities the core architecture stays nearly the same; most of the innovation lies in how the input is tokenized.
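To make the pipeline above concrete, here is a minimal sketch in which three different tokenizers feed one shared encoder. The width `D`, the vocabulary size, the 80-dimensional audio frames, and the tokenizer names are all illustrative assumptions, and position embeddings are left out for brevity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
D = 64  # shared token width (illustrative assumption)

# One encoder, reused unchanged for every modality
layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, dim_feedforward=128, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2).eval()

# Modality-specific tokenizers (hypothetical, for illustration only)
text_tokenizer = nn.Embedding(1000, D)                         # token ids -> vectors
patch_tokenizer = nn.Conv2d(3, D, kernel_size=16, stride=16)   # 16x16 patches -> vectors
audio_tokenizer = nn.Linear(80, D)                             # spectrogram frames -> vectors

text = torch.randint(0, 1000, (2, 3))      # e.g. "The cat sat"
image = torch.randn(2, 3, 224, 224)
audio = torch.randn(2, 50, 80)

tokens = {
    "text": text_tokenizer(text),                                   # (2, 3, D)
    "image": patch_tokenizer(image).flatten(2).transpose(1, 2),     # (2, 196, D)
    "audio": audio_tokenizer(audio),                                # (2, 50, D)
}

outputs = {}
with torch.no_grad():
    for name, seq in tokens.items():
        outputs[name] = encoder(seq)  # same weights, different sequence lengths
        print(name, tuple(seq.shape), "->", tuple(outputs[name].shape))
```

Only the tokenizers differ per modality; the encoder never sees anything but a `(batch, length, D)` tensor.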
Spend 15 minutes writing answers (without looking at notes):
1. Explain ViT to a colleague in 3 sentences. Include: patch embedding, position embedding, [CLS] token.
2. Why does Swin Transformer exist? What problem does it solve that ViT doesn't?
3. DINO vs MAE: Both are self-supervised. What's the key difference in their learning signal? When would you choose one over the other?
4. The masking insight: Why can MAE mask 75% of patches (vs BERT's 15%)? What does this tell you about visual vs linguistic redundancy?
5. Position embeddings: ViT uses learned 1D positions. Swin uses relative position bias. Why does the 1D approach work at all for a 2D image?
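Once you have written your answers, one way to check the first question is against a small sketch of the ViT input stage. The class name `ViTInput` and the width `dim=192` are hypothetical choices for illustration; the structure (patch projection, prepended [CLS] token, learned 1D position embedding) follows the three components named in the question.

```python
import torch
import torch.nn as nn

class ViTInput(nn.Module):
    """Hypothetical minimal ViT front end: patches + [CLS] + learned positions."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=192):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch
        # and applying one shared linear projection.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # One learned vector per sequence index (1D), including the [CLS] slot
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)

    def forward(self, x):
        x = self.proj(x).flatten(2).transpose(1, 2)       # (B, N, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)   # (B, 1, dim)
        x = torch.cat([cls, x], dim=1)                    # (B, N + 1, dim)
        return x + self.pos_embed

tokens = ViTInput()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 192])
```

Each index in `pos_embed` always corresponds to the same grid cell, which is a useful hint for the last question: the model is free to learn 2D structure from a 1D index.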
Self-Attention (Day 9)
│
├── Encoder-only: BERT (Day 21) ──── masked prediction
│ │
│ ┌───────────────┤
│ │ │
│ MAE (Day 49) BEiT (visual tokens)
│
├── Decoder-only: GPT (Day 23) ──── autoregressive prediction
│
├── Enc-Dec: T5 (Day 28) ──── seq2seq
│
└── Vision:
├── ViT (Day 45) ──── global attention, [CLS] classification
│ │
│ ├── DeiT (Day 46) ──── training recipe, distillation token
│ │
│ └── DINO (Day 48) ──── self-distillation, emergent segmentation
│
└── Swin (Day 47) ──── windowed attention, hierarchical features
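The tree contrasts ViT's global attention with Swin's windowed attention. A rough way to see why that matters is to count query-key pairs per attention layer. The sketch below ignores heads, channels, and shifted windows, and the helper names are made up for illustration. The 56×56 grid is what a 224×224 image yields with 4×4 patches.

```python
def attention_pairs_global(h, w):
    """Every token attends to every token: N^2 pairs."""
    n = h * w
    return n * n

def attention_pairs_windowed(h, w, window=7):
    """Tokens attend only within their own window: (number of windows) * (M^2)^2 pairs."""
    assert h % window == 0 and w % window == 0, "grid must tile evenly into windows"
    num_windows = (h // window) * (w // window)
    tokens_per_window = window * window
    return num_windows * tokens_per_window ** 2

print(attention_pairs_global(14, 14))      # 38416   (ViT, 16x16 patches)
print(attention_pairs_global(56, 56))      # 9834496 (global attention on a fine grid)
print(attention_pairs_windowed(56, 56))    # 153664  (7x7 windows on the same grid)
```

Global attention grows quadratically with the number of tokens, while windowed attention grows linearly for a fixed window size, which is what makes fine, high-resolution grids affordable.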
| NLP Concept | Vision Equivalent | Why It Works |
|---|---|---|
| Word token | Image patch | Both are the basic input units, each mapped to an embedding vector |
| Position embedding | Position embedding / relative bias | Self-attention ignores order unless position is injected |
| BERT masking | MAE masking | Predict hidden parts → learn semantics |
| Knowledge distillation | DINO self-distillation | Soft targets carry richer signal |
| Sentence → [CLS] | Image → [CLS] | Special token for global summary |
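The distillation row can be sketched in a few lines. This is a simplified, single-view version of a DINO-style objective, not the full multi-crop recipe; the temperatures, momentum, and the tiny linear "networks" are illustrative assumptions.

```python
import copy
import torch
import torch.nn.functional as F

def dino_loss(student_logits, teacher_logits, center, student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between sharpened, centered teacher targets and student predictions."""
    # Teacher: subtract the running center, sharpen with a low temperature, no gradient
    teacher_probs = F.softmax((teacher_logits - center) / teacher_temp, dim=-1).detach()
    student_logp = F.log_softmax(student_logits / student_temp, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """Teacher weights follow an exponential moving average of the student weights."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1 - momentum)

torch.manual_seed(0)
student = torch.nn.Linear(32, 16)          # stand-in for backbone + projection head
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)                # the teacher is never trained by backprop

view_a, view_b = torch.randn(8, 32), torch.randn(8, 32)  # stand-ins for two augmented views
center = torch.zeros(1, 16)

loss = dino_loss(student(view_a), teacher(view_b), center)
loss.backward()
torch.optim.SGD(student.parameters(), lr=0.1).step()
ema_update(teacher, student)
print(float(loss))
```

The teacher's full output distribution is the "soft target": it tells the student how similar the input is to every prototype, not just which one wins.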
Without looking at code, implement these from memory, then verify:
import torch

# 1. Patch embedding: image → sequence of flattened patches
def patch_embed(image, patch_size=16):
    """Convert a (B, C, H, W) image to a (B, N, C*P*P) sequence of flattened patches.

    In a full ViT, a linear projection to the model width D follows.
    """
    B, C, H, W = image.shape
    P = patch_size
    N = (H // P) * (W // P)  # rows/columns that don't fill a whole patch are dropped
    # Slide a non-overlapping P×P window over height (dim 2), then width (dim 3)
    patches = image.unfold(2, P, P).unfold(3, P, P)  # (B, C, H/P, W/P, P, P)
    patches = patches.contiguous().view(B, C, N, P * P)
    patches = patches.permute(0, 2, 1, 3).reshape(B, N, C * P * P)
    return patches  # (B, N, patch_dim)

# 2. Random masking for MAE
def random_mask(x, mask_ratio=0.75):
    """Keep a random subset of patches; return it together with the shuffle order."""
    B, N, D = x.shape
    keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=x.device)  # one random score per patch
    ids = noise.argsort(dim=1)  # a random permutation of patch indices per sample
    ids_keep = ids[:, :keep]  # the first `keep` indices are the visible patches
    visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids  # ids.argsort(dim=1) undoes the shuffle later

# 3. Windowed attention partition
def window_partition(x, window_size=7):
    """(B, H, W, C) → (B*nW, M, M, C); H and W must be divisible by M."""
    B, H, W, C = x.shape
    M = window_size
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M, M, C)
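For the "then verify" step, one option is to check that each operation can be undone. The sketch below writes the inverses of `random_mask` and `window_partition`. The helper names `restore_with_mask_tokens` and `window_reverse` are chosen here for illustration, and the masking and partitioning are repeated inline so the block runs on its own.

```python
import torch

def restore_with_mask_tokens(visible, ids, mask_token):
    """Put visible tokens back in original order; masked slots get mask_token."""
    B, keep, D = visible.shape
    N = ids.shape[1]
    ids_restore = ids.argsort(dim=1)                     # inverse of the shuffle
    filler = mask_token.expand(B, N - keep, D)
    shuffled = torch.cat([visible, filler], dim=1)       # still in shuffled order
    return torch.gather(shuffled, 1, ids_restore.unsqueeze(-1).expand(-1, -1, D))

def window_reverse(windows, H, W):
    """(B*nW, M, M, C) -> (B, H, W, C), the inverse of window partitioning."""
    M, C = windows.shape[1], windows.shape[-1]
    B = windows.shape[0] // ((H // M) * (W // M))
    x = windows.reshape(B, H // M, W // M, M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

torch.manual_seed(0)

# Masking round trip: 196 patches, keep 25%
x = torch.randn(2, 196, 8)
ids = torch.rand(2, 196).argsort(dim=1)
keep = int(196 * (1 - 0.75))
visible = torch.gather(x, 1, ids[:, :keep].unsqueeze(-1).expand(-1, -1, 8))
restored = restore_with_mask_tokens(visible, ids, torch.zeros(1, 1, 8))
kept = torch.zeros(2, 196).scatter_(1, ids[:, :keep], 1.0).bool()
print(torch.equal(restored[kept], x[kept]))        # visible patches are back in place
print(bool((restored[~kept] == 0).all()))          # masked slots hold the mask token

# Window round trip: 14x14 feature map, 7x7 windows
feat = torch.randn(2, 14, 14, 8)
windows = feat.view(2, 2, 7, 2, 7, 8).permute(0, 1, 3, 2, 4, 5).reshape(-1, 7, 7, 8)
print(torch.equal(window_reverse(windows, 14, 14), feat))
```

If your from-memory versions are correct, swapping them in for the inline masking and partitioning lines should leave all three checks true.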
# Compare representations: supervised ViT vs DINO vs MAE
import timm
import torch

# Load models with different pretraining (weights are downloaded on first use)
models = {
    'supervised': timm.create_model('vit_small_patch16_224', pretrained=True),
    # Note: this hub entry is DINOv2 (ViT-S with patch size 14), the successor to DINO
    'dino': torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14'),
    # MAE requires separate loading
}

# Extract features on same images → compare with CKA or linear probe
# Which features are most transferable?
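For the CKA comparison, a minimal linear CKA can be written directly from its definition. The sketch below uses random stand-in features rather than real model outputs; in practice `X` and `Y` would each hold one feature vector per image, extracted from two entries of `models` on the same batch.

```python
import torch

def linear_cka(X, Y):
    """Linear CKA between feature matrices X (n, d1) and Y (n, d2) for the same n inputs."""
    X = X - X.mean(dim=0, keepdim=True)   # center each feature dimension
    Y = Y - Y.mean(dim=0, keepdim=True)
    cross = torch.linalg.norm(Y.T @ X) ** 2   # squared Frobenius norm of Y^T X
    norm_x = torch.linalg.norm(X.T @ X)
    norm_y = torch.linalg.norm(Y.T @ Y)
    return (cross / (norm_x * norm_y)).item()

torch.manual_seed(0)
feats_a = torch.randn(512, 16, dtype=torch.float64)

# Same representation up to rotation and scale: CKA should stay near 1
Q, _ = torch.linalg.qr(torch.randn(16, 16, dtype=torch.float64))
feats_rotated = 3.0 * feats_a @ Q

# Unrelated random features of a different width: CKA should be small
feats_unrelated = torch.randn(512, 32, dtype=torch.float64)

print(linear_cka(feats_a, feats_a))
print(linear_cka(feats_a, feats_rotated))
print(linear_cka(feats_a, feats_unrelated))
```

Because CKA is invariant to rotation and uniform scaling, it compares what two models represent rather than which coordinates they happen to use, and it works even when the feature widths differ.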
The rest of Phase IV extends vision beyond 2D images:
| Remaining Days | What's New | Why It Matters for VLAs |
|---|---|---|
| Day 51: 3D & Depth | Monocular depth estimation | Robots perceive 3D world |
| Day 52: Point Clouds | PointNet, 3D transformers | Manipulation needs 3D reasoning |
| Day 53-54: Video | Temporal attention | Actions unfold over time |
| Day 55: DETR + SAM | Object detection + segmentation | Robots must identify objects |
| Day 56: VL Bridge | How to connect vision→language | The foundation for VLMs/VLAs |
You've mastered image understanding with transformers. The real world, however, is 3D and dynamic, and acting in it requires detecting individual objects, not just classifying scenes.
Phase IV has shown that images are just another token type. The rest of the phase extends this to depth, 3D, and video. Then Phase V connects vision to language — the direct precursor to VLAs.