Phase IV — Vision: ViT, 3D, Video | Week 9 | 2.5 hours "Before you build VLMs, prove you understand vision transformers." — Phase IV checkpoint
Run your perception pipeline with different configurations and compare:
```python
import time


def benchmark_pipeline(pipeline, image_paths, config_name="default"):
    """Benchmark perception pipeline on a set of images."""
    if not image_paths:
        raise ValueError("image_paths must not be empty")

    times = []
    n_objects = []
    for path in image_paths:
        # perf_counter is monotonic and better suited to timing than time.time()
        start = time.perf_counter()
        result = pipeline.process(path)
        elapsed = time.perf_counter() - start
        times.append(elapsed)
        n_objects.append(len(result['objects']))

    avg_time = sum(times) / len(times)
    avg_objects = sum(n_objects) / len(n_objects)

    print(f"\n=== {config_name} ===")
    print(f"Avg time: {avg_time:.2f}s")
    print(f"Avg objects: {avg_objects:.1f}")
    print(f"Total time: {sum(times):.2f}s for {len(image_paths)} images")

    return {
        'config': config_name,
        'avg_time': avg_time,
        'avg_objects': avg_objects,
    }


# Ablation configurations: each entry switches pipeline stages on or off
ablations = {
    'Full pipeline': {'scene': True, 'depth': True, 'detection': True},
    'No depth': {'scene': True, 'depth': False, 'detection': True},
    'No scene features': {'scene': False, 'depth': True, 'detection': True},
    'Detection only': {'scene': False, 'depth': False, 'detection': True},
}
```
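One way to drive the ablation table is to build one pipeline per configuration and run all of them on the same images. The sketch below is only an illustration: `StubPipeline` and `run_ablations` are hypothetical names, and the stub returns placeholder objects so the loop can run without a real model. It assumes your pipeline accepts the `scene`, `depth`, and `detection` flags as constructor arguments, which may not match your implementation.

```python
import time


class StubPipeline:
    """Hypothetical stand-in for the real perception pipeline."""

    def __init__(self, scene=True, depth=True, detection=True):
        self.flags = {'scene': scene, 'depth': depth, 'detection': detection}

    def process(self, path):
        # Placeholder output: one fake object per image when detection is on.
        objects = [{'label': 'object', 'source': path}] if self.flags['detection'] else []
        return {'objects': objects}


def run_ablations(ablations, image_paths, make_pipeline=StubPipeline):
    """Build one pipeline per configuration and time it on the same images."""
    results = []
    for name, flags in ablations.items():
        pipeline = make_pipeline(**flags)
        start = time.perf_counter()
        outputs = [pipeline.process(p) for p in image_paths]
        total = time.perf_counter() - start
        results.append({
            'config': name,
            'avg_time': total / len(image_paths),
            'avg_objects': sum(len(o['objects']) for o in outputs) / len(image_paths),
        })
    return results


ablations = {
    'Full pipeline': {'scene': True, 'depth': True, 'detection': True},
    'Detection only': {'scene': False, 'depth': False, 'detection': True},
}
results = run_ablations(ablations, ['a.jpg', 'b.jpg'])  # placeholder paths
for r in results:
    print(f"{r['config']:<16} {r['avg_time']:.6f}s  {r['avg_objects']:.1f} objects")
```

Swapping `StubPipeline` for your own class keeps every configuration on identical inputs, which is what makes the timings comparable.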
| Metric | How to Measure |
|---|---|
| Detection precision | Manual review: are detected objects correct? |
| Depth quality | Relative ordering: are near objects consistently darker than far ones in the depth map (or brighter, if the model outputs inverse depth)? |
| Feature coherence | PCA visualization: do semantically similar regions cluster? |
| 3D accuracy | Do object centroids make spatial sense? |
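For the 3D accuracy row, a quick sanity check is to back-project a box centre into camera coordinates and confirm that the signs and magnitudes look plausible. The sketch below assumes a pinhole camera and metric depth; the function name `backproject` and the intrinsics `fx`, `fy`, `cx`, `cy` are illustrative values, not taken from your pipeline.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at the given depth to camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


# Assumed intrinsics for a 640x480 image (hypothetical values).
fx = fy = 500.0
cx, cy = 320.0, 240.0

centre = backproject(320, 240, 2.0, fx, fy, cx, cy)   # pixel at the principal point
right = backproject(420, 240, 2.0, fx, fy, cx, cy)    # 100 px to the right, same depth
print(centre, right)
```

An object right of the image centre should get a positive X, and two objects at the same depth should differ only in X and Y. Centroids that violate this usually point to swapped axes or a depth map in the wrong units.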
Answer these 6 questions without looking at notes. Write 3-5 sentences per answer.
Explain how ViT processes an image. Cover: patch creation, embedding, positional encoding, and classification.
Your answer:
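Key points: split the image into fixed-size patches (16×16 in the standard ViT), flatten each patch and linearly project it to an embedding, prepend a learnable [CLS] token, add learned positional embeddings, run the sequence through a transformer encoder, classify from the final [CLS] output.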
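The input side of ViT can be sketched in plain Python on a toy image. This is a minimal illustration, not a full model: it uses a single-channel 8×8 image with 4×4 patches, random weights, and hypothetical helper names (`patchify`, `linear`), and it stops before the transformer encoder.

```python
import random


def patchify(image, patch):
    """Split an H x W grid (list of rows) into flattened patch vectors, row-major."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def linear(vec, weight):
    """Project a vector with a (len(vec) x d_model) weight matrix."""
    return [sum(v * row[j] for v, row in zip(vec, weight))
            for j in range(len(weight[0]))]


random.seed(0)
H = W = 8   # toy image size
P = 4       # toy patch size
D = 6       # toy embedding dimension

image = [[random.random() for _ in range(W)] for _ in range(H)]
patches = patchify(image, P)                      # 4 patches of 16 values each
w_embed = [[random.gauss(0, 0.02) for _ in range(D)] for _ in range(P * P)]
cls_token = [0.0] * D                             # learnable in a real model
pos_embed = [[random.gauss(0, 0.02) for _ in range(D)]
             for _ in range(len(patches) + 1)]    # one position per token, incl. [CLS]

tokens = [cls_token] + [linear(p, w_embed) for p in patches]
tokens = [[t + e for t, e in zip(tok, pos)] for tok, pos in zip(tokens, pos_embed)]

print(len(patches), len(patches[0]), len(tokens), len(tokens[0]))  # 4 16 5 6
print((224 // 16) ** 2 + 1)  # 197 tokens for a 224x224 image with 16x16 patches
```

The encoder would then process `tokens`, and a classification head would read only `tokens[0]`, the [CLS] position.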
How does DINO learn without labels? What emerges in the attention maps?
Your answer:
Key points: student-teacher self-distillation, teacher is an EMA of the student, multi-crop strategy (the teacher sees only the global crops, the student sees all crops, local and global), centering + sharpening of the teacher outputs prevent collapse, attention maps discover object boundaries.
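The three moving parts of DINO (EMA teacher, centering, sharpening) fit in a few lines. The sketch below works on plain lists for a single pair of crops; the temperatures and momentum values are illustrative, the function names are hypothetical, and real training averages the centre over a batch and blocks gradients through the teacher.

```python
import math


def softmax(logits, temperature):
    """Temperature-scaled softmax; a lower temperature gives a sharper distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ema_update(old, new, momentum=0.996):
    """Exponential moving average, used for teacher weights and for the centre."""
    return [momentum * o + (1.0 - momentum) * n for o, n in zip(old, new)]


def dino_loss(student_logits, teacher_logits, center, t_student=0.1, t_teacher=0.04):
    """Cross-entropy between the centred, sharpened teacher and the student."""
    p_teacher = softmax([x - c for x, c in zip(teacher_logits, center)], t_teacher)
    p_student = softmax(student_logits, t_student)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))


teacher_out = [2.0, 1.0, 0.5]   # teacher head output for a global crop
student_out = [1.5, 1.2, 0.3]   # student head output for a local crop
center = [0.0, 0.0, 0.0]

loss = dino_loss(student_out, teacher_out, center)
center = ema_update(center, teacher_out, momentum=0.9)       # track teacher outputs
teacher_weights = ema_update([0.10, -0.20], [0.30, 0.00])    # teacher follows student
print(round(loss, 4), center, teacher_weights)
```

Centering subtracts a running mean so no single output dimension can dominate, while the low teacher temperature pushes the target away from a uniform distribution. The two effects pull in opposite directions, which is what keeps training from collapsing.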
Why can a single camera estimate depth? What cues does the model learn?
Your answer:
Key points: monocular cues — relative size, occlusion, texture gradient, perspective, atmospheric haze. ViT's global attention helps reason about scene-wide cues. Training on diverse datasets with scale-invariant loss.
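To make "scale-invariant loss" concrete, here is a sketch of one well-known variant, the log-space loss from Eigen et al. (2014). It is an assumption that this matches the model you trained; several monocular depth models use a related scale-and-shift-invariant loss instead. The function name is hypothetical.

```python
import math


def scale_invariant_log_loss(pred, target, lam=1.0):
    """Mean squared log error minus a term that cancels a global scale offset.

    With lam=1.0 the loss is unchanged when every prediction is multiplied
    by the same positive constant.
    """
    d = [math.log(p) - math.log(t) for p, t in zip(pred, target)]
    n = len(d)
    return sum(x * x for x in d) / n - lam * (sum(d) ** 2) / (n * n)


target = [1.0, 2.0, 4.0]
scaled = [3.0 * t for t in target]      # right structure, wrong global scale
reversed_order = [4.0, 2.0, 1.0]        # wrong depth ordering

print(scale_invariant_log_loss(scaled, target))
print(scale_invariant_log_loss(reversed_order, target))
```

The first value is zero up to floating-point error, because a single image cannot reveal absolute scale and the loss does not punish the model for it. The second is clearly positive, because the relative ordering is wrong.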
How does TimeSformer adapt ViT for video? What's the computational trick?
Your answer:
Key points: divided space-time attention — temporal attention (same patch across frames) + spatial attention (all patches within a frame). Reduces O((TN)²) to O(T²N + N²T). Separable position embeddings.
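The saving from divided attention is easy to check by counting query-key pairs. The sketch below uses T = 8 frames and N = 196 patches per frame as example values (a 224×224 frame with 16×16 patches); it ignores the [CLS] token and constant factors.

```python
def joint_pairs(T, N):
    """Joint space-time attention: every token attends to every token."""
    return (T * N) ** 2


def divided_pairs(T, N):
    """Divided attention: temporal (N positions x T^2) plus spatial (T frames x N^2)."""
    return N * T * T + T * N * N


T, N = 8, 196
joint = joint_pairs(T, N)
divided = divided_pairs(T, N)
print(joint, divided, round(joint / divided, 2))  # 2458624 319872 7.69
```

The ratio simplifies to TN / (T + N), so the advantage grows as clips get longer or frames get larger.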
How is DETR different from traditional detectors like YOLO? What's the role of Hungarian matching?
Your answer:
Key points: no anchors, no NMS. Learned object queries predict a set of (class, box) pairs. Hungarian matching provides optimal bipartite assignment between predictions and ground truth. End-to-end differentiable.
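The matching step can be illustrated on a tiny example. The sketch below is not the Hungarian algorithm itself: it brute-forces every one-to-one assignment with `itertools.permutations`, which yields the same optimal matching for a handful of boxes. In practice `scipy.optimize.linear_sum_assignment` is the usual choice. The cost here is simplified (negative class probability plus weighted L1 box distance, with no GIoU term), and the weight, the data, and the function names are illustrative.

```python
from itertools import permutations


def match_cost(pred, gt, l1_weight=5.0):
    """Simplified matching cost: reward class confidence, penalise box distance."""
    class_cost = -pred['probs'][gt['label']]
    box_cost = sum(abs(a - b) for a, b in zip(pred['box'], gt['box']))
    return class_cost + l1_weight * box_cost


def optimal_assignment(preds, gts):
    """Brute-force the cheapest one-to-one matching; returns (pred_idx, gt_idx) pairs."""
    best, best_cost = None, float('inf')
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(match_cost(preds[p], gts[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return sorted((p, g) for g, p in enumerate(best))


# Boxes are (cx, cy, w, h) in normalised coordinates; probs are per-class scores.
preds = [
    {'box': (0.5, 0.5, 0.2, 0.2), 'probs': [0.1, 0.9]},
    {'box': (0.1, 0.1, 0.1, 0.1), 'probs': [0.8, 0.2]},
    {'box': (0.9, 0.9, 0.1, 0.1), 'probs': [0.5, 0.5]},
]
gts = [
    {'box': (0.1, 0.1, 0.1, 0.1), 'label': 0},
    {'box': (0.5, 0.5, 0.2, 0.2), 'label': 1},
]

matches = optimal_assignment(preds, gts)
print(matches)  # [(0, 1), (1, 0)]
```

Prediction 2 is left unmatched, so during training it would be supervised toward the "no object" class. Because each ground-truth box is matched to exactly one query, duplicate predictions are penalised, which is why no NMS is needed.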
Describe three ways to connect a vision encoder to an LLM. Which is simplest? Which compresses most?
Your answer:
Key points: (1) MLP projection (LLaVA): simplest, passes all N patch tokens to the LLM. (2) Cross-attention (Flamingo): gated cross-attention layers fed by 64 resampled tokens. (3) Q-Former (BLIP-2): 32 learned queries, the most compression. Modern trend: a simple MLP projector is often enough.
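The compression difference shows up when you count visual tokens. The sketch below assumes a 224×224 input with 14×14 patches purely for illustration (the encoder settings are not specified above); the 64 and 32 token counts are the ones listed in the key points.

```python
def num_patch_tokens(image_size, patch_size):
    """Number of patch tokens a ViT produces for a square image."""
    return (image_size // patch_size) ** 2


n = num_patch_tokens(224, 14)   # assumed encoder settings
visual_tokens = {
    'MLP projection': n,        # every patch token is projected and kept
    'Cross-attention': 64,      # resampled tokens
    'Q-Former': 32,             # learned queries
}
for name, count in visual_tokens.items():
    print(f"{name:<16} {count:>4} tokens  ({n / count:.0f}x compression)")
```

With these assumed settings the MLP route keeps all 256 tokens, while the Q-Former compresses them eightfold. The trade-off is LLM context length against connector complexity.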
| Score | Meaning |
|---|---|
| 6/6 | Ready for Phase V — you understand vision deeply |
| 4-5/6 | Solid — review the weak areas, then proceed |
| 2-3/6 | Revisit Days 45-56 before continuing |
| 0-1/6 | Re-do the week; don't rush |
Write a paragraph connecting Phase IV back to the curriculum thread:
Phase I: Neural network foundations
Phase II: Attention + Transformers (text)
Phase III: LLM training + alignment
Phase IV: Vision transformers + 3D + video + detection ← YOU ARE HERE
Phase V: Vision-Language Models (connecting vision + language)
Phase VI: VLAs (vision + language + actions)
Key insight: The transformer architecture didn't change between Phase II and Phase IV. What changed was the tokenization — words became patches. The same architecture that generates text also classifies images, estimates depth, and detects objects.
Phase V connects everything:
- CLIP (Day 59): Align vision and language in shared space
- BLIP-2/Flamingo (Day 61): Add visual understanding to LLMs
- LLaVA (Day 62): Enable visual conversations
VLMs can see and describe. After Phase V, you'll understand how to make a model that both sees and speaks. Phase VI will add the final piece: acting.
Phase IV is complete. You can make transformers see. Tomorrow starts Phase V: making transformers see AND understand language simultaneously. CLIP is the foundation.