Phase II: Attention, Transformers & Scaling | Week 3 | 2.5 hours

> "The test of understanding is not whether you can follow a derivation, but whether you can produce one from scratch on a blank page."
This is a consolidation day, with no new implementation. You've spent six days (Days 10–15) building from Bahdanau attention to training a full transformer. Today you verify that it's all solid in your head: not just recognized, but retrievable.
Format: Paper and pen. No notes, no code, no peeking. Work through each section cold. Then compare with your materials and identify gaps.
State the problem: Given a decoder state (query) and a set of encoder states (keys/values), how do we compute a context vector?
Bahdanau (additive):
- Write the energy function $e_{ij}$
- Write the attention weight formula $\alpha_{ij}$
- Write the context vector formula $c_i$
- How many learnable parameters? What are they?
Scaled dot-product:
- Write $\text{Attention}(Q, K, V)$ in full
- Why divide by $\sqrt{d_k}$? Derive the variance argument
- What happens to softmax when scores are $\gg 1$?
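Once you have attempted the variance argument on paper, a numerical check can help confirm it. The sketch below assumes query and key components are independent with zero mean and unit variance (the usual assumption behind the argument). The function and variable names are illustrative, not taken from the course materials.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
d_k = 64

# Empirical variance of q.k, before and after dividing by sqrt(d_k).
q = rng.standard_normal((10_000, d_k))
k = rng.standard_normal((10_000, d_k))
raw = (q * k).sum(axis=-1)
raw_var = raw.var()
scaled_var = (raw / np.sqrt(d_k)).var()

# Softmax saturation: the same score pattern at a larger magnitude.
small_scores = np.array([1.0, 2.0, 3.0])
peak_small = softmax(small_scores).max()
peak_large = softmax(20.0 * small_scores).max()

# A tiny attention call to confirm shapes and normalization.
out, w = scaled_dot_product_attention(
    rng.standard_normal((3, d_k)),
    rng.standard_normal((5, d_k)),
    rng.standard_normal((5, 16)),
)
```

Under these assumptions `raw_var` should land near $d_k = 64$ and `scaled_var` near 1, while `peak_large` should be very close to 1, meaning the softmax has become almost one-hot and its gradients have nearly vanished.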
Multi-head:
- Write $\text{MultiHead}(Q, K, V)$ in full
- What is $d_k$ per head if $d_{\text{model}} = 512$ and $h = 8$?
- How many total parameters in the MHA block (with $W_O$)?
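To check the multi-head arithmetic after working it by hand, a small counting sketch follows. It assumes $d_v = d_k = d_{\text{model}} / h$ and treats biases as optional; the helper `mha_param_count` is hypothetical and only counts weights, it does not build a model.

```python
def mha_param_count(d_model, h, bias=False):
    """Count parameters in one multi-head attention block.

    Assumes d_k = d_v = d_model // h (an assumption, stated in the lead-in).
    """
    assert d_model % h == 0, "d_model must be divisible by the number of heads"
    d_k = d_model // h
    # W_Q, W_K, W_V: h heads, each d_model x d_k, so d_model x (h * d_k) in total.
    per_input_projection = d_model * (h * d_k)
    # W_O maps the concatenated heads (h * d_k) back to d_model.
    output_projection = (h * d_k) * d_model
    total = 3 * per_input_projection + output_projection
    if bias:
        total += 4 * d_model  # one bias vector per projection
    return d_k, total

d_k, n_params = mha_param_count(512, 8)
_, n_params_with_bias = mha_param_count(512, 8, bias=True)
```

Note that the total does not depend on $h$ under this assumption: splitting into more heads changes how the projections are partitioned, not how many weights they contain.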
Self-check: If you got stuck on any step, that's a gap. Re-read the relevant day and try again tomorrow.
Include:
- [ ] Encoder block with all sublayers labeled
- [ ] Decoder block with all three sublayers labeled
- [ ] Residual connections (drawn as arrows)
- [ ] LayerNorm placement (Pre-LN style)
- [ ] Positional encoding addition
- [ ] The flow from source tokens → encoder → decoder → output logits
- [ ] Causal mask notation on masked self-attention
- [ ] $N \times$ notation showing the layer stack
Quality test: Could someone who hasn't taken this course understand the architecture from your diagram alone?
Answer each in 2–3 sentences, on paper:
What would break? How does the causal mask prevent it?
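The effect of the causal mask can also be checked numerically after you have written your answer. The sketch below is a single-head self-attention with identity projections (a deliberate simplification, not the full decoder sublayer); it changes only the last token and tests whether the outputs at earlier positions move.

```python
import numpy as np

def self_attention(X, masked=True):
    # X: (T, d). Identity Q/K/V projections are an illustrative simplification.
    T, d = X.shape
    scores = X @ X.T / np.sqrt(d)
    if masked:
        # True above the diagonal marks future positions (j > i).
        future = np.triu(np.ones((T, T), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    # The diagonal is never masked, so each row maximum is finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)  # exp(-inf) = 0 for masked entries
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ X

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
X_changed = X.copy()
X_changed[-1] = rng.standard_normal(8)  # alter only the final token

# Compare outputs at all positions before the altered token.
masked_unchanged = np.allclose(
    self_attention(X)[:-1], self_attention(X_changed)[:-1]
)
unmasked_unchanged = np.allclose(
    self_attention(X, masked=False)[:-1],
    self_attention(X_changed, masked=False)[:-1],
)
```

With the mask, earlier positions are unaffected by the change (`masked_unchanged` is true); without it they shift, which is the information leak from future tokens that the mask exists to block.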
In the decoder, which sublayer uses which? Where do Q, K, V come from in each case?
Attention alone is linear in V (it computes weighted sums). What does the FFN add that attention cannot?
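A quick numerical illustration of the linearity claim, for checking after you have answered on paper. It holds the attention weights fixed (in a real layer they depend nonlinearly on Q and K) and compares against a bias-free two-layer ReLU network standing in for the FFN. The hidden width of $4d$ and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 4, 6

# Fixed attention weights: non-negative rows that sum to 1.
A = rng.random((T, T))
A = A / A.sum(axis=-1, keepdims=True)

V1 = rng.standard_normal((T, d))
V2 = rng.standard_normal((T, d))

# For fixed weights, attention output is a linear map of V.
attention_is_linear = np.allclose(
    A @ (2 * V1 + 3 * V2), 2 * (A @ V1) + 3 * (A @ V2)
)

def ffn(x, W1, W2):
    # Position-wise FFN without biases, so ReLU is the only nonlinearity.
    return np.maximum(0.0, x @ W1) @ W2

W1 = rng.standard_normal((d, 4 * d))
W2 = rng.standard_normal((4 * d, d))

# Additivity fails for the FFN because ReLU zeroes different units per input.
ffn_is_linear = np.allclose(
    ffn(V1 + V2, W1, W2), ffn(V1, W1, W2) + ffn(V2, W1, W2)
)
```

Here `attention_is_linear` comes out true and `ffn_is_linear` false: the weighted sum can only mix value vectors across positions, while the FFN applies a per-position nonlinear transformation.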
Draw the gradient flow path for both. Where does the gradient get modified by LayerNorm in each case?
Both use sinusoidal functions. What's fundamentally different about how they inject position?
Could the model still learn anything useful? What tasks would break? What might still work?
Write a one-page essay (handwritten or typed, ~500 words) answering:
Why does the transformer architecture work so much better than RNNs for sequence modeling?
Your essay should touch on:
- Parallelism: Why can attention process an entire sequence at once?
- Path length: What is the maximum path length between two tokens, and why does it matter for gradient flow and long-range dependencies?
- Expressiveness: How do multi-head attention + FFN combine to create a powerful function approximator?
- The scaling hypothesis: Why does stacking more layers and using more data keep improving performance?
- The compression connection: How does this relate to the Day 5 insight that attention = selective compression?
This is not a literature review — it's a test of your understanding. Use your own words and reasoning.
Re-read your notes from Days 5 and 6 (information theory, compression = prediction). Then answer:
Focus on:
- Section 3.1: Encoder and Decoder Stacks (compare with your diagram)
- Section 3.2: Attention (compare with your derivation)
- Section 3.3: Position-wise FFN (note the dimension choices)
- Section 3.5: Positional Encoding (compare with Day 13)
Mark anything you understand now that you didn't when you first read it.
| Area | Confident | Shaky | Need to re-study |
|---|---|---|---|
| Bahdanau attention derivation | | | |
| Scaled dot-product equation + scaling argument | | | |
| Multi-head attention mechanics | | | |
| Full transformer architecture (draw from memory) | | | |
| Pre-LN vs Post-LN | | | |
| Positional encoding (sinusoidal, RoPE, ALiBi) | | | |
| Training recipe (warmup, label smoothing) | | | |
| Why transformers > RNNs | | | |
Rule: If any row is "Need to re-study," go back to that day before continuing to Day 17. Attention and transformers are the foundation for everything that follows — from GPT to vision transformers to VLAs.
This reflection day exists because understanding is compression. If you truly understand the transformer, you should be able to compress all of Days 10–15 into: (1) a single equation, (2) a single diagram, and (3) a single paragraph explaining why it works. If you can't compress it, you haven't understood it yet. That's what today tests.