Phase II · Days 15–21 · 17.5 hours

Building on the full transformer from Week 2, this week explores training recipes, efficiency variants, and the key architectural innovations that led to modern LLMs.

| Day | Topic | Focus |
|---|---|---|
| 15 | Training a Transformer | Warmup, label smoothing, stability |
| 16 | Stop & Reflect #1 | Consolidation |
| 17 | Efficient Attention | FlashAttention, sparse attention |
| 18 | KV Cache | Autoregressive inference optimization |
| 19 | Normalization + Activations | RMSNorm, SwiGLU |
| 20 | Mixture of Experts | Sparse activation, routing |
| 21 | BERT & Masked LM | Bidirectional encoders |