Week 3: Variants + GPT

Phase II · Days 15–21 · 17.5 hours

Building on the full transformer from Week 2, this week explores training recipes, efficiency variants, and the key architectural innovations that led to modern LLMs.

Daily Lessons

Day | Topic | Focus
15 | Training a Transformer | Warmup, label smoothing, stability
16 | Stop & Reflect #1 | Consolidation
17 | Efficient Attention | Flash Attention, sparse attention
18 | KV Cache | Autoregressive inference optimization
19 | Normalization + Activations | RMSNorm, SwiGLU
20 | Mixture of Experts | Sparse activation, routing
21 | BERT & Masked LM | Bidirectional encoders

Study Notes Reference