Week 3: Variants + GPT

Phase II · Days 15–21 · 17.5 hours

Building on the full transformer from Week 2, this week covers training recipes, efficient attention and inference, and the architectural changes found in modern LLMs: RMSNorm, SwiGLU, and mixture of experts.

Daily Lessons

Day	Topic	Focus
15	Training a Transformer	Warmup, label smoothing, stability
16	Stop & Reflect #1	Consolidation
17	Efficient Attention	FlashAttention, sparse attention
18	KV Cache	Autoregressive inference optimization
19	Normalization + Activations	RMSNorm, SwiGLU
20	Mixture of Experts	Sparse activation, routing
21	BERT & Masked LM	Bidirectional encoders
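
As a small preview of Day 15, the warmup idea can be sketched as a learning-rate schedule that rises linearly and then decays with the inverse square root of the step, as in the original Transformer paper. The function name `noam_lr` and the default values below are illustrative assumptions, not part of the course material.

```python
def noam_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Linear warmup followed by inverse-square-root decay.

    Assumed defaults (d_model=512, warmup_steps=4000) are illustrative only.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    # Before warmup_steps the second term is smaller (linear ramp);
    # after it the first term is smaller (1/sqrt(step) decay).
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


if __name__ == "__main__":
    for s in (1, 1000, 4000, 16000):
        print(s, noam_lr(s))
```

The rate peaks at `step == warmup_steps`; a longer warmup lowers the peak and delays it, which is one of the stability knobs covered on Day 15.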

Study Notes Reference

02 — Attention Mechanism
03 — Transformers