Phase II — Attention, Transformers & Scaling | Week 4 | 2.5 hours "The unreasonable effectiveness of predicting the next token." — Ilya Sutskever
GPT uses only the transformer decoder — no encoder, no cross-attention. Just causal self-attention stacked deep.
Full Transformer (Vaswani 2017):     GPT (Decoder-Only):
┌──────────────┐ ┌──────────────┐    ┌──────────────┐
│   Encoder    │ │   Decoder    │    │   Decoder    │
│              │ │              │    │              │
│  Self-Attn   │ │    Masked    │    │    Causal    │
│   (bidir.)   │→│  Self-Attn   │    │  Self-Attn   │
│              │ │              │    │              │
│     FFN      │ │  Cross-Attn  │    │     FFN      │
│              │ │ (to encoder) │    │              │
│              │ │              │    │  × N layers  │
│              │ │     FFN      │    │              │
│              │ │              │    │              │
│              │ │  × N layers  │    │              │
└──────────────┘ └──────────────┘    └──────────────┘
    ↑ source         ↑ target            ↑ everything
Needs paired data Needs paired data  Just needs text!
   (src → tgt)      (src → tgt)      Next-token prediction
Why decoder-only won:
1. Simplicity: one architecture, one training objective, one data type.
2. Data availability: training needs only raw text, which the web supplies in far larger quantities than paired (source → target) data.
3. Generality: any task can be framed as text completion.
4. Emergent abilities: larger models show capabilities that smaller versions lack.
| | GPT-1 (2018) | GPT-2 (2019) | GPT-3 (2020) | GPT-4 (2023) |
|---|---|---|---|---|
| Parameters | 117M | 1.5B | 175B | ~1.8T (est.) |
| Layers | 12 | 48 | 96 | ~120 (est.) |
| Hidden dim | 768 | 1600 | 12288 | ? |
| Training data | BooksCorpus | WebText (40GB) | CommonCrawl+ (570GB) | ? |
| Key idea | Transfer learning (pretrain + finetune) | Zero-shot via task description | In-context learning (few-shot) | Multimodal + RLHF + MoE (est.) |
| Takeaway | "Fine-tuning works for NLP" | "Language models are unsupervised multitask learners" | "Language models are few-shot learners" | "The architecture works for everything (text, vision, code)" |
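The progression:
1. GPT-1: Proved that transformer pre-training followed by fine-tuning works for NLP.
2. GPT-2: Showed that a larger model trained on more data can perform tasks zero-shot, with no task-specific training.
3. GPT-3: Demonstrated in-context learning, where a few examples in the prompt replace weight updates.
4. GPT-4: Extended the approach to multimodal (image + text) input and reported human-level scores on a range of professional and academic exams.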
$$\mathcal{L} = -\sum_{t=1}^{T} \log P(x_t \mid x_1, x_2, \ldots, x_{t-1}; \theta)$$
Every position predicts the next token, supervised by the actual next token:
Input: [BOS] The cat sat on the
Target: The cat sat on the mat
Position 0: P("The" | [BOS]) ← predict "The"
Position 1: P("cat" | [BOS], The) ← predict "cat"
Position 2: P("sat" | [BOS], The, cat) ← predict "sat"
...
Every token is BOTH a prediction target AND context for later tokens.
Training gets a dense signal: one forward pass over T tokens yields T prediction losses, because every position contributes to the loss.
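The objective above can be checked numerically. The following is a minimal sketch, assuming PyTorch; the toy token ids, the vocabulary size of 5, and the random logits standing in for a model's output are all made up for illustration. It shifts the sequence by one position to form inputs and targets, then compares the explicit sum of negative log-probabilities with the built-in cross-entropy.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical toy sequence of token ids; id 0 plays the role of [BOS].
tokens = torch.tensor([0, 3, 1, 4, 2, 3])
inputs, targets = tokens[:-1], tokens[1:]   # shift by one position

vocab_size = 5
# Stand-in for model output: one row of logits per input position.
logits = torch.randn(len(inputs), vocab_size)

# Explicit form of the objective: L = -sum_t log P(x_t | x_<t)
log_probs = F.log_softmax(logits, dim=-1)
manual_loss = -log_probs[torch.arange(len(targets)), targets].sum()

# The same quantity from the built-in, summed rather than averaged.
builtin_loss = F.cross_entropy(logits, targets, reduction="sum")

print(manual_loss.item(), builtin_loss.item())
```

Note that `F.cross_entropy` averages over positions by default, so a loss computed with the default settings is this sum divided by the number of predicted tokens.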
Input tokens → Token Embedding + Position Embedding
           ↓
┌─────────────────────┐
│ LayerNorm           │ ← Pre-LN
│ Causal Self-Attn    │
│ + Residual          │
│                     │
│ LayerNorm           │ ← Pre-LN
│ FFN (GELU)          │
│ + Residual          │
└─────────────────────┘
      × N layers
           ↓
   LayerNorm (final)
           ↓
Linear → logits (vocab_size)
           ↓
softmax → P(next token)
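GPT-2 specifics:
- Pre-LN (not Post-LN like the original transformer)
- GELU activation (not ReLU)
- Learned positional embeddings (not sinusoidal)
- Biases: GPT-2 itself keeps bias terms in its linear layers; nanoGPT makes them optional, and the code below defaults to bias=False, as in later models such as LLaMA
- Weight tying: the token embedding matrix is reused (transposed) as the output projection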
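Weight tying is easier to see with shapes in hand. Below is a minimal sketch, assuming PyTorch; the sizes (65 and 384) and the `tied` container are illustrative choices, not part of GPT-2. It shows that an `nn.Embedding` and a bias-free `nn.Linear` store weights of the same shape, so one tensor can serve both roles.

```python
import torch
import torch.nn as nn

vocab_size, n_embd = 65, 384   # illustrative sizes

wte = nn.Embedding(vocab_size, n_embd)                # weight: (vocab_size, n_embd)
lm_head = nn.Linear(n_embd, vocab_size, bias=False)   # weight: (vocab_size, n_embd)

# nn.Linear stores its weight as (out_features, in_features), so the two
# tensors already have the same shape and can share storage directly.
wte.weight = lm_head.weight

# Hypothetical container, used only to count unique parameters.
tied = nn.ModuleDict({"wte": wte, "lm_head": lm_head})
n_unique = sum(p.numel() for p in tied.parameters())  # shared tensor counted once

print(wte.weight is lm_head.weight, n_unique)
```

The "transposed" in the bullet refers to how the matrix is used: the embedding looks up rows of the matrix, while the linear layer multiplies hidden states by its transpose to produce one logit per vocabulary entry.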
On Day 14, you built a full encoder-decoder transformer for translation. nanoGPT is different:
| Day 14 Transformer | nanoGPT |
|---|---|
| Encoder + Decoder | Decoder only |
| Cross-attention | No cross-attention |
| Sinusoidal positions | Learned positions |
| Post-LN | Pre-LN |
| ReLU FFN | GELU FFN |
| Translation task | Language modeling |
| Small vocab (~10K) | Large vocab (50K+) |
| Short sequences (~50) | Medium sequences (1024) |
| Trained from scratch on toy data | Can reproduce GPT-2 |
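The Post-LN versus Pre-LN row comes down to where normalization sits relative to the residual add. Here is a minimal sketch, assuming PyTorch; the class names `PostLNResidual` and `PreLNResidual` and the zeroed stand-in sublayer are hypothetical, chosen only to make the difference visible.

```python
import torch
import torch.nn as nn

class PostLNResidual(nn.Module):
    """Original-transformer ordering: normalize AFTER the residual add."""
    def __init__(self, dim, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

class PreLNResidual(nn.Module):
    """GPT-2 ordering: normalize BEFORE the sublayer, add afterwards."""
    def __init__(self, dim, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(self.norm(x))

def zero_sublayer(dim):
    """Stand-in sublayer that outputs zeros, to isolate the residual path."""
    layer = nn.Linear(dim, dim)
    nn.init.zeros_(layer.weight)
    nn.init.zeros_(layer.bias)
    return layer

torch.manual_seed(0)
dim = 8
x = 3.0 * torch.randn(2, 4, dim) + 1.0   # deliberately not zero-mean, unit-variance

pre_out = PreLNResidual(dim, zero_sublayer(dim))(x)
post_out = PostLNResidual(dim, zero_sublayer(dim))(x)
print(torch.allclose(pre_out, x), torch.allclose(post_out, x))
```

With the sublayer contributing nothing, the Pre-LN block passes `x` through unchanged, while the Post-LN block still rescales it. That untouched residual path is commonly cited as a reason Pre-LN stacks are easier to train at depth.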
The core of nanoGPT is ~300 lines. Here's the annotated architecture:
"""
nanoGPT — annotated core architecture.
Original: https://github.com/karpathy/nanoGPT
"""
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
class CausalSelfAttention(nn.Module):
    """Multi-head causal self-attention."""

    def __init__(self, n_embd, n_head, block_size, dropout=0.0, bias=False):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.n_embd = n_embd
        self.dropout = dropout
        # Key, Query, Value projections, all in one matrix for efficiency
        self.c_attn = nn.Linear(n_embd, 3 * n_embd, bias=bias)
        # Output projection
        self.c_proj = nn.Linear(n_embd, n_embd, bias=bias)
        # Unused by the fused path below, which takes dropout_p directly
        self.attn_dropout = nn.Dropout(dropout)
        self.resid_dropout = nn.Dropout(dropout)
        # Causal mask, registered as a buffer (not a parameter).
        # Only a manual attention path would need it: the fused SDPA call
        # in forward() builds its own mask from is_causal=True.
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(block_size, block_size))
            .view(1, 1, block_size, block_size)
        )

    def forward(self, x):
        B, T, C = x.size()  # batch, sequence length, embedding dim
        # Calculate Q, K, V in one shot
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        # Reshape: (B, T, C) → (B, n_head, T, head_dim)
        head_dim = C // self.n_head
        q = q.view(B, T, self.n_head, head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, head_dim).transpose(1, 2)
        # Attention output: (B, n_head, T, head_dim)
        # PyTorch's SDPA (PyTorch >= 2.0) auto-selects Flash Attention when possible
        att = F.scaled_dot_product_attention(
            q, k, v,
            attn_mask=None,
            dropout_p=self.dropout if self.training else 0.0,
            is_causal=True,  # ← this enables the causal mask!
        )
        # Re-assemble heads: (B, n_head, T, head_dim) → (B, T, C)
        att = att.transpose(1, 2).contiguous().view(B, T, C)
        # Output projection
        return self.resid_dropout(self.c_proj(att))

class MLP(nn.Module):
    """Feed-forward network with GELU activation."""

    def __init__(self, n_embd, dropout=0.0, bias=False):
        super().__init__()
        self.c_fc = nn.Linear(n_embd, 4 * n_embd, bias=bias)
        self.gelu = nn.GELU()
        self.c_proj = nn.Linear(4 * n_embd, n_embd, bias=bias)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        return self.dropout(x)

class Block(nn.Module):
    """Transformer block: Pre-LN attention + FFN with residuals."""

    def __init__(self, n_embd, n_head, block_size, dropout=0.0, bias=False):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head, block_size, dropout, bias)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = MLP(n_embd, dropout, bias)

    def forward(self, x):
        # Pre-LN: normalize BEFORE sublayer
        x = x + self.attn(self.ln_1(x))  # residual + attention
        x = x + self.mlp(self.ln_2(x))   # residual + FFN
        return x

class GPT(nn.Module):
    """The full GPT language model."""

    def __init__(self, vocab_size, block_size=1024, n_layer=12,
                 n_head=12, n_embd=768, dropout=0.0, bias=False):
        super().__init__()
        self.block_size = block_size
        self.transformer = nn.ModuleDict(dict(
            wte=nn.Embedding(vocab_size, n_embd),   # token embeddings
            wpe=nn.Embedding(block_size, n_embd),   # position embeddings
            drop=nn.Dropout(dropout),
            h=nn.ModuleList([
                Block(n_embd, n_head, block_size, dropout, bias)
                for _ in range(n_layer)
            ]),
            ln_f=nn.LayerNorm(n_embd),              # final layer norm
        ))
        # Language model head: shares weights with the token embedding
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)
        self.transformer.wte.weight = self.lm_head.weight  # weight tying
        # Initialize weights
        self.apply(self._init_weights)
        # Special scaled init for residual projections (GPT-2 recipe)
        for pn, p in self.named_parameters():
            if pn.endswith('c_proj.weight'):
                torch.nn.init.normal_(p, mean=0.0, std=0.02 / math.sqrt(2 * n_layer))
        n_params = sum(p.numel() for p in self.parameters())
        print(f"GPT model: {n_params/1e6:.1f}M parameters")

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        """
        idx: (B, T) token indices
        targets: (B, T) target token indices (shifted by 1)
        """
        B, T = idx.size()
        assert T <= self.block_size, f"Sequence {T} > block_size {self.block_size}"
        # Token + position embeddings
        pos = torch.arange(0, T, dtype=torch.long, device=idx.device)
        tok_emb = self.transformer.wte(idx)  # (B, T, n_embd)
        pos_emb = self.transformer.wpe(pos)  # (T, n_embd)
        x = self.transformer.drop(tok_emb + pos_emb)
        # Transformer blocks
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)
        if targets is not None:
            # Training: compute loss
            logits = self.lm_head(x)  # (B, T, vocab_size)
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                targets.view(-1),
                ignore_index=-1,
            )
        else:
            # Inference: only compute last position
            logits = self.lm_head(x[:, [-1], :])  # (B, 1, vocab_size)
            loss = None
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """Autoregressive generation."""
        for _ in range(max_new_tokens):
            # Crop context to block_size
            idx_cond = idx if idx.size(1) <= self.block_size else idx[:, -self.block_size:]
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :] / temperature
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = float('-inf')
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx

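The attention class in the listing above relies on `is_causal=True` to do the masking. As a sanity check, here is a minimal sketch, assuming PyTorch 2.0 or later and made-up tensor sizes, that computes causal attention by hand with an explicit lower-triangular mask and compares the result with the fused call.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, n_head, T, head_dim = 2, 3, 5, 4   # illustrative sizes
q = torch.randn(B, n_head, T, head_dim)
k = torch.randn(B, n_head, T, head_dim)
v = torch.randn(B, n_head, T, head_dim)

# Manual path: scaled scores, hide future positions, softmax, weight V.
scores = (q @ k.transpose(-2, -1)) / math.sqrt(head_dim)    # (B, n_head, T, T)
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))     # True = may attend
scores = scores.masked_fill(~causal, float("-inf"))
weights = F.softmax(scores, dim=-1)                          # rows sum to 1
manual_out = weights @ v                                     # (B, n_head, T, head_dim)

# Fused path, as used in CausalSelfAttention.forward.
fused_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(torch.allclose(manual_out, fused_out, atol=1e-5))
```

The manual path materializes the full (T, T) attention matrix, which is what the fused kernels are designed to avoid; it is shown here only to make the masking explicit.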
import os
import requests

# Download Shakespeare dataset
data_url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
data_path = "shakespeare.txt"
if not os.path.exists(data_path):
    response = requests.get(data_url, timeout=30)
    response.raise_for_status()  # fail loudly instead of saving an error page
    text = response.text
    with open(data_path, 'w', encoding='utf-8') as f:
        f.write(text)
else:
    with open(data_path, 'r', encoding='utf-8') as f:
        text = f.read()
print(f"Dataset: {len(text)} characters")
# Character-level tokenizer (simple for this exercise)
chars = sorted(list(set(text)))
vocab_size = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])
print(f"Vocab size: {vocab_size}")
print(f"Chars: {''.join(chars[:50])}...")
# Train/val split
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]
def get_batch(split, batch_size=64, block_size=256):
    data_split = train_data if split == 'train' else val_data
    ix = torch.randint(len(data_split) - block_size, (batch_size,))
    x = torch.stack([data_split[i:i+block_size] for i in ix])
    y = torch.stack([data_split[i+1:i+block_size+1] for i in ix])
    return x, y

# Create a small GPT
model = GPT(
    vocab_size=vocab_size,
    block_size=256,
    n_layer=6,
    n_head=6,
    n_embd=384,
    dropout=0.2,
)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
# Training loop
for step in range(5000):
    xb, yb = get_batch('train')
    xb, yb = xb.to(device), yb.to(device)
    logits, loss = model(xb, yb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 500 == 0:
        # Single-batch validation estimate; no_grad avoids building a graph
        model.eval()
        with torch.no_grad():
            val_x, val_y = get_batch('val')
            val_x, val_y = val_x.to(device), val_y.to(device)
            _, val_loss = model(val_x, val_y)
        model.train()
        print(f"Step {step}: train_loss={loss.item():.4f}, val_loss={val_loss.item():.4f}")
# Generate!
model.eval()
context = torch.zeros((1, 1), dtype=torch.long, device=device)
generated = model.generate(context, max_new_tokens=500, temperature=0.8, top_k=50)
print("\n--- Generated Shakespeare ---")
print(decode(generated[0].tolist()))
List EVERY difference between your Day 14 transformer and this nanoGPT:
| Feature | Day 14 Transformer | nanoGPT |
|---|---|---|
| Architecture type | ? | ? |
| Normalization | ? | ? |
| Activation | ? | ? |
| Positional encoding | ? | ? |
| Weight tying | ? | ? |
| Cross-attention | ? | ? |
| Initialization | ? | ? |
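For the Shakespeare model (6 layers, 6 heads, 384 dim, block_size=256, vocab=65):
Calculate by hand:
1. Token embedding: vocab_size × n_embd = ?
2. Position embedding: block_size × n_embd = ?
3. Per-block attention: 4 × n_embd² = ? (Q, K, V, Out projections; no biases, since bias=False)
4. Per-block FFN: 2 × n_embd × 4*n_embd = ?
5. Per-block LayerNorm: 2 × 2 × n_embd = ? (two LayerNorms, each with a weight and a bias)
6. Final LayerNorm: 2 × n_embd = ?
7. Output head: 0 extra parameters (its weight is tied to the token embedding)
8. Total = (1) + (2) + n_layer × [(3) + (4) + (5)] + (6) = ?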
Compare with sum(p.numel() for p in model.parameters()). Do they match?
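To check a hand calculation, the same bookkeeping can be written as a small function. This is a sketch under the assumptions of the GPT class in this lesson (no biases in the linear layers, `nn.LayerNorm` with its default weight and bias, output head tied to the token embedding); the function name `count_gpt_params` is hypothetical. It is applied to a GPT-2-small-sized configuration instead of the Shakespeare model, so the exercise itself is left to you.

```python
def count_gpt_params(vocab_size, block_size, n_layer, n_embd):
    """Parameter count for the lesson's GPT class (bias=False, tied head)."""
    tok_emb = vocab_size * n_embd
    pos_emb = block_size * n_embd
    attn = 4 * n_embd * n_embd        # Q, K, V and output projections
    ffn = 2 * n_embd * (4 * n_embd)   # expand to 4x, then project back
    layer_norms = 2 * (2 * n_embd)    # two LayerNorms, each weight + bias
    per_block = attn + ffn + layer_norms
    final_ln = 2 * n_embd
    # lm_head adds nothing: it shares its weight with the token embedding.
    return tok_emb + pos_emb + n_layer * per_block + final_ln

gpt2_small = count_gpt_params(vocab_size=50257, block_size=1024,
                              n_layer=12, n_embd=768)
print(f"{gpt2_small:,}")   # 124,356,864
```

This lands slightly below the roughly 124M usually quoted for GPT-2 small because the released model also carries bias terms in its linear layers.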
Generate text at steps 0, 500, 1000, 2000, and 5000 (the end of training, since the loop above runs steps 0 to 4999):
checkpoints = [0, 500, 1000, 2000, 5000]
# At each checkpoint, generate 200 characters and save
# Observe: random → character patterns → word patterns → Shakespeare-like
# Also: try temperature = [0.1, 0.5, 0.8, 1.0, 1.5, 2.0]
# What happens at each extreme?
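Before running the temperature sweep on the trained model, it can help to see what temperature does to a fixed set of scores. The following is a minimal standard-library sketch; the four logits are hypothetical and the helper names `softmax_with_temperature` and `entropy` are introduced here for illustration. It divides the logits by the temperature before the softmax, as `generate` does, and reports how peaked the resulting distribution is.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, computed stably."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.5, 0.0]   # hypothetical scores for four tokens
results = []
for t in [0.1, 0.5, 0.8, 1.0, 1.5, 2.0]:
    probs = softmax_with_temperature(logits, t)
    results.append((t, max(probs), entropy(probs)))
    print(f"T={t}: top prob={max(probs):.3f}, entropy={entropy(probs):.3f}")
```

Low temperatures concentrate almost all probability on the highest-scoring token, which makes sampling close to greedy decoding; high temperatures flatten the distribution, so unlikely tokens are sampled more often.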
Today you built the architecture that dominates modern AI. GPT's decoder-only design powers ChatGPT, Claude, and LLaMA, and it also serves as the language backbone of VLA models. Tomorrow, you'll use this codebase as an ablation laboratory, systematically varying architecture choices (heads, layers, dim, activation, normalization) to understand what matters and why. This is how real ML research works.