Phase I · Week 1 · Day 3 of 112 · 2.5 hours
"The vanishing gradient problem in RNNs is not a bug — it's the fundamental reason attention had to be invented. Understanding it deeply is understanding why transformers exist."
| Previous | ← Day 2: CNN & ResNets |
| Next | Day 4: Seq2Seq & The Bottleneck → |
| Week | Week 1: DL Foundations |
| Phase | Phase I: DL Foundations |
| Curriculum | Full Curriculum |
CNNs process spatial data. But language, robot action sequences, and sensor streams are sequential — each element depends on what came before. RNNs were the first architectures to model sequences by maintaining a hidden state that accumulates information over time.
But RNNs have a fatal flaw: they can't remember long-range dependencies. The gradients literally vanish or explode when backpropagating through time. LSTMs partially fix this with gating mechanisms, but the fundamental problem — compressing an entire sequence into a fixed-size hidden state — remains. Tomorrow (Day 4), you'll see how seq2seq makes this bottleneck explicit. And in Week 2 (Day 10), attention will break through it entirely.
The thread from RNN → LSTM → attention → transformer → VLA is direct and unbroken. You must understand each step to see why the next was necessary.
A vanilla RNN processes a sequence $x_1, x_2, \ldots, x_T$ one step at a time:
$$h_t = \tanh(W_{hh} \cdot h_{t-1} + W_{xh} \cdot x_t + b_h)$$ $$y_t = W_{hy} \cdot h_t + b_y$$
where: - $h_t \in \mathbb{R}^d$ is the hidden state at time $t$ - $W_{hh} \in \mathbb{R}^{d \times d}$ is the recurrent weight matrix - $W_{xh} \in \mathbb{R}^{d \times n}$ maps input to hidden - $W_{hy} \in \mathbb{R}^{m \times d}$ maps hidden to output
x₁ x₂ x₃ x₄
│ │ │ │
▼ ▼ ▼ ▼
┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐
│ │─▶│ │─▶│ │─▶│ │
│ h₁ │ │ h₂ │ │ h₃ │ │ h₄ │
│ │ │ │ │ │ │ │
└──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘
│ │ │ │
▼ ▼ ▼ ▼
y₁ y₂ y₃ y₄
Key insight: The SAME weights (W_hh, W_xh) are reused at every time step.
This is weight sharing across time, analogous to CNNs sharing across space.
To compute gradients, we "unroll" the RNN across time and apply standard backprop. The loss at time $T$ depends on all previous hidden states:
$$\frac{\partial L_T}{\partial W_{hh}} = \sum_{t=1}^{T} \frac{\partial L_T}{\partial h_T} \cdot \frac{\partial h_T}{\partial h_t} \cdot \frac{\partial h_t}{\partial W_{hh}}$$
The critical term is $\frac{\partial h_T}{\partial h_t}$, which chains through $T - t$ time steps:
$$\frac{\partial h_T}{\partial h_t} = \prod_{k=t}^{T-1} \frac{\partial h_{k+1}}{\partial h_k} = \prod_{k=t}^{T-1} W_{hh}^T \cdot \text{diag}(\tanh'(z_k))$$
where $z_k = W_{hh} h_{k-1} + W_{xh} x_k + b_h$ and $\tanh'(\cdot) \in (0, 1]$.
Each factor in the product has norm bounded by:
$$\left\|\frac{\partial h_{k+1}}{\partial h_k}\right\| \leq \|W_{hh}\| \cdot \|\text{diag}(\tanh'(z_k))\| \leq \sigma_{\max}(W_{hh}) \cdot 1$$
where $\sigma_{\max}$ is the largest singular value. Over $T - t$ steps:
$$\left\|\frac{\partial h_T}{\partial h_t}\right\| \leq \sigma_{\max}(W_{hh})^{T-t}$$
This gives us three regimes:
| $\sigma_{\max}(W_{hh})$ | Behavior | Gradient | Result |
|---|---|---|---|
| $< 1$ | Vanishing | $\to 0$ exponentially | Can't learn long-range dependencies |
| $= 1$ | Stable | Maintained | Ideal but fragile |
| $> 1$ | Exploding | $\to \infty$ exponentially | Training diverges |
When $\frac{\partial L}{\partial h_t} \approx 0$ for small $t$: - The network cannot learn from early tokens in long sequences - Context from 50+ steps ago is effectively invisible - The model develops a recency bias — only the last few tokens matter
Gradient clipping fixes exploding gradients:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
But there is no simple fix for vanishing gradients. You need a fundamentally different architecture. Enter LSTM.
LSTM adds a cell state $c_t$ — a separate memory path that can carry information across many time steps without being squashed through nonlinearities. Three gates control what to forget, what to write, and what to read.
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \quad \text{(forget gate)}$$ $$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \quad \text{(input gate)}$$ $$\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \quad \text{(candidate)}$$ $$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \quad \text{(cell state update)}$$ $$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \quad \text{(output gate)}$$ $$h_t = o_t \odot \tanh(c_t) \quad \text{(hidden state)}$$
where $\sigma$ is sigmoid, $\odot$ is element-wise multiplication, and $[\cdot, \cdot]$ is concatenation.
Forget Gate (f_t)
"What to erase from memory"
│
c_{t-1} ──────(×)────┤
│ Input Gate (i_t)
│ "What new info to write"
│ │
(+)────────(×)──── c̃_t (candidate)
│
c_t ─────────────────────────▶ c_{t+1}
│
tanh
│
(×)──── Output Gate (o_t)
│ "What to expose as output"
│
h_t ─────────────────────────▶ h_{t+1}
| Gate | Value | Meaning |
|---|---|---|
| Forget $f_t \approx 1$ | Keep | "This memory is still relevant" |
| Forget $f_t \approx 0$ | Erase | "This memory is no longer needed" |
| Input $i_t \approx 1$ | Write | "Store this new information" |
| Input $i_t \approx 0$ | Ignore | "Nothing important at this step" |
| Output $o_t \approx 1$ | Read | "This memory is relevant to output now" |
| Output $o_t \approx 0$ | Hide | "Memory exists but isn't needed for output" |
The gradient through the cell state:
$$\frac{\partial c_t}{\partial c_{t-1}} = f_t$$
If the forget gate learns $f_t \approx 1$, the gradient flows unchanged through the cell state — no squashing, no matrix multiplication, just element-wise scaling. This is the cell state's role: a gradient highway (same principle as ResNet skip connections!).
Over $T - t$ steps:
$$\frac{\partial c_T}{\partial c_t} = \prod_{k=t+1}^{T} f_k$$
If forget gates stay near 1, gradients persist. The network learns which information to maintain through the gate values.
The Gated Recurrent Unit (Cho et al., 2014) merges forget and input gates into a single "update gate" and removes the cell state:
$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t]) \quad \text{(update gate)}$$ $$r_t = \sigma(W_r \cdot [h_{t-1}, x_t]) \quad \text{(reset gate)}$$ $$\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t])$$ $$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
GRU has fewer parameters than LSTM and often performs comparably. In practice, LSTM and GRU perform similarly — the choice is empirical.
import torch
import torch.nn as nn
import torch.nn.functional as F
class VanillaRNNCell(nn.Module):
"""RNN cell implemented from scratch (no nn.RNN)."""
def __init__(self, input_size, hidden_size):
super().__init__()
self.hidden_size = hidden_size
# Combined weight matrices for efficiency
self.W_xh = nn.Linear(input_size, hidden_size, bias=False)
self.W_hh = nn.Linear(hidden_size, hidden_size, bias=True)
def forward(self, x_t, h_prev):
"""
x_t: (batch, input_size) — input at time t
h_prev: (batch, hidden_size) — previous hidden state
Returns: h_t (batch, hidden_size)
"""
h_t = torch.tanh(self.W_xh(x_t) + self.W_hh(h_prev))
return h_t
class LSTMCell(nn.Module):
"""LSTM cell implemented from scratch (no nn.LSTM)."""
def __init__(self, input_size, hidden_size):
super().__init__()
self.hidden_size = hidden_size
# All four gates in one matrix for efficiency
# Order: input, forget, candidate, output
self.W_x = nn.Linear(input_size, 4 * hidden_size, bias=False)
self.W_h = nn.Linear(hidden_size, 4 * hidden_size, bias=True)
def forward(self, x_t, h_prev, c_prev):
"""
x_t: (batch, input_size)
h_prev: (batch, hidden_size)
c_prev: (batch, hidden_size)
Returns: (h_t, c_t)
"""
# Compute all four gates simultaneously
gates = self.W_x(x_t) + self.W_h(h_prev) # (batch, 4*hidden)
# Split into four gates
i, f, g, o = gates.chunk(4, dim=1)
i = torch.sigmoid(i) # input gate
f = torch.sigmoid(f) # forget gate
g = torch.tanh(g) # candidate
o = torch.sigmoid(o) # output gate
# Cell state update
c_t = f * c_prev + i * g
# Hidden state
h_t = o * torch.tanh(c_t)
return h_t, c_t
class CharRNN(nn.Module):
"""Character-level language model using our custom LSTM."""
def __init__(self, vocab_size, embed_size, hidden_size, num_layers=1):
super().__init__()
self.hidden_size = hidden_size
self.num_layers = num_layers
self.embed = nn.Embedding(vocab_size, embed_size)
self.cells = nn.ModuleList([
LSTMCell(embed_size if i == 0 else hidden_size, hidden_size)
for i in range(num_layers)
])
self.fc = nn.Linear(hidden_size, vocab_size)
def forward(self, x, states=None):
"""
x: (batch, seq_len) — character indices
Returns: logits (batch, seq_len, vocab_size)
"""
batch_size, seq_len = x.shape
if states is None:
states = [(torch.zeros(batch_size, self.hidden_size, device=x.device),
torch.zeros(batch_size, self.hidden_size, device=x.device))
for _ in range(self.num_layers)]
embeds = self.embed(x) # (batch, seq_len, embed_size)
outputs = []
for t in range(seq_len):
inp = embeds[:, t, :] # (batch, embed_size)
new_states = []
for i, cell in enumerate(self.cells):
h_prev, c_prev = states[i]
h_t, c_t = cell(inp, h_prev, c_prev)
new_states.append((h_t, c_t))
inp = h_t # Feed to next layer
states = new_states
outputs.append(h_t)
outputs = torch.stack(outputs, dim=1) # (batch, seq_len, hidden)
logits = self.fc(outputs) # (batch, seq_len, vocab)
return logits, states
# --- Data preparation (tiny Shakespeare) ---
def load_shakespeare():
"""Load a small text corpus. Replace with actual file path."""
text = """
First Citizen:
Before we proceed any further, hear me speak.
All:
Speak, speak.
First Citizen:
You are all resolved rather to die than to famish?
All:
Resolved. resolved.
"""
chars = sorted(list(set(text)))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}
data = torch.tensor([stoi[ch] for ch in text], dtype=torch.long)
return data, stoi, itos, len(chars)
def train_char_rnn(model, data, epochs=500, seq_len=64, lr=0.003):
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()
for epoch in range(epochs):
# Random starting position
start = torch.randint(0, len(data) - seq_len - 1, (1,)).item()
x = data[start:start + seq_len].unsqueeze(0) # (1, seq_len)
y = data[start + 1:start + seq_len + 1].unsqueeze(0) # (1, seq_len)
logits, _ = model(x)
loss = criterion(logits.view(-1, logits.size(-1)), y.view(-1))
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
if (epoch + 1) % 50 == 0:
print(f"Epoch {epoch+1:4d} | Loss: {loss.item():.4f}")
data, stoi, itos, vocab_size = load_shakespeare()
model = CharRNN(vocab_size, embed_size=32, hidden_size=128, num_layers=2)
train_char_rnn(model, data)
This is the most important experiment of the day. Measure how gradients vanish:
def gradient_norm_experiment(hidden_size=64, seq_lengths=[10, 50, 100, 200]):
"""Measure |∂L/∂h₁| vs |∂L/∂hₜ| for increasing sequence lengths."""
results = {}
for T in seq_lengths:
rnn = VanillaRNNCell(hidden_size, hidden_size)
x = torch.randn(1, T, hidden_size)
h = torch.zeros(1, hidden_size, requires_grad=True)
# Forward pass — save all hidden states
hiddens = [h]
for t in range(T):
h = rnn(x[:, t, :], h)
hiddens.append(h)
# Loss on final hidden state
loss = hiddens[-1].sum()
loss.backward()
# Measure gradient norm at first hidden state
grad_norm = hiddens[0].grad.norm().item() if hiddens[0].grad is not None else 0
results[T] = grad_norm
print(f"T={T:3d} | |∂L/∂h₁| = {grad_norm:.6f}")
return results
# Run and plot
results = gradient_norm_experiment()
Expected: Gradient norms decrease exponentially with sequence length. At $T = 200$, the gradient at $h_1$ is essentially zero.
Repeat Exercise 1 with LSTM. Compare the gradient norms. Does LSTM maintain gradients over longer sequences?
LSTM forget gates are often initialized with a bias of 1.0 (Jozefowicz et al., 2015) so that $f_t \approx \sigma(1) \approx 0.73$ initially — biasing toward remembering. Implement this:
# After creating the LSTM cell:
for name, param in lstm_cell.named_parameters():
if 'bias' in name:
n = param.size(0) // 4
param.data[n:2*n].fill_(1.0) # forget gate bias = 1
Train with and without this initialization. Does it help?
After training the CharRNN, implement autoregressive generation:
@torch.no_grad()
def generate(model, itos, stoi, seed_text="First", max_len=200, temperature=0.8):
model.eval()
chars = list(seed_text)
x = torch.tensor([[stoi[ch] for ch in chars]])
states = None
# Process seed
logits, states = model(x, states)
# Generate
for _ in range(max_len):
last_logits = logits[0, -1, :] / temperature
probs = F.softmax(last_logits, dim=0)
next_char_idx = torch.multinomial(probs, 1).item()
chars.append(itos[next_char_idx])
x = torch.tensor([[next_char_idx]])
logits, states = model(x, states)
return ''.join(chars)
What happens with $\text{temperature} \to 0$ vs $\text{temperature} \to \infty$?
The vanishing gradient problem is fundamentally a compression failure. The hidden state $h_t$ tries to compress the entire history $(x_1, \ldots, x_t)$ into a fixed-size vector. As the sequence grows, information from early tokens gets overwritten — the compression becomes too lossy.
LSTM improves this by making the compression selective (gates choose what to keep), but it's still a fixed-capacity bottleneck. The real breakthrough comes when we abandon the idea of a single compressed state and instead allow the model to dynamically attend to any part of the history. That's attention, and it arrives in Day 10.
The deep lesson: a good compressor must be adaptive. Fixed-rate compression (RNN) loses information. Variable-rate, content-dependent compression (attention) preserves what matters. This principle will echo through the entire curriculum — from LSTM gates to attention weights to VLA action tokens.