Day 2: CNNs & ResNets

Phase I · Week 1 · Day 2 of 112 · 2.5 hours

"The residual connection is the single most important architectural idea in deep learning. Without it, nothing deeper than ~20 layers would train. Transformers, ViTs, VLAs — all use it."



Why This Matters

Yesterday you built the gradient engine. Today you'll see why naively stacking layers breaks it. CNNs introduced feature hierarchies — the idea that early layers detect edges, middle layers detect parts, and deep layers detect objects. But going deeper than ~20 layers caused degradation — deeper networks performed worse than shallower ones, even on training data.

ResNets solved this with a shockingly simple idea: skip connections. This one trick enabled networks of 100+ layers and became the DNA of nearly every modern architecture. When you study Vision Transformers (Day 45), you'll see the same residual connections. When you study VLAs (Day 92+), the Transformer backbones they are built on use them in every block. Residuals are non-negotiable.

1. Theory: Convolutional Neural Networks

1.1 Convolution as Learnable Feature Extraction

A convolution slides a learnable kernel (filter) across the input, computing dot products:

$$\text{output}(i, j) = \sum_{m}\sum_{n} \text{input}(i+m, j+n) \cdot \text{kernel}(m, n) + \text{bias}$$

Input (5×5)            Kernel (3×3)        Output (3×3)
┌───────────────┐      ┌─────────┐         ┌─────────┐
│ 1  0  1  0  1 │      │ 1  0  1 │         │ ?  ?  ? │
│ 0  1  0  1  0 │      │ 0  1  0 │  ──▶    │ ?  ?  ? │
│ 1  0  1  0  1 │      │ 1  0  1 │         │ ?  ?  ? │
│ 0  1  0  1  0 │      └─────────┘         └─────────┘
│ 1  0  1  0  1 │
└───────────────┘
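The `?` entries in the diagram can be filled in by applying the formula directly. The sketch below does this in plain Python, assuming stride 1, no padding, and zero bias; the helper name `conv2d_valid` is illustrative and not part of any library.

```python
def conv2d_valid(image, kernel, bias=0):
    """Slide `kernel` over `image` (stride 1, no padding) and sum the products."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = bias
            # Dot product between the kernel and the patch whose top-left corner is (i, j)
            for m in range(k_h):
                for n in range(k_w):
                    total += image[i + m][j + n] * kernel[m][n]
            row.append(total)
        output.append(row)
    return output

image = [
    [1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

result = conv2d_valid(image, kernel)
for row in result:
    print(row)
# [5, 0, 5]
# [0, 5, 0]
# [5, 0, 5]
```

The kernel responds with 5 wherever the patch matches its own checkerboard pattern and with 0 where the pattern is shifted by one pixel. Note that, as in deep learning frameworks, the kernel is not flipped, so this is strictly a cross-correlation.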

Key properties:
- Parameter sharing: Same kernel across all spatial positions → translation equivariance
- Locality: Each output depends only on a local patch → inductive bias for spatial data
- Depth: Stack layers → grow the receptive field (area of input that affects each output)

1.2 Key Hyperparameters

Parameter	Effect	Notes
Kernel size $k$	Local patch size	n/a
Stride $s$	Step size	$s > 1$ reduces spatial dims by roughly a factor of $s$
Padding $p$	Border handling	Preserves spatial dims if $p = \lfloor k/2 \rfloor$ (odd $k$, $s = 1$, $d = 1$)
Dilation $d$	Gaps in kernel	Larger receptive field without more params

Output size formula:

$$H_{\text{out}} = \left\lfloor \frac{H_{\text{in}} + 2p - d(k - 1) - 1}{s} + 1 \right\rfloor$$
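As a quick sanity check, the formula can be evaluated for the layer shapes used later in this lesson. This is a minimal sketch; `conv_output_size` is a hypothetical helper name, and it uses integer floor division, which gives the same result as the floor in the formula.

```python
def conv_output_size(h_in, k, s=1, p=0, d=1):
    """Spatial output size of a convolution along one dimension."""
    # floor(x / s + 1) equals floor(x / s) + 1, so integer division is enough
    return (h_in + 2 * p - d * (k - 1) - 1) // s + 1

print(conv_output_size(5, k=3))                 # 3  (the 5x5 -> 3x3 example)
print(conv_output_size(32, k=3, s=1, p=1))      # 32 ("same" padding)
print(conv_output_size(32, k=3, s=2, p=1))      # 16 (stride-2 downsampling)
print(conv_output_size(32, k=1, s=2, p=0))      # 16 (1x1 shortcut conv, stride 2)
print(conv_output_size(32, k=3, p=2, d=2))      # 32 (dilated, padded to match)
```

The third and fourth lines show why a stride-2 3×3 convolution with padding 1 and a stride-2 1×1 convolution without padding produce matching shapes, which a residual block needs in order to add the two branches.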

1.3 The Feature Hierarchy

CNNs learn a hierarchy of increasingly abstract features:

Layer 1:   Edges, gradients, colors
              ╱  ╲
Layer 2:   Corners, textures, simple shapes
              ╱  ╲
Layer 3:   Parts (eyes, wheels, handles)
              ╱  ╲
Layer 4:   Objects (faces, cars, cups)
              ╱  ╲
Layer 5:   Scenes, contexts, relationships

This was first visualized by Zeiler & Fergus (2013) — each layer learns progressively more complex features by composing simpler ones from the layer below.

2. Theory: The Depth Problem

2.1 The Degradation Problem

Before ResNets (2015), practitioners observed a paradox:

Network depth	Expected training error	Observed training error
20 layers	Baseline	Low
56 layers	Lower (more capacity)	Higher (degradation!)

This is not overfitting: training error goes up, not just test error. A 56-layer network should be able to represent a 20-layer network by setting the extra layers to identity, yet in practice SGD fails to find that solution.

2.2 Why Plain Deep Networks Fail

The gradient of the loss $\mathcal{L}$ with respect to an early activation in an $L$-layer network is a product of many Jacobians:

$$\frac{\partial \mathcal{L}}{\partial x_1} = \frac{\partial \mathcal{L}}{\partial x_L} \prod_{\ell=1}^{L-1} \frac{\partial x_{\ell+1}}{\partial x_\ell}$$

If each factor's spectral norm is:
- < 1: gradients vanish exponentially with depth → early layers don't learn
- > 1: gradients can explode exponentially → training diverges
- = 1: ideal, but hard to maintain through nonlinearities
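To get a feel for how quickly this happens, the sketch below replaces each Jacobian with a single scalar factor, which is a deliberate simplification (real Jacobians are matrices and vary from layer to layer). The function name `gradient_scale` is illustrative.

```python
def gradient_scale(factor, depth):
    """Scale applied to a gradient after passing through `depth` layers,
    assuming every layer multiplies it by the same scalar `factor`."""
    scale = 1.0
    for _ in range(depth):
        scale *= factor
    return scale

for factor in (0.9, 1.0, 1.1):
    print(f"factor={factor}: scale after 50 layers = {gradient_scale(factor, 50):.3e}")
```

Even a per-layer factor as mild as 0.9 shrinks the gradient to about half a percent of its original size after 50 layers, while 1.1 amplifies it by more than a factor of 100.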

3. Theory: Residual Connections

3.1 The ResNet Idea

Instead of learning $H(x)$ directly, learn the residual $F(x) = H(x) - x$:

$$y = F(x) + x$$

        x ─────────────────────┐
        │                      │ (skip / shortcut)
        ▼                      │
   ┌──────────┐                │
   │ Conv 3×3 │                │
   │ BN + ReLU│                │
   └────┬─────┘                │
        │                      │
   ┌────▼─────┐                │
   │ Conv 3×3 │                │
   │   BN     │                │
   └────┬─────┘                │
        │                      │
        ▼                      ▼
       (+) ◀───────────────────┘
        │
      ReLU
        │
        ▼
     output

3.2 Three Perspectives on Why Residuals Work

Perspective 1: Gradient Highway

The gradient through a residual block:

$$\frac{\partial y}{\partial x} = \frac{\partial F(x)}{\partial x} + I$$

The identity term $I$ creates a gradient highway: gradients can flow directly to early layers regardless of how deep the network is. Even if $\frac{\partial F}{\partial x} \approx 0$, the block's Jacobian stays close to $I$, so the upstream gradient passes through almost unchanged instead of vanishing.

Through $L$ residual blocks (writing $\mathcal{L}$ for the loss to keep it distinct from the depth $L$):

$$\frac{\partial \mathcal{L}}{\partial x_0} = \frac{\partial \mathcal{L}}{\partial x_L} \prod_{\ell=0}^{L-1}\left(I + \frac{\partial F_\ell}{\partial x_\ell}\right)$$

Expanding this product yields $2^L$ terms — gradients flow through exponentially many paths of different lengths.
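The path-counting claim can be checked numerically in a scalar setting, where each block's Jacobian $\frac{\partial F_\ell}{\partial x_\ell}$ is a single number. The values in `f_grads` are made up for illustration, and `expand_paths` is a hypothetical helper.

```python
from itertools import combinations
from math import prod

def expand_paths(f_grads):
    """Expand prod(1 + a_l) into one term per path.

    Each path is a subset of blocks whose residual branch F the gradient
    passes through; in all other blocks it takes the identity shortcut.
    """
    n = len(f_grads)
    terms = []
    for r in range(n + 1):
        for subset in combinations(range(n), r):
            terms.append(prod(f_grads[i] for i in subset))
    return terms

f_grads = [0.5, -0.25, 0.1, 0.2]           # scalar stand-ins for dF/dx in 4 blocks
terms = expand_paths(f_grads)
closed_form = prod(1 + a for a in f_grads)

print(len(terms))                           # 16 paths = 2**4
print(abs(sum(terms) - closed_form) < 1e-12)  # True
```

The first term is the empty subset, which equals 1: the path that uses only shortcuts. That term is what survives even when every residual branch contributes a tiny gradient.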

Perspective 2: Ensemble of Shallow Networks

Veit et al. (2016) showed that a ResNet with $L$ residual blocks behaves like an ensemble of $2^L$ paths of varying lengths, most of them relatively shallow. Removing a single residual block at test time barely hurts performance, unlike a plain network, where removing a layer is catastrophic.

Perspective 3: Making Identity Easy

If the optimal mapping is close to identity (common in deep networks), learning $F(x) \approx 0$ is much easier than learning $H(x) \approx x$ from scratch. The skip connection provides the right prior.

3.3 ResNet Block Variants

import torch.nn as nn

class BasicBlock(nn.Module):
    """ResNet basic block (used in ResNet-18, ResNet-34)."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)

        # Shortcut for dimension mismatch
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        identity = self.shortcut(x)

        out = nn.functional.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))

        out += identity  # THE residual connection
        out = nn.functional.relu(out)
        return out

4. Implementation: ResNet-18 from Scratch

Build a complete ResNet-18 for CIFAR-10 — no torchvision.models.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResNet18(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()

        # Initial conv (CIFAR-10: 32×32, so use 3×3 instead of 7×7)
        self.conv1 = nn.Conv2d(3, 64, 3, stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(64)

        # 4 groups of residual blocks
        self.layer1 = self._make_layer(64, 64, num_blocks=2, stride=1)
        self.layer2 = self._make_layer(64, 128, num_blocks=2, stride=2)
        self.layer3 = self._make_layer(128, 256, num_blocks=2, stride=2)
        self.layer4 = self._make_layer(256, 512, num_blocks=2, stride=2)

        self.avg_pool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512, num_classes)

    def _make_layer(self, in_ch, out_ch, num_blocks, stride):
        layers = [BasicBlock(in_ch, out_ch, stride)]
        for _ in range(1, num_blocks):
            layers.append(BasicBlock(out_ch, out_ch, stride=1))
        return nn.Sequential(*layers)

    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))  # 32×32
        x = self.layer1(x)                    # 32×32
        x = self.layer2(x)                    # 16×16
        x = self.layer3(x)                    # 8×8
        x = self.layer4(x)                    # 4×4
        x = self.avg_pool(x)                  # 1×1
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x

# Count parameters
model = ResNet18()
n_params = sum(p.numel() for p in model.parameters())
print(f"ResNet-18 parameters: {n_params:,}")  # ~11.2M

4.1 Training on CIFAR-10

import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader

transform_train = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

transform_test = T.Compose([
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

trainset = torchvision.datasets.CIFAR10('./data', train=True, download=True,
                                         transform=transform_train)
testset = torchvision.datasets.CIFAR10('./data', train=False,
                                        transform=transform_test)

train_loader = DataLoader(trainset, batch_size=128, shuffle=True, num_workers=2)
test_loader = DataLoader(testset, batch_size=256, shuffle=False, num_workers=2)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = ResNet18().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)
criterion = nn.CrossEntropyLoss()

for epoch in range(200):
    model.train()
    total_loss, correct, total = 0, 0, 0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)

        outputs = model(images)
        loss = criterion(outputs, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()

    scheduler.step()

    if (epoch + 1) % 20 == 0:
        acc = 100. * correct / total
        print(f"Epoch {epoch+1:3d} | Loss: {total_loss/len(train_loader):.3f} | "
              f"Train acc: {acc:.1f}%")
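The loop above reports training accuracy only, and `test_loader` is never used. A held-out evaluation could look like the sketch below; the `evaluate` helper and the toy model and data in the smoke test are assumptions for illustration, not part of the original training script.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def evaluate(model, loader, device):
    """Return top-1 accuracy (in percent) of `model` over `loader`."""
    model.eval()  # BatchNorm uses running statistics instead of batch statistics
    correct, total = 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        predicted = model(images).argmax(dim=1)
        correct += predicted.eq(labels).sum().item()
        total += labels.size(0)
    return 100.0 * correct / total

# Smoke test on a hypothetical stand-in model and random CIFAR-shaped batches
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
toy_loader = [(torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,)))
              for _ in range(2)]
toy_acc = evaluate(toy_model, toy_loader, torch.device("cpu"))
print(f"Toy accuracy: {toy_acc:.1f}%")
```

In the training script, a call such as `evaluate(model, test_loader, device)` next to the periodic print would report test accuracy. The helper leaves the model in eval mode; the training loop already switches back with `model.train()` at the start of each epoch.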

4.2 Comparison: Plain Network vs ResNet

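class PlainNet(nn.Module):
    """Plain stack of 3×3 conv layers WITHOUT skip connections.

    Note: constant 64 channels and no downsampling, so this is a depth-matched
    baseline rather than a layer-for-layer copy of ResNet18.
    """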

    def __init__(self, num_classes=10, depth=18):
        super().__init__()
        layers = [nn.Conv2d(3, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU()]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU()]
        self.features = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(64, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = self.pool(x)
        x = x.view(x.size(0), -1)
        return self.fc(x)

5. Exercises

Exercise 1: Gradient Norm Experiment

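Compare gradient norms in a 50-layer plain network (e.g. PlainNet(depth=50)) vs a ResNet of similar depth (ResNet18 as written has a fixed depth, so you will need to make the number of blocks per group configurable):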

import torch.nn.functional as F

def measure_gradient_norms(model, x, y):
    """Return per-layer gradient norms after one forward+backward pass."""
    model.zero_grad()  # clear gradients left over from earlier calls
    out = model(x)
    loss = F.cross_entropy(out, y)
    loss.backward()

    norms = []
    for name, param in model.named_parameters():
        # Keep conv/linear weights; skip biases and BatchNorm parameters (1-D)
        if 'weight' in name and param.grad is not None and param.dim() >= 2:
            norms.append((name, param.grad.norm().item()))
    return norms

Expected observation: Plain network gradient norms tend to decay roughly exponentially from output to input. ResNet gradient norms stay relatively stable across all layers.

Exercise 2: Parameter Counting

Count parameters in ResNet-18 layer by layer. Fill in:

Component	Shape	Parameters
conv1	3→64, 3×3	?
layer1 block0 conv1	64→64, 3×3	?
layer1 block0 conv2	64→64, 3×3	?
...	...	...
fc	512→10	?
Total		?
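The counting rules you need for this table can be written down as three small helpers. This is a sketch with hypothetical function names; the example values are deliberately not rows of the table above.

```python
def conv2d_params(c_in, c_out, k, bias=False):
    """Weights of a k×k convolution, plus one bias per output channel if enabled."""
    return c_in * c_out * k * k + (c_out if bias else 0)

def batchnorm_params(channels):
    """Learnable scale (gamma) and shift (beta) per channel.

    Running mean and variance are buffers, so model.parameters() does not count them.
    """
    return 2 * channels

def linear_params(f_in, f_out, bias=True):
    """Weight matrix plus optional bias vector."""
    return f_in * f_out + (f_out if bias else 0)

print(conv2d_params(64, 128, 1))   # 8192  (a 1×1 shortcut conv)
print(batchnorm_params(128))       # 256
print(linear_params(256, 4))       # 1028
```

Summing these over every layer should reproduce the total printed by the `n_params` snippet in Section 4.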

Exercise 3: Receptive Field Calculation

For ResNet-18 on CIFAR-10, what is the theoretical receptive field at the output of each layer group? Use the formula:

$$r_\ell = r_{\ell-1} + (k_\ell - 1) \times \prod_{i=1}^{\ell-1} s_i$$

Does it cover the entire 32×32 input?
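The recurrence is easy to mechanize. The sketch below applies it to a made-up three-layer stack rather than to ResNet-18, so the exercise stays open; `receptive_field` is a hypothetical helper and starts from $r_0 = 1$ (a single input pixel).

```python
def receptive_field(layers):
    """Receptive field after each layer.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    """
    r, jump = 1, 1          # jump = product of strides of all earlier layers
    sizes = []
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
        sizes.append(r)
    return sizes

toy_stack = [(3, 1), (3, 2), (3, 1)]
print(receptive_field(toy_stack))  # [3, 5, 9]
```

For ResNet-18, list the 3×3 convolutions along the main branch; the 1×1 shortcut convolutions have $k - 1 = 0$ and do not enlarge the receptive field.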

Exercise 4: Visualization

After training, visualize the first-layer convolutional filters (64 filters of shape 3×3×3). What patterns do you see? Compare with known Gabor-like edge detectors.

6. Key Takeaways

CNNs exploit spatial locality and translation equivariance through shared kernels
Deeper networks learn more abstract feature hierarchies, but degradation makes plain networks much deeper than ~20 layers train worse
The residual connection $y = F(x) + x$ addresses degradation by creating gradient highways
Gradient through a residual block: $\frac{\partial y}{\partial x} = \frac{\partial F}{\partial x} + I$; the identity term keeps a direct path open for gradient flow
ResNets with $L$ residual blocks behave like ensembles of $2^L$ mostly shallow paths, which makes them robust to removing a block
Batch normalization stabilizes training by normalizing activations
Skip connections appear in nearly every modern architecture: Transformers, ViTs, U-Nets, VLAs

7. The Thread: Compression = Prediction = Intelligence

Residual connections encode a powerful prior: most of the information should pass through unchanged. The network only needs to learn the delta — what to add or subtract. This is a form of compression: instead of re-encoding all information at every layer, you only encode the new information.

This mirrors what differential coding does in signal processing, what inter-frame prediction does in video compression (P-frames store the residual against a predicted frame), and what LoRA will do for efficient fine-tuning (Day 35). The pattern repeats: compress the change, not the whole signal.

8. Further Reading

He et al., "Deep Residual Learning for Image Recognition" (2015) — The original ResNet paper
Veit et al., "Residual Networks Behave Like Ensembles of Relatively Shallow Networks" (2016) — The ensemble interpretation
He et al., "Identity Mappings in Deep Residual Networks" (2016) — Pre-activation ResNets
Zeiler & Fergus, "Visualizing and Understanding Convolutional Networks" (2013) — Feature visualization
Kaiming initialization — torch.nn.init.kaiming_normal_ — essential for training deep networks