Consider a modest 256×256 color image. Flattened, it has 256 × 256 × 3 = 196,608 input values. A fully connected hidden layer with just 1,000 neurons would require nearly 200 million parameters — for a single layer. This is computationally prohibitive, impossible to train with limited data, and ignores the spatial structure of images entirely.
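The arithmetic, spelled out as a trivial sketch:

```python
inputs = 256 * 256 * 3      # 196,608 flattened input values
hidden = 1_000              # neurons in one fully connected hidden layer
weights = inputs * hidden   # weight matrix alone, ignoring the 1,000 biases
print(f"{weights:,}")       # 196,608,000: nearly 200 million parameters
```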
Images have three fundamental properties that fully connected networks fail to exploit:

- Locality: nearby pixels are strongly related, while distant pixels usually are not.
- Translation invariance: the same pattern, whether an edge or an eye, can appear anywhere in the image.
- Hierarchy: edges compose into textures, textures into parts, and parts into objects.
A convolution slides a small filter (kernel) across the input, computing a dot product at each position:
\[(f * g)(i, j) = \sum_{m} \sum_{n} f(m, n) \cdot g(i-m, j-n)\]

In practice, a 3×3 kernel has just 9 learnable parameters, yet it can detect a specific local pattern (like a vertical edge) anywhere in the image. This achieves:

- Parameter sharing: the same 9 weights are reused at every spatial position.
- Sparse connectivity: each output value depends only on a small neighborhood of the input.
- Translation equivariance: when the input shifts, the feature map shifts with it, so a pattern is found wherever it appears.
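As a small illustrative sketch (the Sobel-style kernel and tensor sizes are my choices, not from the chapter's code files), the snippet below applies one fixed 3×3 kernel with `torch.nn.functional.conv2d` and gets a strong response only along a vertical edge, wherever that edge sits. Note that deep learning libraries actually implement the cross-correlation variant of the formula above, i.e. without flipping the kernel.

```python
import torch
import torch.nn.functional as F

image = torch.zeros(1, 1, 256, 256)   # (batch, channels, height, width)
image[..., :, 128:] = 1.0             # right half bright: a vertical edge at column 128

sobel_x = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]]).view(1, 1, 3, 3)   # 9 weights in total

response = F.conv2d(image, sobel_x, padding=1)  # the same 9 weights slide over every position
print(response.shape, response.abs().max())     # (1, 1, 256, 256), peak response at the edge
```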
Yann LeCun developed LeNet for handwritten digit recognition at Bell Labs. The architecture was:
Input (32×32) → Conv(5×5, 6 filters) → Pool(2×2) → Conv(5×5, 16 filters) → Pool(2×2) → FC(120) → FC(84) → Output(10)
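A minimal PyTorch sketch of that stack, assuming a 32×32 grayscale input (hedged: the 1998 LeNet-5 used tanh-style activations, average-pooling "subsampling" layers, and sparse connection tables between the convolutional stages, which are omitted here):

```python
import torch
import torch.nn as nn

lenet_like = nn.Sequential(
    nn.Conv2d(1, 6, 5), nn.Tanh(), nn.AvgPool2d(2),    # 32×32 → 28×28 → 14×14
    nn.Conv2d(6, 16, 5), nn.Tanh(), nn.AvgPool2d(2),   # 14×14 → 10×10 → 5×5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
    nn.Linear(120, 84), nn.Tanh(),
    nn.Linear(84, 10),
)
print(lenet_like(torch.randn(1, 1, 32, 32)).shape)      # torch.Size([1, 10])
```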
LeNet was deployed at scale: it read handwritten zip codes for the US Postal Service and processed millions of checks for banks. This was arguably the first commercially successful deep learning system.
LeNet used average pooling (computing the mean of a local region). Later, max pooling (taking the maximum) became standard, as it acts as a stronger feature detector — “is this feature present anywhere in this region?”
\[\text{MaxPool}(x)_{i,j} = \max_{(m,n) \in \text{window}} x_{i+m, j+n}\]

Despite LeNet's success, neural networks fell out of favor in the 2000s. Support Vector Machines (SVMs) dominated computer vision competitions because:

- They solve a convex optimization problem with strong theoretical guarantees, so results were reproducible and required little tuning.
- Combined with hand-crafted features such as SIFT and HOG, they delivered state-of-the-art accuracy on the small datasets of the era.
- Neural networks, by contrast, were slow to train and notoriously sensitive to initialization and hyperparameters.
Two things were missing: enough data and enough compute.
Fei-Fei Li’s ImageNet project assembled 14 million labeled images across 22,000 categories. The annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC), started in 2010, used a subset of 1.2 million images across 1,000 categories.
This was the dataset that deep networks needed. Large, diverse, and challenging enough that hand-crafted features couldn’t hack it.
In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton submitted AlexNet to ILSVRC and won by a landslide — 15.3% top-5 error versus 26.2% for the second-place entry using traditional methods. This nearly halved the error rate overnight.
Input (224×224×3)
→ Conv(11×11, 96 filters, stride 4) → ReLU → LRN → MaxPool
→ Conv(5×5, 256 filters) → ReLU → LRN → MaxPool
→ Conv(3×3, 384 filters) → ReLU
→ Conv(3×3, 384 filters) → ReLU
→ Conv(3×3, 256 filters) → ReLU → MaxPool
→ FC(4096) → ReLU → Dropout
→ FC(4096) → ReLU → Dropout
→ FC(1000) → Softmax
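For concreteness, here is a hedged single-GPU PyTorch sketch of that stack. The padding values are my assumptions, chosen so a 224×224 input reaches the standard 6×6×256 feature map; the 2012 original split channels across two GPUs with grouped convolutions, which is omitted here.

```python
import torch.nn as nn

alexnet_like = nn.Sequential(
    nn.Conv2d(3, 96, 11, stride=4, padding=2), nn.ReLU(),
    nn.LocalResponseNorm(5), nn.MaxPool2d(3, stride=2),   # 224 → 55 → 27
    nn.Conv2d(96, 256, 5, padding=2), nn.ReLU(),
    nn.LocalResponseNorm(5), nn.MaxPool2d(3, stride=2),   # 27 → 13
    nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),                            # 13 → 6
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(), nn.Dropout(),
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(),
    nn.Linear(4096, 1000),   # logits; the softmax is applied by the loss during training
)
```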
AlexNet didn't just win a competition; it ended an era. Within two years:

- Virtually every competitive ILSVRC entry was a deep convolutional network.
- Hand-crafted feature pipelines disappeared from the state of the art in image classification.
- Top-5 error on ImageNet dropped below 10% (GoogLeNet and VGG, 2014).
Karen Simonyan and Andrew Zisserman showed that using very small (3×3) filters consistently throughout the network was better than large filters. Key insight: two 3×3 convolutions have the same receptive field as one 5×5, but with fewer parameters and more non-linearity.
VGG-16 and VGG-19 used this principle to reach 16 and 19 layers. The uniform architecture made it easy to understand and replicate.
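A quick back-of-the-envelope check of that claim (the channel width of 128 is illustrative, not a value from the paper):

```python
def conv_params(k, c_in, c_out, bias=True):
    """Parameter count of a single k×k convolution layer."""
    return k * k * c_in * c_out + (c_out if bias else 0)

c = 128
print(conv_params(5, c, c))       # 409,728 parameters for one 5×5 convolution
print(2 * conv_params(3, c, c))   # 295,168 for two stacked 3×3 convolutions: same
                                  # 5×5 receptive field, ~28% fewer weights,
                                  # plus an extra non-linearity in between
```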
The Inception module was a radically different approach: instead of choosing a single filter size, apply multiple filter sizes in parallel (1×1, 3×3, 5×5) and concatenate the results. The network learns which filter size is best for each region.
The clever use of 1×1 convolutions for dimensionality reduction (“bottleneck layers”) dramatically reduced computation while maintaining representational power.
A 1×1 convolution might seem pointless, since it doesn't look at spatial neighbors at all. But it performs a learned linear combination across channels, acting as a per-pixel fully connected layer. Uses:

- Reducing the number of channels before an expensive 3×3 or 5×5 convolution (the bottleneck trick).
- Adding extra non-linearity at negligible cost.
- Mixing information across channels without changing the spatial dimensions.
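A hedged sketch of an Inception-style module in PyTorch; the branch widths below are illustrative rather than the exact GoogLeNet numbers, but the structure (parallel filter sizes, 1×1 bottlenecks, channel-wise concatenation) is the point:

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 64, 1)              # 1×1 branch
        self.branch3 = nn.Sequential(                       # 1×1 bottleneck, then 3×3
            nn.Conv2d(in_ch, 48, 1), nn.ReLU(),
            nn.Conv2d(48, 64, 3, padding=1))
        self.branch5 = nn.Sequential(                       # 1×1 bottleneck, then 5×5
            nn.Conv2d(in_ch, 16, 1), nn.ReLU(),
            nn.Conv2d(16, 32, 5, padding=2))
        self.branch_pool = nn.Sequential(                   # pooling branch with 1×1 projection
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, 1))

    def forward(self, x):
        # Every branch preserves height and width, so the outputs can be
        # concatenated along the channel dimension.
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

print(InceptionBlock(192)(torch.randn(1, 192, 28, 28)).shape)   # [1, 192, 28, 28]
```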
Pooling evolved significantly through this era:
| Strategy | Description | Used In |
|---|---|---|
| Average Pooling | Mean of local region | LeNet |
| Max Pooling | Maximum of local region | AlexNet, VGG |
| Strided Convolution | Conv with stride > 1 replaces pooling | Modern networks |
| Global Average Pooling | Average over entire spatial dimension | GoogLeNet, ResNet |
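As a quick shape comparison on a toy 64-channel feature map (tensor sizes are my choices, shapes shown in the comments):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)                               # (batch, channels, H, W)

print(nn.AvgPool2d(2)(x).shape)                              # [1, 64, 16, 16]
print(nn.MaxPool2d(2)(x).shape)                              # [1, 64, 16, 16]
print(nn.Conv2d(64, 64, 3, stride=2, padding=1)(x).shape)    # [1, 64, 16, 16], learned downsampling
print(nn.AdaptiveAvgPool2d(1)(x).shape)                      # [1, 64, 1, 1], global average pooling
```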
Global Average Pooling (GAP) deserves special mention: instead of flattening feature maps into a huge vector and using fully connected layers, GAP simply averages each feature map to a single number. This eliminates most parameters and reduces overfitting.
# See code/ch03_cnn.py for the full training pipeline
import torch.nn as nn
class SimpleCNN(nn.Module):
def __init__(self, num_classes=10):
super().__init__()
self.features = nn.Sequential(
nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
nn.MaxPool2d(2),
nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
nn.MaxPool2d(2),
nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
nn.AdaptiveAvgPool2d(1), # Global Average Pooling
)
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.features(x)                  # (N, 128, 1, 1) after global average pooling
        return self.classifier(x.flatten(1))  # (N, num_classes)
By 2014, a clear trend had emerged: deeper networks performed better. Depth grew from AlexNet's 8 layers to VGG's 19, and GoogLeNet reached 22. But there was a wall.
Training networks deeper than ~20 layers didn’t just get harder — performance actually got worse. A 56-layer network performed worse than a 20-layer network, not because of overfitting, but because the optimization couldn’t find a good solution.
This “degradation problem” wasn’t caused by vanishing gradients alone (Batch Normalization had partially addressed that). Something more fundamental was wrong. The solution would come in 2015, and it would be surprisingly simple.
Previous Chapter: Activation Functions — The Gates That Shape Gradients
Next Chapter: The Vanishing Gradient Problem and Its Solutions