Consider a modest 256×256 color image. Flattened, it has 256 × 256 × 3 = 196,608 input values. A fully connected hidden layer with just 1,000 neurons would require nearly 200 million parameters — for a single layer. This is computationally prohibitive, impossible to train with limited data, and ignores the spatial structure of images entirely.
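The arithmetic, spelled out as a trivial sketch:

```python
inputs = 256 * 256 * 3      # 196,608 flattened input values
hidden = 1_000              # neurons in one fully connected hidden layer
weights = inputs * hidden   # weight matrix alone, ignoring the 1,000 biases
print(f"{weights:,}")       # 196,608,000: nearly 200 million parameters
```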
Images have three fundamental properties that fully connected networks fail to exploit:

- Locality: nearby pixels are strongly related, while distant pixels usually are not.
- Translation invariance: the same pattern, whether an edge or an eye, can appear anywhere in the image.
- Hierarchy: edges compose into textures, textures into parts, and parts into objects.
A convolution slides a small filter (kernel) across the input, computing a dot product at each position:
\[(f * g)(i, j) = \sum_{m} \sum_{n} f(m, n) \cdot g(i-m, j-n)\]

In practice, a 3×3 kernel has just 9 learnable parameters, yet it can detect a specific local pattern (like a vertical edge) anywhere in the image. This achieves:

- Parameter sharing: the same 9 weights are reused at every spatial position.
- Sparse connectivity: each output value depends only on a small neighborhood of the input.
- Translation equivariance: when the input shifts, the feature map shifts with it, so a pattern is found wherever it appears.
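As a small illustrative sketch (the Sobel-style kernel and tensor sizes are my choices, not from the chapter's code files), the snippet below applies one fixed 3×3 kernel with `torch.nn.functional.conv2d` and gets a strong response only along a vertical edge, wherever that edge sits. Note that deep learning libraries actually implement the cross-correlation variant of the formula above, i.e. without flipping the kernel.

```python
import torch
import torch.nn.functional as F

image = torch.zeros(1, 1, 256, 256)   # (batch, channels, height, width)
image[..., :, 128:] = 1.0             # right half bright: a vertical edge at column 128

sobel_x = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]]).view(1, 1, 3, 3)   # 9 weights in total

response = F.conv2d(image, sobel_x, padding=1)  # the same 9 weights slide over every position
print(response.shape, response.abs().max())     # (1, 1, 256, 256), peak response at the edge
```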
Yann LeCun developed LeNet for handwritten digit recognition at Bell Labs. The architecture was:
Input (32×32) → Conv(5×5, 6 filters) → Pool(2×2) → Conv(5×5, 16 filters) → Pool(2×2) → FC(120) → FC(84) → Output(10)
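A minimal PyTorch sketch of that stack, assuming a 32×32 grayscale input (hedged: the 1998 LeNet-5 used tanh-style activations, average-pooling "subsampling" layers, and sparse connection tables between the convolutional stages, which are omitted here):

```python
import torch
import torch.nn as nn

lenet_like = nn.Sequential(
    nn.Conv2d(1, 6, 5), nn.Tanh(), nn.AvgPool2d(2),    # 32×32 → 28×28 → 14×14
    nn.Conv2d(6, 16, 5), nn.Tanh(), nn.AvgPool2d(2),   # 14×14 → 10×10 → 5×5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
    nn.Linear(120, 84), nn.Tanh(),
    nn.Linear(84, 10),
)
print(lenet_like(torch.randn(1, 1, 32, 32)).shape)      # torch.Size([1, 10])
```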
LeNet was deployed at scale: it read handwritten zip codes for the US Postal Service and processed millions of checks for banks. This was arguably the first commercially successful deep learning system.
LeNet used average pooling (computing the mean of a local region). Later, max pooling (taking the maximum) became standard, as it acts as a stronger feature detector — “is this feature present anywhere in this region?”
\[\text{MaxPool}(x)_{i,j} = \max_{(m,n) \in \text{window}} x_{i+m, j+n}\]

Despite LeNet's success, neural networks fell out of favor in the 2000s. Support Vector Machines (SVMs) dominated computer vision competitions because:

- They solve a convex optimization problem with strong theoretical guarantees, so results were reproducible and required little tuning.
- Combined with hand-crafted features such as SIFT and HOG, they delivered state-of-the-art accuracy on the small datasets of the era.
- Neural networks, by contrast, were slow to train and notoriously sensitive to initialization and hyperparameters.
Two things were missing: enough data and enough compute.
Fei-Fei Li’s ImageNet project assembled 14 million labeled images across 22,000 categories. The annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC), started in 2010, used a subset of 1.2 million images across 1,000 categories.
This was the dataset that deep networks needed. Large, diverse, and challenging enough that hand-crafted features couldn’t hack it.
In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton submitted AlexNet to ILSVRC and won by a landslide — 15.3% top-5 error versus 26.2% for the second-place entry using traditional methods. This nearly halved the error rate overnight.
Input (224×224×3)
→ Conv(11×11, 96 filters, stride 4) → ReLU → LRN → MaxPool
→ Conv(5×5, 256 filters) → ReLU → LRN → MaxPool
→ Conv(3×3, 384 filters) → ReLU
→ Conv(3×3, 384 filters) → ReLU
→ Conv(3×3, 256 filters) → ReLU → MaxPool
→ FC(4096) → ReLU → Dropout
→ FC(4096) → ReLU → Dropout
→ FC(1000) → Softmax
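For concreteness, here is a hedged single-GPU PyTorch sketch of that stack. The padding values are my assumptions, chosen so a 224×224 input reaches the standard 6×6×256 feature map; the 2012 original split channels across two GPUs with grouped convolutions, which is omitted here.

```python
import torch.nn as nn

alexnet_like = nn.Sequential(
    nn.Conv2d(3, 96, 11, stride=4, padding=2), nn.ReLU(),
    nn.LocalResponseNorm(5), nn.MaxPool2d(3, stride=2),   # 224 → 55 → 27
    nn.Conv2d(96, 256, 5, padding=2), nn.ReLU(),
    nn.LocalResponseNorm(5), nn.MaxPool2d(3, stride=2),   # 27 → 13
    nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),                            # 13 → 6
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(), nn.Dropout(),
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(),
    nn.Linear(4096, 1000),   # logits; the softmax is applied by the loss during training
)
```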
AlexNet didn't just win a competition; it ended an era. Within two years:

- Virtually every competitive ILSVRC entry was a deep convolutional network.
- Hand-crafted feature pipelines disappeared from the state of the art in image classification.
- Top-5 error on ImageNet dropped below 10% (GoogLeNet and VGG, 2014).
Karen Simonyan and Andrew Zisserman showed that using very small (3×3) filters consistently throughout the network was better than large filters. Key insight: two 3×3 convolutions have the same receptive field as one 5×5, but with fewer parameters and more non-linearity.
VGG-16 and VGG-19 used this principle to reach 16 and 19 layers. The uniform architecture made it easy to understand and replicate.
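A quick back-of-the-envelope check of that claim (the channel width of 128 is illustrative, not a value from the paper):

```python
def conv_params(k, c_in, c_out, bias=True):
    """Parameter count of a single k×k convolution layer."""
    return k * k * c_in * c_out + (c_out if bias else 0)

c = 128
print(conv_params(5, c, c))       # 409,728 parameters for one 5×5 convolution
print(2 * conv_params(3, c, c))   # 295,168 for two stacked 3×3 convolutions: same
                                  # 5×5 receptive field, ~28% fewer weights,
                                  # plus an extra non-linearity in between
```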
The Inception module was a radically different approach: instead of choosing a single filter size, apply multiple filter sizes in parallel (1×1, 3×3, 5×5) and concatenate the results. The network learns which filter size is best for each region.
The clever use of 1×1 convolutions for dimensionality reduction (“bottleneck layers”) dramatically reduced computation while maintaining representational power.
A 1×1 convolution might seem pointless, since it doesn't look at spatial neighbors at all. But it performs a learned linear combination across channels, acting as a per-pixel fully connected layer. Uses:

- Reducing the number of channels before an expensive 3×3 or 5×5 convolution (the bottleneck trick).
- Adding extra non-linearity at negligible cost.
- Mixing information across channels without changing the spatial dimensions.
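A hedged sketch of an Inception-style module in PyTorch; the branch widths below are illustrative rather than the exact GoogLeNet numbers, but the structure (parallel filter sizes, 1×1 bottlenecks, channel-wise concatenation) is the point:

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 64, 1)              # 1×1 branch
        self.branch3 = nn.Sequential(                       # 1×1 bottleneck, then 3×3
            nn.Conv2d(in_ch, 48, 1), nn.ReLU(),
            nn.Conv2d(48, 64, 3, padding=1))
        self.branch5 = nn.Sequential(                       # 1×1 bottleneck, then 5×5
            nn.Conv2d(in_ch, 16, 1), nn.ReLU(),
            nn.Conv2d(16, 32, 5, padding=2))
        self.branch_pool = nn.Sequential(                   # pooling branch with 1×1 projection
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, 1))

    def forward(self, x):
        # Every branch preserves height and width, so the outputs can be
        # concatenated along the channel dimension.
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

print(InceptionBlock(192)(torch.randn(1, 192, 28, 28)).shape)   # [1, 192, 28, 28]
```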
Pooling evolved significantly through this era:
| Strategy | Description | Used In |
|---|---|---|
| Average Pooling | Mean of local region | LeNet |
| Max Pooling | Maximum of local region | AlexNet, VGG |
| Strided Convolution | Conv with stride > 1 replaces pooling | Modern networks |
| Global Average Pooling | Average over entire spatial dimension | GoogLeNet, ResNet |
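As a quick shape comparison on a toy 64-channel feature map (tensor sizes are my choices, shapes shown in the comments):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)                               # (batch, channels, H, W)

print(nn.AvgPool2d(2)(x).shape)                              # [1, 64, 16, 16]
print(nn.MaxPool2d(2)(x).shape)                              # [1, 64, 16, 16]
print(nn.Conv2d(64, 64, 3, stride=2, padding=1)(x).shape)    # [1, 64, 16, 16], learned downsampling
print(nn.AdaptiveAvgPool2d(1)(x).shape)                      # [1, 64, 1, 1], global average pooling
```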
Global Average Pooling (GAP) deserves special mention: instead of flattening feature maps into a huge vector and using fully connected layers, GAP simply averages each feature map to a single number. This eliminates most parameters and reduces overfitting.
# See code/ch03_cnn.py for the full training pipeline
import torch.nn as nn
class SimpleCNN(nn.Module):
def __init__(self, num_classes=10):
super().__init__()
self.features = nn.Sequential(
nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
nn.MaxPool2d(2),
nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
nn.MaxPool2d(2),
nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
nn.AdaptiveAvgPool2d(1), # Global Average Pooling
)
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.features(x)                  # (N, 128, 1, 1) after global average pooling
        return self.classifier(x.flatten(1))  # (N, num_classes)
By 2014, a clear trend had emerged: deeper networks performed better. Depth grew from AlexNet's 8 layers to VGG's 19, and GoogLeNet reached 22. But there was a wall.
Training networks deeper than ~20 layers didn’t just get harder — performance actually got worse. A 56-layer network performed worse than a 20-layer network, not because of overfitting, but because the optimization couldn’t find a good solution.
This “degradation problem” wasn’t caused by vanishing gradients alone (Batch Normalization had partially addressed that). Something more fundamental was wrong. The solution would come in 2015, and it would be surprisingly simple.
Previous Chapter: Activation Functions — The Gates That Shape Gradients
Next Chapter: The Vanishing Gradient Problem and Its Solutions