Chapter 3: Convolutional Neural Networks — From LeNet to AlexNet (1989–2012)

The Problem with Fully Connected Networks for Images

Consider a modest 256×256 color image. Flattened, it has 256 × 256 × 3 = 196,608 input values. A fully connected hidden layer with just 1,000 neurons would require nearly 200 million parameters — for a single layer. This is computationally prohibitive, impossible to train with limited data, and ignores the spatial structure of images entirely.
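The arithmetic above can be checked directly; a quick plain-Python sketch (counting weights plus one bias per neuron):

```python
# Parameter count for one fully connected layer on a flattened color image.
height, width, channels = 256, 256, 3
inputs = height * width * channels          # 196,608 input values
hidden = 1_000                              # neurons in the hidden layer

weights = inputs * hidden                   # one weight per input-neuron pair
biases = hidden                             # one bias per neuron
print(f"{weights + biases:,} parameters")   # 196,609,000 parameters
```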

Images have three fundamental properties that fully connected networks waste:

  1. Local patterns: Edges, textures, and small shapes are defined by neighboring pixels
  2. Translation invariance: A cat is still a cat whether it’s in the top-left or bottom-right of the image
  3. Hierarchical structure: Complex features (faces) are composed of simpler features (eyes, noses), which are composed of even simpler features (edges, curves)

The Convolution Operation

A convolution slides a small filter (kernel) across the input, computing a dot product at each position:

\[(f * g)(i, j) = \sum_{m} \sum_{n} f(m, n) \cdot g(i-m, j-n)\]

In practice, a 3×3 kernel has just 9 learnable parameters, yet it can detect a specific local pattern (like a vertical edge) anywhere in the image. This achieves three things at once: parameter sharing (the same 9 weights are reused at every position), locality (each output depends only on a small neighborhood), and translation equivariance (a pattern is detected wherever it appears).
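A minimal sketch of the operation in plain Python (note that deep learning libraries actually compute the un-flipped variant, cross-correlation, which is what this sketch does):

```python
# Naive 2D cross-correlation: slide a kernel over the image and take a dot
# product at each position. The 3x3 kernel below responds to vertical edges.
def conv2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(kernel[m][n] * image[i + m][j + n]
                 for m in range(kh) for n in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

# A 4x4 image: dark on the left, bright on the right (a vertical edge).
image = [[0, 0, 1, 1]] * 4
sobel_x = [[-1, 0, 1],      # classic vertical-edge kernel
           [-2, 0, 2],
           [-1, 0, 1]]
print(conv2d(image, sobel_x))  # [[4, 4], [4, 4]] — strong response at the edge
```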

LeNet-5: The First Convolutional Network (1989–1998)

Yann LeCun developed LeNet for handwritten digit recognition at Bell Labs. The architecture was:

Input (32×32) → Conv(5×5, 6 filters) → Pool(2×2) → Conv(5×5, 16 filters) → Pool(2×2) → FC(120) → FC(84) → Output(10)
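A sketch of this architecture in PyTorch. Two modernizations are assumed here: ReLU in place of the original sigmoid-like activations, and full connectivity between feature maps (the original C3 layer used a hand-designed sparse connection table):

```python
import torch
import torch.nn as nn

# LeNet-5-style network; layer sizes follow the diagram above.
lenet = nn.Sequential(
    nn.Conv2d(1, 6, 5), nn.ReLU(),    # 32x32 -> 28x28, 6 feature maps
    nn.AvgPool2d(2),                  # 28x28 -> 14x14
    nn.Conv2d(6, 16, 5), nn.ReLU(),   # 14x14 -> 10x10, 16 feature maps
    nn.AvgPool2d(2),                  # 10x10 -> 5x5
    nn.Flatten(),                     # 16 * 5 * 5 = 400 values
    nn.Linear(400, 120), nn.ReLU(),
    nn.Linear(120, 84), nn.ReLU(),
    nn.Linear(84, 10),                # 10 digit classes
)

x = torch.randn(1, 1, 32, 32)         # one grayscale 32x32 input
print(lenet(x).shape)                 # torch.Size([1, 10])
```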

Key Design Decisions

  1. Alternating convolution and pooling: Convolutions detect features; pooling reduces spatial dimensions and adds some translation invariance
  2. Feature maps: Multiple filters at each layer, each learning a different feature
  3. Increasing depth: More filters in later layers to capture increasingly complex patterns

LeNet was deployed at scale: it read handwritten zip codes for the US Postal Service and processed millions of checks for banks. This was arguably the first commercially successful deep learning system.

Average Pooling vs. Max Pooling

LeNet used average pooling (computing the mean of a local region). Later, max pooling (taking the maximum) became standard, as it acts as a stronger feature detector — “is this feature present anywhere in this region?”

\[\text{MaxPool}(x_{i,j}) = \max_{(m,n) \in \text{window}} x_{i+m, j+n}\]
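Both operations can be sketched in a few lines of plain Python over non-overlapping 2×2 windows:

```python
# Max vs. average pooling over non-overlapping 2x2 windows.
def pool2x2(fmap, op):
    return [[op([fmap[i][j],     fmap[i][j + 1],
                 fmap[i + 1][j], fmap[i + 1][j + 1]])
             for j in range(0, len(fmap[0]), 2)]
            for i in range(0, len(fmap), 2)]

fmap = [[1, 3, 0, 0],
        [2, 9, 0, 1],
        [4, 4, 6, 5],
        [4, 4, 7, 8]]

avg = lambda xs: sum(xs) / len(xs)
print(pool2x2(fmap, max))  # [[9, 1], [4, 8]]   — "was the feature present?"
print(pool2x2(fmap, avg))  # [[3.75, 0.25], [4.0, 6.5]] — "how strong on average?"
```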

The Dark Years (1998–2012)

Despite LeNet’s success, neural networks fell out of favor in the 2000s. Support Vector Machines (SVMs) dominated computer vision competitions because:

  1. Convex optimization: SVM training reliably finds a global optimum, while training neural networks was seen as a finicky black art
  2. Small-data performance: Paired with hand-crafted features like SIFT and HOG, SVMs performed well on the modest datasets of the era
  3. Cleaner theory: SVMs came with stronger generalization guarantees

Two things were missing for neural networks: enough data and enough compute.

ImageNet: The Dataset That Changed Everything (2009)

Fei-Fei Li’s ImageNet project assembled 14 million labeled images across 22,000 categories. The annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC), started in 2010, used a subset of 1.2 million images across 1,000 categories.

This was the dataset that deep networks needed. Large, diverse, and challenging enough that hand-crafted features couldn’t hack it.

AlexNet: The Big Bang (2012)

In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton submitted AlexNet to ILSVRC and won by a landslide — 15.3% top-5 error versus 26.2% for the second-place entry using traditional methods. This nearly halved the error rate overnight.

Architecture

Input (224×224×3)
→ Conv(11×11, 96 filters, stride 4) → ReLU → LRN → MaxPool
→ Conv(5×5, 256 filters) → ReLU → LRN → MaxPool
→ Conv(3×3, 384 filters) → ReLU
→ Conv(3×3, 384 filters) → ReLU
→ Conv(3×3, 256 filters) → ReLU → MaxPool
→ FC(4096) → ReLU → Dropout
→ FC(4096) → ReLU → Dropout
→ FC(1000) → Softmax
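Quick arithmetic shows where the parameters live: the fully connected layers at the end dominate the total. The 256 × 6 × 6 feature-map size feeding the first FC layer is the figure from the original paper:

```python
# Where AlexNet's ~60M parameters sit: almost all in the FC layers.
fc1 = 256 * 6 * 6 * 4096 + 4096   # flattened conv output -> FC(4096)
fc2 = 4096 * 4096 + 4096          # FC(4096) -> FC(4096)
fc3 = 4096 * 1000 + 1000          # FC(4096) -> 1000-way softmax
print(f"FC parameters: {fc1 + fc2 + fc3:,}")   # FC parameters: 58,631,144
```

Roughly 58.6 million of the ~60 million parameters are in the final three layers, which is exactly why later architectures replaced them with global average pooling.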

What Made AlexNet Work

  1. ReLU activation: 6× faster training than tanh (see Chapter 2)
  2. GPU training: Split across two GTX 580 GPUs with 3GB each. More than any earlier effort, this convinced the field that GPUs could accelerate deep learning
  3. Dropout: Randomly zeroing 50% of neurons during training prevented overfitting (see Chapter 6)
  4. Data augmentation: Random crops, flips, and color jittering artificially expanded the training set
  5. Large scale: 60 million parameters, trained on 1.2 million images

The Impact

AlexNet didn’t just win a competition — it ended an era. Within two years, nearly every competitive ILSVRC entry was a deep convolutional network, and hand-crafted feature pipelines had all but disappeared from the leaderboard.

The Post-AlexNet Revolution (2012–2015)

VGGNet (2014)

Karen Simonyan and Andrew Zisserman showed that using very small (3×3) filters consistently throughout the network was better than large filters. Key insight: two 3×3 convolutions have the same receptive field as one 5×5, but with fewer parameters and more non-linearity.
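The parameter claim is easy to verify for a layer with C input and C output channels (biases ignored):

```python
# Two stacked 3x3 convolutions vs. one 5x5 convolution, C channels in and out.
# Same 5x5 receptive field, fewer parameters, and an extra non-linearity.
C = 256
one_5x5 = 5 * 5 * C * C           # 25 * C^2 weights
two_3x3 = 2 * (3 * 3 * C * C)     # 18 * C^2 weights
print(one_5x5, two_3x3)           # 1638400 1179648
```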

VGG-16 and VGG-19 used this principle to reach 16 and 19 layers. The uniform architecture made it easy to understand and replicate.

GoogLeNet / Inception (2014)

The Inception module was a radically different approach: instead of choosing a single filter size, apply multiple filter sizes in parallel (1×1, 3×3, 5×5) and concatenate the results. The network learns which filter size is best for each region.

The clever use of 1×1 convolutions for dimensionality reduction (“bottleneck layers”) dramatically reduced computation while maintaining representational power.
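The savings can be checked with quick arithmetic. The channel counts below (256 in, 64 in the bottleneck, 256 out) are illustrative, not the exact Inception configuration:

```python
# Multiply-accumulate cost per spatial position for a 5x5 convolution,
# with and without a 1x1 bottleneck. Channel counts are illustrative.
c_in, c_mid, c_out = 256, 64, 256

direct = 5 * 5 * c_in * c_out                  # 5x5 straight: 1,638,400 MACs
bottleneck = (1 * 1 * c_in * c_mid             # 1x1 reduce to 64 ch: 16,384
              + 5 * 5 * c_mid * c_out)         # 5x5 on 64 ch: 409,600
print(f"{direct / bottleneck:.1f}x cheaper")   # 3.8x cheaper
```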

Key Innovation: 1×1 Convolutions

A 1×1 convolution might seem pointless — it doesn’t look at spatial neighbors at all. But it performs a learned linear combination across channels, acting as a per-pixel fully connected layer. Uses:

  1. Dimensionality reduction: Shrink the channel count before an expensive 3×3 or 5×5 convolution
  2. Added non-linearity: Each 1×1 convolution is followed by an activation, deepening the network cheaply
  3. Cross-channel mixing: Recombine features from different channels without touching spatial structure

Pooling Strategies

Pooling evolved significantly through this era:

Strategy                 Description                               Used In
Average Pooling          Mean of local region                      LeNet
Max Pooling              Maximum of local region                   AlexNet, VGG
Strided Convolution      Conv with stride > 1 replaces pooling     Modern networks
Global Average Pooling   Average over entire spatial dimensions    GoogLeNet, ResNet

Global Average Pooling (GAP) deserves special mention: instead of flattening feature maps into a huge vector and using fully connected layers, GAP simply averages each feature map to a single number. This eliminates most parameters and reduces overfitting.
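The savings are dramatic. Using VGG-like numbers as an illustration (7×7×512 final feature maps, 1,000 classes):

```python
# Classifier parameters: flatten + FC(4096) vs. global average pooling + linear.
# The 7x7x512 feature-map size is a VGG-like illustration.
flatten_fc = 7 * 7 * 512 * 4096 + 4096    # ~102.8M parameters in one layer
gap_linear = 512 * 1000 + 1000            # 513,000 parameters total
print(f"{flatten_fc / gap_linear:.0f}x fewer parameters with GAP")  # 200x
```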

Code Example: A Simple CNN in PyTorch

# See code/ch03_cnn.py for the full training pipeline
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # Global Average Pooling
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.features(x)       # (N, 128, 1, 1)
        x = torch.flatten(x, 1)    # (N, 128)
        return self.classifier(x)  # (N, num_classes)

The Depth Problem

By 2014, a clear trend had emerged: deeper networks performed better. Depth grew from 8 layers (AlexNet) to 19 (VGG-19), and GoogLeNet reached 22. But there was a wall.

Training networks deeper than ~20 layers didn’t just get harder — performance actually got worse. A 56-layer network performed worse than a 20-layer network, not because of overfitting, but because the optimization couldn’t find a good solution.

This “degradation problem” wasn’t caused by vanishing gradients alone (Batch Normalization had partially addressed that). Something more fundamental was wrong. The solution would come in 2015, and it would be surprisingly simple.

Key Takeaways

  1. Convolutions exploit locality, weight sharing, and translation structure, replacing millions of per-layer parameters with a handful per filter
  2. LeNet established the convolution–pooling–FC template; AlexNet scaled it up with ReLU, dropout, data augmentation, and GPU training
  3. The 2012 breakthrough required all three ingredients at once: better architectures, a large dataset (ImageNet), and GPU compute
  4. VGG showed that stacks of small 3×3 filters beat large filters; Inception showed that 1×1 convolutions make depth and width affordable
  5. Past roughly 20 layers, naive stacking made networks worse: the degradation problem that set the stage for the next breakthrough

Previous Chapter: Activation Functions — The Gates That Shape Gradients

Next Chapter: The Vanishing Gradient Problem and Its Solutions

Back to Table of Contents