Chapter 6: Regularization Strategies — Taming Overfitting

The Overfitting Problem

Deep neural networks are extremely powerful function approximators — often too powerful. A network with millions of parameters can easily memorize a training set rather than learning generalizable patterns. This is overfitting: low training error but high test error.

Regularization is the collection of techniques that prevent overfitting and improve generalization. These strategies became critical during the deep learning renaissance, enabling larger models trained on limited data.

L2 Regularization (Weight Decay)

The oldest and simplest regularization: add a penalty proportional to the squared magnitude of the weights to the loss function:

\[L_{\text{reg}} = L_{\text{data}} + \frac{\lambda}{2} \sum_{i} w_i^2\]

The gradient update becomes:

\[w \leftarrow w - \eta \frac{\partial L_{\text{data}}}{\partial w} - \eta \lambda w\]

The extra term $-\eta \lambda w$ shrinks weights toward zero at each step, hence “weight decay.”
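The two update rules above can be checked with a toy NumPy experiment (the quadratic loss and all names here are illustrative, not from any particular library):

```python
import numpy as np

# Toy quadratic data loss L_data = 0.5 * ||w - w_star||^2, so dL/dw = w - w_star
w_star = np.array([3.0, -2.0])

def sgd_step(w, grad, lr=0.1, weight_decay=0.0):
    # The extra -lr * weight_decay * w term shrinks the weights toward zero each step
    return w - lr * grad - lr * weight_decay * w

w_plain, w_decay = np.zeros(2), np.zeros(2)
for _ in range(200):
    w_plain = sgd_step(w_plain, w_plain - w_star)
    w_decay = sgd_step(w_decay, w_decay - w_star, weight_decay=0.5)
# w_plain converges to w_star; w_decay converges to w_star / (1 + lambda),
# i.e. the penalty pulls the solution toward the origin
```

The fixed point with decay follows from setting the update to zero: $0 = -(w - w^*) - \lambda w$, so $w = w^*/(1 + \lambda)$.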

Why It Works

Large weights amplify noise and create sharp decision boundaries. L2 regularization encourages smoother functions with smaller weights, which generalize better. It’s equivalent to placing a Gaussian prior on the weights in a Bayesian framework.

Decoupled Weight Decay

In Adam and other adaptive optimizers, the standard L2 penalty interacts poorly with the adaptive learning rate. Loshchilov and Hutter (2017) introduced AdamW, which applies weight decay directly to the weights rather than through the gradient:

\[w \leftarrow (1 - \eta \lambda) w - \eta \cdot \text{Adam\_step}\]

AdamW became the standard optimizer for Transformers and LLMs.
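A NumPy sketch of one decoupled update, with function names and defaults of my own choosing (PyTorch's `torch.optim.AdamW` is the production implementation):

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8, wd=0.1):
    # Standard Adam moment estimates; note that wd is NOT folded into g
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # Decoupled decay: shrink w directly, then apply the Adam step
    w = (1 - lr * wd) * w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, -1.0])
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 101):
    # Zero data gradient: only the decay acts, so w shrinks by (1 - lr*wd) per step
    w, m, v = adamw_step(w, np.zeros(2), m, v, t)
```

With the coupled L2 formulation, the penalty gradient would pass through Adam's per-coordinate rescaling; decoupling keeps the shrinkage rate identical for every weight.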

L1 Regularization

\[L_{\text{reg}} = L_{\text{data}} + \lambda \sum_{i} |w_i|\]

L1 drives weights exactly to zero, producing sparse models. This is useful for feature selection but less common in deep learning, where sparsity is typically achieved through activation functions (ReLU) or pruning.

Dropout (2014)

Dropout, introduced by Srivastava, Hinton, and colleagues (first described in a 2012 preprint, with the full JMLR paper following in 2014), was one of AlexNet's secret weapons and became one of the most important regularization techniques in deep learning.

The Idea

During training, randomly set each neuron’s output to zero with probability $p$ (typically 0.5 for hidden layers):

\[\hat{h}_i = \begin{cases} 0 & \text{with probability } p \\ \frac{h_i}{1-p} & \text{with probability } 1-p \end{cases}\]

The scaling by $\frac{1}{1-p}$ (inverted dropout) ensures that the expected value remains unchanged, so no adjustment is needed at test time.

Why Dropout Works

Multiple complementary explanations:

  1. Ensemble effect: Each training step uses a different random sub-network. The final model is an implicit ensemble of $2^n$ sub-networks (where $n$ is the number of neurons)
  2. Prevents co-adaptation: Neurons cannot rely on specific other neurons being present, forcing each neuron to be independently useful
  3. Noise injection: Adding multiplicative noise acts as a regularizer, similar to data augmentation at the representation level
  4. Implicit weight sharing: Related to Bayesian model averaging over network configurations

Dropout in Practice
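In frameworks like PyTorch this is `nn.Dropout`, active only in training mode (`model.train()` vs `model.eval()`). A minimal NumPy sketch of inverted dropout, matching the formula above:

```python
import numpy as np

def dropout(h, p=0.5, training=True, rng=None):
    # Inverted dropout: zero each unit with prob p, scale survivors by 1/(1-p)
    if not training or p == 0.0:
        return h                      # identity at test time, no rescaling
    rng = rng if rng is not None else np.random.default_rng(0)
    mask = rng.random(h.shape) >= p   # keep each unit with probability 1 - p
    return h * mask / (1.0 - p)

h = np.ones(100_000)
out = dropout(h, p=0.5, rng=np.random.default_rng(42))
# Surviving units are scaled to 2.0, so the mean stays close to 1.0
```

Because the expectation is preserved during training, the test-time forward pass needs no correction, which is exactly why the inverted form is the common implementation choice.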

DropConnect (2013)

Instead of dropping neuron outputs, DropConnect drops individual weights:

\[\hat{W}_{ij} = \begin{cases} 0 & \text{with probability } p \\ W_{ij} & \text{with probability } 1-p \end{cases}\]

This is a more general form of dropout but is rarely used in practice due to higher computational cost.
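A sketch of a DropConnect linear layer in NumPy; the $1/(1-p)$ rescaling is an inverted-dropout-style convention assumed here so the expected pre-activation matches the full layer (the original paper handles inference differently, via a Gaussian approximation):

```python
import numpy as np

def dropconnect_linear(x, W, p=0.5, training=True, rng=None):
    # Drop individual weights (not activations) with probability p,
    # then rescale so the expected pre-activation is unchanged
    if not training:
        return x @ W
    rng = rng if rng is not None else np.random.default_rng(0)
    mask = rng.random(W.shape) >= p
    return x @ (W * mask) / (1.0 - p)

x = np.ones((1, 4))
W = np.ones((4, 3))
rng = np.random.default_rng(1)
avg = np.mean([dropconnect_linear(x, W, rng=rng) for _ in range(2000)])
# avg approximates the deterministic output x @ W (all entries 4.0)
```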

Data Augmentation

Instead of regularizing the model, augment the data. Create new training examples by applying transformations that preserve the label:

Image Augmentation

| Transformation | Description |
|---|---|
| Random crop | Crop a random sub-region of the image |
| Horizontal flip | Mirror the image left-right |
| Color jitter | Randomly adjust brightness, contrast, saturation |
| Rotation | Small random rotations |
| Cutout / Random erasing | Mask out random rectangular regions |
| Mixup | Blend two images and their labels: $\tilde{x} = \lambda x_i + (1-\lambda) x_j$ |
| CutMix | Replace a patch of one image with a patch from another |
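Mixup from the table can be sketched in a few lines of NumPy (labels assumed one-hot; `alpha` controls how aggressive the blending is):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    # Sample the mixing coefficient from Beta(alpha, alpha); the same
    # coefficient blends both the inputs and the (one-hot) labels
    rng = rng if rng is not None else np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1.0 - lam) * x2, lam * y1 + (1.0 - lam) * y2

x_mix, y_mix = mixup(np.zeros(4), np.array([1.0, 0.0]),
                     np.ones(4),  np.array([0.0, 1.0]),
                     rng=np.random.default_rng(3))
# y_mix is a soft label [lam, 1 - lam] that still sums to 1
```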

Why Data Augmentation Is So Effective

It directly addresses the root cause of overfitting: insufficient data diversity. By encoding known invariances (a flipped cat is still a cat), augmentation provides virtually unlimited training data.

AutoAugment (2018)

Google used reinforcement learning to search for the best augmentation policy for each dataset. The discovered policies often included unexpected combinations of transformations that outperformed human-designed augmentation.

RandAugment (2019)

A simpler alternative: randomly apply $N$ transformations, each with magnitude $M$. Just two hyperparameters to tune, yet competitive with AutoAugment.

Early Stopping

Monitor the validation loss during training and stop when it starts increasing:

  1. After each epoch, evaluate on a held-out validation set
  2. Save the model weights whenever validation loss improves
  3. If validation loss hasn’t improved for $k$ epochs (the “patience”), stop training
  4. Restore the best saved weights

Early stopping is simple, effective, and nearly free. It’s one of the most underrated regularization techniques.
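The four steps above map directly onto a small framework-agnostic loop; `train_epoch` and `validate` are placeholder callables the caller supplies:

```python
def train_with_early_stopping(train_epoch, validate, max_epochs=100, patience=5):
    # train_epoch(epoch) returns the model state after one epoch of training;
    # validate(state) returns the validation loss for that state
    best_loss = float("inf")
    best_state = None
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        state = train_epoch(epoch)                 # 1. train one epoch
        val_loss = validate(state)                 # 2. evaluate on validation set
        if val_loss < best_loss:                   # 3. checkpoint on improvement
            best_loss, best_state = val_loss, state
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                              # patience exhausted: stop
    return best_state, best_loss                   # 4. restore the best checkpoint

# Synthetic validation curve: improves, then overfits
val_curve = [5.0, 4.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
best_state, best_loss = train_with_early_stopping(
    train_epoch=lambda e: e, validate=lambda s: val_curve[s])
```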

Connection to L2 Regularization

Bishop (1995) showed that early stopping is approximately equivalent to L2 regularization. Stopping training early constrains the weights to remain near their initialization, similar to the effect of weight decay.

Label Smoothing (2015)

Instead of training with hard targets (one-hot vectors), use “soft” targets:

\[y_{\text{smooth}} = (1 - \epsilon) \cdot y_{\text{hard}} + \frac{\epsilon}{K}\]

For example, with $\epsilon = 0.1$ and $K = 1000$ classes, the target for the correct class is $0.9001$ instead of $1.0$, and each incorrect class gets $0.0001$ instead of $0$.
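The arithmetic checks out in a couple of lines of NumPy (a standalone sketch; in PyTorch, `nn.CrossEntropyLoss(label_smoothing=0.1)` applies the same transformation internally):

```python
import numpy as np

def smooth_labels(y_onehot, eps=0.1):
    # Mix the hard target with the uniform distribution over K classes
    K = y_onehot.shape[-1]
    return (1.0 - eps) * y_onehot + eps / K

y = np.zeros(1000)
y[7] = 1.0                      # hard one-hot target, K = 1000 classes
y_s = smooth_labels(y)          # correct class: 0.9001, others: 0.0001
```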

Why It Works

Hard one-hot targets encourage the model to produce overconfident predictions with ever-larger logits. Label smoothing keeps the optimal gap between the correct-class and incorrect-class logits finite, which discourages overconfidence, improves calibration, and acts as a mild regularizer.

Label smoothing is standard in modern image classification (used in Inception v3, EfficientNet) and in training language models.

Stochastic Depth (2016)

A regularization technique specific to ResNets: during training, randomly drop entire residual blocks, making the network effectively shallower:

\[H_l = \begin{cases} f_l(H_{l-1}) + H_{l-1} & \text{with probability } p_l \\ H_{l-1} & \text{with probability } 1 - p_l \end{cases}\]

The survival probability $p_l$ decreases linearly with depth (early layers are dropped less frequently). This reduces training time by ~25% and improves generalization by acting as an implicit ensemble over networks of different depths.
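A NumPy sketch of one residual block with stochastic depth (function names are mine; the test-time scaling of the residual branch by $p_l$ follows the paper's expectation rule):

```python
import numpy as np

def stochastic_depth_block(h, f, survival_p, training=True, rng=None):
    if not training:
        # Test time: scale the residual branch by its survival probability
        return h + survival_p * f(h)
    rng = rng if rng is not None else np.random.default_rng(0)
    if rng.random() < survival_p:
        return h + f(h)    # block survives: full residual computation
    return h               # block dropped: identity shortcut only

def linear_survival(l, L, p_L=0.5):
    # Linear decay from p_0 = 1 (first block) to p_L at the deepest block
    return 1.0 - (l / L) * (1.0 - p_L)
```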

Spectral Normalization (2018)

Constrains the spectral norm (largest singular value) of weight matrices to be at most 1:

\[\bar{W} = \frac{W}{\sigma(W)}\]

where $\sigma(W)$ is the largest singular value. This is especially important for training GANs (Chapter 9), where it stabilizes the discriminator.
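A sketch of spectral normalization via power iteration in NumPy (PyTorch provides this as `torch.nn.utils.spectral_norm`; the standalone version below is simplified for illustration):

```python
import numpy as np

def spectral_normalize(W, n_iter=100, rng=None):
    # Power iteration approximates the leading singular pair (u, v) of W
    rng = rng if rng is not None else np.random.default_rng(0)
    u = rng.standard_normal(W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v) + 1e-12
        u = W @ v
        u /= np.linalg.norm(u) + 1e-12
    sigma = u @ W @ v              # estimate of the largest singular value
    return W / sigma

W = np.random.default_rng(7).standard_normal((5, 3))
W_bar = spectral_normalize(W)
# Largest singular value of W_bar is now approximately 1
```

In practice a single power-iteration step per training update suffices, because the weights change slowly between updates.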

Modern Regularization: What’s Used Today

| Technique | Where Used | Still Common? |
|---|---|---|
| Weight decay (AdamW) | Everywhere | ✅ Yes |
| Dropout | MLPs, some attention | ⚠️ Less in Transformers |
| Data augmentation | Vision | ✅ Yes, critical |
| Label smoothing | Classification | ✅ Yes |
| Early stopping | All | ✅ Yes |
| Stochastic depth | Deep ResNets | ✅ Yes (ViT, ConvNeXt) |
| Gradient noise | Training | ⚠️ Sometimes |
| Mixup / CutMix | Vision | ✅ Yes |

The modern trend is toward implicit regularization: massive datasets, large-scale pretraining, and the noise inherent in stochastic gradient descent often provide much of the regularization that explicit penalties once did.

Code Example: Dropout and Data Augmentation

```python
# See code/ch06_regularization.py for full experiment
import torch.nn as nn
from torchvision import transforms

# Data augmentation pipeline (ImageNet-style)
augmentation = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],   # ImageNet channel means
                         [0.229, 0.224, 0.225]),  # ImageNet channel stds
])

# Model with dropout after each hidden layer
model = nn.Sequential(
    nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(128, 10),
)
```

Key Takeaways

  1. Overfitting is the gap between training and test error; regularization is any technique that narrows it.
  2. Weight decay, in its decoupled AdamW form, remains the default explicit regularizer for modern networks.
  3. Dropout regularizes by training an implicit ensemble of sub-networks and preventing co-adaptation.
  4. Data augmentation attacks the root cause (limited data diversity) and is essential in vision.
  5. Early stopping and label smoothing are cheap, effective, and nearly universal.

Previous Chapter: Recurrent Networks and Gating Mechanisms

Next Chapter: Residual Networks and Skip Connections

Back to Table of Contents