Chapter 6: Regularization Strategies — Taming Overfitting

The Overfitting Problem

Deep neural networks are extremely powerful function approximators — often too powerful. A network with millions of parameters can easily memorize a training set rather than learning generalizable patterns. This is overfitting: low training error but high test error.

Regularization is the collection of techniques that prevent overfitting and improve generalization. These strategies became critical during the deep learning renaissance, enabling larger models trained on limited data.

L2 Regularization (Weight Decay)

The oldest and simplest regularization: add a penalty proportional to the squared magnitude of the weights to the loss function:

\[L_{\text{reg}} = L_{\text{data}} + \frac{\lambda}{2} \sum_{i} w_i^2\]

The gradient update becomes:

\[w \leftarrow w - \eta \frac{\partial L_{\text{data}}}{\partial w} - \eta \lambda w\]

The extra term $-\eta \lambda w$ shrinks weights toward zero at each step, hence “weight decay.”
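The two update rules above can be checked with a toy NumPy experiment (the quadratic loss and all names here are illustrative, not from any particular library):

```python
import numpy as np

# Toy quadratic data loss L_data = 0.5 * ||w - w_star||^2, so dL/dw = w - w_star
w_star = np.array([3.0, -2.0])

def sgd_step(w, grad, lr=0.1, weight_decay=0.0):
    # The extra -lr * weight_decay * w term shrinks the weights toward zero each step
    return w - lr * grad - lr * weight_decay * w

w_plain, w_decay = np.zeros(2), np.zeros(2)
for _ in range(200):
    w_plain = sgd_step(w_plain, w_plain - w_star)
    w_decay = sgd_step(w_decay, w_decay - w_star, weight_decay=0.5)
# w_plain converges to w_star; w_decay converges to w_star / (1 + lambda),
# i.e. the penalty pulls the solution toward the origin
```

The fixed point with decay follows from setting the update to zero: $0 = -(w - w^*) - \lambda w$, so $w = w^*/(1 + \lambda)$.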

Why It Works

Large weights amplify noise and create sharp decision boundaries. L2 regularization encourages smoother functions with smaller weights, which generalize better. It’s equivalent to placing a Gaussian prior on the weights in a Bayesian framework.

Decoupled Weight Decay

In Adam and other adaptive optimizers, the standard L2 penalty interacts poorly with the adaptive learning rate. Loshchilov and Hutter (2017) introduced AdamW, which applies weight decay directly to the weights rather than through the gradient:

\[w \leftarrow (1 - \eta \lambda) w - \eta \cdot \text{Adam\_step}\]

AdamW became the standard optimizer for Transformers and LLMs.
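A NumPy sketch of one decoupled update, with function names and defaults of my own choosing (PyTorch's `torch.optim.AdamW` is the production implementation):

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8, wd=0.1):
    # Standard Adam moment estimates; note that wd is NOT folded into g
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # Decoupled decay: shrink w directly, then apply the Adam step
    w = (1 - lr * wd) * w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, -1.0])
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 101):
    # Zero data gradient: only the decay acts, so w shrinks by (1 - lr*wd) per step
    w, m, v = adamw_step(w, np.zeros(2), m, v, t)
```

With the coupled L2 formulation, the penalty gradient would pass through Adam's per-coordinate rescaling; decoupling keeps the shrinkage rate identical for every weight.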

L1 Regularization

\[L_{\text{reg}} = L_{\text{data}} + \lambda \sum_{i} |w_i|\]

L1 drives weights exactly to zero, producing sparse models. This is useful for feature selection but less common in deep learning, where sparsity is typically achieved through activation functions (ReLU) or pruning.

Dropout (2014)

Dropout, introduced by Srivastava, Hinton, and colleagues (first described in a 2012 preprint, with the full JMLR paper following in 2014), was one of AlexNet's secret weapons and became one of the most important regularization techniques in deep learning.

The Idea

During training, randomly set each neuron’s output to zero with probability $p$ (typically 0.5 for hidden layers):

\[\hat{h}_i = \begin{cases} 0 & \text{with probability } p \\ \frac{h_i}{1-p} & \text{with probability } 1-p \end{cases}\]

The scaling by $\frac{1}{1-p}$ (inverted dropout) ensures that the expected value remains unchanged, so no adjustment is needed at test time.

Why Dropout Works

Multiple complementary explanations:

  1. Ensemble effect: Each training step uses a different random sub-network. The final model is an implicit ensemble of $2^n$ sub-networks (where $n$ is the number of neurons)
  2. Prevents co-adaptation: Neurons cannot rely on specific other neurons being present, forcing each neuron to be independently useful
  3. Noise injection: Adding multiplicative noise acts as a regularizer, similar to data augmentation at the representation level
  4. Implicit weight sharing: Related to Bayesian model averaging over network configurations

Dropout in Practice
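In frameworks like PyTorch this is `nn.Dropout`, active only in training mode (`model.train()` vs `model.eval()`). A minimal NumPy sketch of inverted dropout, matching the formula above:

```python
import numpy as np

def dropout(h, p=0.5, training=True, rng=None):
    # Inverted dropout: zero each unit with prob p, scale survivors by 1/(1-p)
    if not training or p == 0.0:
        return h                      # identity at test time, no rescaling
    rng = rng if rng is not None else np.random.default_rng(0)
    mask = rng.random(h.shape) >= p   # keep each unit with probability 1 - p
    return h * mask / (1.0 - p)

h = np.ones(100_000)
out = dropout(h, p=0.5, rng=np.random.default_rng(42))
# Surviving units are scaled to 2.0, so the mean stays close to 1.0
```

Because the expectation is preserved during training, the test-time forward pass needs no correction, which is exactly why the inverted form is the common implementation choice.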

DropConnect (2013)

Instead of dropping neuron outputs, DropConnect drops individual weights:

\[\hat{W}_{ij} = \begin{cases} 0 & \text{with probability } p \\ W_{ij} & \text{with probability } 1-p \end{cases}\]

This is a more general form of dropout but is rarely used in practice due to higher computational cost.
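A sketch of a DropConnect linear layer in NumPy; the $1/(1-p)$ rescaling is an inverted-dropout-style convention assumed here so the expected pre-activation matches the full layer (the original paper handles inference differently, via a Gaussian approximation):

```python
import numpy as np

def dropconnect_linear(x, W, p=0.5, training=True, rng=None):
    # Drop individual weights (not activations) with probability p,
    # then rescale so the expected pre-activation is unchanged
    if not training:
        return x @ W
    rng = rng if rng is not None else np.random.default_rng(0)
    mask = rng.random(W.shape) >= p
    return x @ (W * mask) / (1.0 - p)

x = np.ones((1, 4))
W = np.ones((4, 3))
rng = np.random.default_rng(1)
avg = np.mean([dropconnect_linear(x, W, rng=rng) for _ in range(2000)])
# avg approximates the deterministic output x @ W (all entries 4.0)
```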

Data Augmentation

Instead of regularizing the model, augment the data. Create new training examples by applying transformations that preserve the label:

Image Augmentation

| Transformation | Description |
|---|---|
| Random crop | Crop a random sub-region of the image |
| Horizontal flip | Mirror the image left-right |
| Color jitter | Randomly adjust brightness, contrast, saturation |
| Rotation | Small random rotations |
| Cutout / Random erasing | Mask out random rectangular regions |
| Mixup | Blend two images and their labels: $\tilde{x} = \lambda x_i + (1-\lambda) x_j$ |
| CutMix | Replace a patch of one image with a patch from another |
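Mixup from the table can be sketched in a few lines of NumPy (labels assumed one-hot; `alpha` controls how aggressive the blending is):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    # Sample the mixing coefficient from Beta(alpha, alpha); the same
    # coefficient blends both the inputs and the (one-hot) labels
    rng = rng if rng is not None else np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1.0 - lam) * x2, lam * y1 + (1.0 - lam) * y2

x_mix, y_mix = mixup(np.zeros(4), np.array([1.0, 0.0]),
                     np.ones(4),  np.array([0.0, 1.0]),
                     rng=np.random.default_rng(3))
# y_mix is a soft label [lam, 1 - lam] that still sums to 1
```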

Why Data Augmentation Is So Effective

It directly addresses the root cause of overfitting: insufficient data diversity. By encoding known invariances (a flipped cat is still a cat), augmentation provides virtually unlimited training data.

AutoAugment (2018)

Google used reinforcement learning to search for the best augmentation policy for each dataset. The discovered policies often included unexpected combinations of transformations that outperformed human-designed augmentation.

RandAugment (2019)

A simpler alternative: randomly apply $N$ transformations, each with magnitude $M$. Just two hyperparameters to tune, yet competitive with AutoAugment.

Early Stopping

Monitor the validation loss during training and stop when it starts increasing:

  1. After each epoch, evaluate on a held-out validation set
  2. Save the model weights whenever validation loss improves
  3. If validation loss hasn’t improved for $k$ epochs (the “patience”), stop training
  4. Restore the best saved weights

Early stopping is simple, effective, and nearly free. It’s one of the most underrated regularization techniques.
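The four steps above map directly onto a small framework-agnostic loop; `train_epoch` and `validate` are placeholder callables the caller supplies:

```python
def train_with_early_stopping(train_epoch, validate, max_epochs=100, patience=5):
    # train_epoch(epoch) returns the model state after one epoch of training;
    # validate(state) returns the validation loss for that state
    best_loss = float("inf")
    best_state = None
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        state = train_epoch(epoch)                 # 1. train one epoch
        val_loss = validate(state)                 # 2. evaluate on validation set
        if val_loss < best_loss:                   # 3. checkpoint on improvement
            best_loss, best_state = val_loss, state
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                              # patience exhausted: stop
    return best_state, best_loss                   # 4. restore the best checkpoint

# Synthetic validation curve: improves, then overfits
val_curve = [5.0, 4.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
best_state, best_loss = train_with_early_stopping(
    train_epoch=lambda e: e, validate=lambda s: val_curve[s])
```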

Connection to L2 Regularization

Bishop (1995) showed that early stopping is approximately equivalent to L2 regularization. Stopping training early constrains the weights to remain near their initialization, similar to the effect of weight decay.

Label Smoothing (2015)

Instead of training with hard targets (one-hot vectors), use “soft” targets:

\[y_{\text{smooth}} = (1 - \epsilon) \cdot y_{\text{hard}} + \frac{\epsilon}{K}\]

For example, with $\epsilon = 0.1$ and $K = 1000$ classes, the target for the correct class is $0.9001$ instead of $1.0$, and each incorrect class gets $0.0001$ instead of $0$.
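The arithmetic checks out in a couple of lines of NumPy (a standalone sketch; in PyTorch, `nn.CrossEntropyLoss(label_smoothing=0.1)` applies the same transformation internally):

```python
import numpy as np

def smooth_labels(y_onehot, eps=0.1):
    # Mix the hard target with the uniform distribution over K classes
    K = y_onehot.shape[-1]
    return (1.0 - eps) * y_onehot + eps / K

y = np.zeros(1000)
y[7] = 1.0                      # hard one-hot target, K = 1000 classes
y_s = smooth_labels(y)          # correct class: 0.9001, others: 0.0001
```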

Why It Works

Hard one-hot targets encourage the model to produce overconfident predictions with ever-larger logits. Label smoothing keeps the optimal gap between the correct-class and incorrect-class logits finite, which discourages overconfidence, improves calibration, and acts as a mild regularizer.

Label smoothing is standard in modern image classification (used in Inception v3, EfficientNet) and in training language models.

Stochastic Depth (2016)

A regularization technique specific to ResNets: during training, randomly drop entire residual blocks, making the network effectively shallower:

\[H_l = \begin{cases} f_l(H_{l-1}) + H_{l-1} & \text{with probability } p_l \\ H_{l-1} & \text{with probability } 1 - p_l \end{cases}\]

The survival probability $p_l$ decreases linearly with depth (early layers are dropped less frequently). This reduces training time by ~25% and improves generalization by acting as an implicit ensemble over networks of different depths.
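A NumPy sketch of one residual block with stochastic depth (function names are mine; the test-time scaling of the residual branch by $p_l$ follows the paper's expectation rule):

```python
import numpy as np

def stochastic_depth_block(h, f, survival_p, training=True, rng=None):
    if not training:
        # Test time: scale the residual branch by its survival probability
        return h + survival_p * f(h)
    rng = rng if rng is not None else np.random.default_rng(0)
    if rng.random() < survival_p:
        return h + f(h)    # block survives: full residual computation
    return h               # block dropped: identity shortcut only

def linear_survival(l, L, p_L=0.5):
    # Linear decay from p_0 = 1 (first block) to p_L at the deepest block
    return 1.0 - (l / L) * (1.0 - p_L)
```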

Spectral Normalization (2018)

Constrains the spectral norm (largest singular value) of weight matrices to be at most 1:

\[\bar{W} = \frac{W}{\sigma(W)}\]

where $\sigma(W)$ is the largest singular value. This is especially important for training GANs (Chapter 9), where it stabilizes the discriminator.
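A sketch of spectral normalization via power iteration in NumPy (PyTorch provides this as `torch.nn.utils.spectral_norm`; the standalone version below is simplified for illustration):

```python
import numpy as np

def spectral_normalize(W, n_iter=100, rng=None):
    # Power iteration approximates the leading singular pair (u, v) of W
    rng = rng if rng is not None else np.random.default_rng(0)
    u = rng.standard_normal(W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v) + 1e-12
        u = W @ v
        u /= np.linalg.norm(u) + 1e-12
    sigma = u @ W @ v              # estimate of the largest singular value
    return W / sigma

W = np.random.default_rng(7).standard_normal((5, 3))
W_bar = spectral_normalize(W)
# Largest singular value of W_bar is now approximately 1
```

In practice a single power-iteration step per training update suffices, because the weights change slowly between updates.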

Modern Regularization: What’s Used Today

| Technique | Where Used | Still Common? |
|---|---|---|
| Weight decay (AdamW) | Everywhere | ✅ Yes |
| Dropout | MLPs, some attention | ⚠️ Less in Transformers |
| Data augmentation | Vision | ✅ Yes, critical |
| Label smoothing | Classification | ✅ Yes |
| Early stopping | All | ✅ Yes |
| Stochastic depth | Deep ResNets | ✅ Yes (ViT, ConvNeXt) |
| Gradient noise | Training | ⚠️ Sometimes |
| Mixup / CutMix | Vision | ✅ Yes |

The modern trend is toward implicit regularization: massive datasets, large-scale pretraining, and the noise inherent in stochastic gradient descent often provide much of the regularization that explicit penalties once did.

Code Example: Dropout and Data Augmentation

```python
# See code/ch06_regularization.py for full experiment
import torch.nn as nn
from torchvision import transforms

# Data augmentation pipeline (ImageNet-style)
augmentation = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],   # ImageNet channel means
                         [0.229, 0.224, 0.225]),  # ImageNet channel stds
])

# Model with dropout after each hidden layer
model = nn.Sequential(
    nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(128, 10),
)
```

Key Takeaways

  1. Overfitting is the gap between training and test error; regularization is any technique that narrows it.
  2. Weight decay, in its decoupled AdamW form, remains the default explicit regularizer for modern networks.
  3. Dropout regularizes by training an implicit ensemble of sub-networks and preventing co-adaptation.
  4. Data augmentation attacks the root cause (limited data diversity) and is essential in vision.
  5. Early stopping and label smoothing are cheap, effective, and nearly universal.

Previous Chapter: Recurrent Networks and Gating Mechanisms

Next Chapter: Residual Networks and Skip Connections

Back to Table of Contents