Deep neural networks are extremely powerful function approximators — often too powerful. A network with millions of parameters can easily memorize a training set rather than learning generalizable patterns. This is overfitting: low training error but high test error.
Regularization is the collection of techniques that prevent overfitting and improve generalization. These strategies became critical during the deep learning renaissance, enabling larger models trained on limited data.
The oldest and simplest regularization: add a penalty proportional to the squared magnitude of the weights to the loss function:
\[L_{\text{reg}} = L_{\text{data}} + \frac{\lambda}{2} \sum_{i} w_i^2\]The gradient update becomes:
\[w \leftarrow w - \eta \frac{\partial L_{\text{data}}}{\partial w} - \eta \lambda w\]The extra term $-\eta \lambda w$ shrinks weights toward zero at each step, hence “weight decay.”
Large weights amplify noise and create sharp decision boundaries. L2 regularization encourages smoother functions with smaller weights, which generalize better. It’s equivalent to placing a Gaussian prior on the weights in a Bayesian framework.
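The update rule above is short enough to write out directly. A minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad, lr=0.1, lam=0.01):
    """One SGD step with an L2 penalty: w <- w - lr*grad - lr*lam*w.
    The last term shrinks every weight toward zero each step."""
    return w - lr * grad - lr * lam * w

w = np.array([1.0, -2.0])
# With a zero data gradient, the weights simply decay by a factor (1 - lr*lam)
w_new = sgd_step_with_weight_decay(w, grad=np.zeros(2))
```

In framework code you rarely write this by hand; optimizers such as `torch.optim.SGD` expose it as a `weight_decay` argument.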
In Adam and other adaptive optimizers, the standard L2 penalty interacts poorly with the adaptive learning rate. Loshchilov and Hutter (2017) introduced AdamW, which applies weight decay directly to the weights rather than through the gradient:
\[w \leftarrow (1 - \eta \lambda) w - \eta \cdot \text{Adam\_step}\]AdamW became the standard optimizer for Transformers and LLMs.
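In PyTorch, switching to decoupled weight decay is a one-line change:

```python
import torch

model = torch.nn.Linear(10, 2)
# AdamW applies the decay term directly to the weights rather than
# folding it into the gradient that Adam rescales adaptively
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```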
L1 drives weights exactly to zero, producing sparse models. This is useful for feature selection but less common in deep learning, where sparsity is typically achieved through activation functions (ReLU) or pruning.
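Because built-in optimizer weight decay is L2-style, an L1 penalty is usually added to the loss by hand. A minimal sketch:

```python
import torch

model = torch.nn.Linear(10, 2)
x = torch.randn(4, 10)
data_loss = model(x).pow(2).mean()
# L1 penalty: sum of absolute values of all parameters
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = data_loss + 1e-4 * l1_penalty
```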
Dropout, introduced by Srivastava, Hinton, and others, was one of AlexNet’s secret weapons and became one of the most important regularization techniques in deep learning.
During training, randomly set each neuron’s output to zero with probability $p$ (typically 0.5 for hidden layers):
\[\hat{h}_i = \begin{cases} 0 & \text{with probability } p \\ \frac{h_i}{1-p} & \text{with probability } 1-p \end{cases}\]The scaling by $\frac{1}{1-p}$ (inverted dropout) ensures that the expected value remains unchanged, so no adjustment is needed at test time.
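Inverted dropout is a few lines of code. A minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def inverted_dropout(h, p=0.5, training=True, rng=None):
    """Zero each activation with probability p and scale survivors by
    1/(1-p), so E[output] = h. At test time the layer is the identity."""
    if not training:
        return h
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(h.shape) >= p   # keep each unit with probability 1-p
    return h * mask / (1.0 - p)
```

The expectation-preserving scaling is what lets the same network run unmodified at test time.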
Why does dropout work? There are multiple complementary explanations:

- Preventing co-adaptation: a neuron cannot rely on specific other neurons being present, so each must learn features that are useful on their own.
- Implicit ensembling: every mini-batch trains a different random sub-network; the full network at test time approximates an average over exponentially many of them.
- Noise injection: multiplicative noise on the activations acts as a regularizer, much like adding noise to inputs or weights.
Instead of dropping neuron outputs, DropConnect drops individual weights:
\[\hat{W}_{ij} = \begin{cases} 0 & \text{with probability } p \\ W_{ij} & \text{with probability } 1-p \end{cases}\]This is a more general form of dropout but is rarely used in practice due to higher computational cost.
Instead of regularizing the model, augment the data. Create new training examples by applying transformations that preserve the label:
| Transformation | Description |
|---|---|
| Random crop | Crop a random sub-region of the image |
| Horizontal flip | Mirror the image left-right |
| Color jitter | Randomly adjust brightness, contrast, saturation |
| Rotation | Small random rotations |
| Cutout / Random erasing | Mask out random rectangular regions |
| Mixup | Blend two images and their labels: $\tilde{x} = \lambda x_i + (1-\lambda) x_j$ |
| CutMix | Replace a patch of one image with a patch from another |
It directly addresses the root cause of overfitting: insufficient data diversity. By encoding known invariances (a flipped cat is still a cat), augmentation provides virtually unlimited training data.
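Of the augmentations listed above, mixup is the easiest to show in code, since it is pure arithmetic on examples and labels. A minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Blend two examples and their one-hot labels with the same weight.
    lambda ~ Beta(alpha, alpha); small alpha keeps most blends near 0 or 1."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1 - lam) * x2
    y = lam * y1 + (1 - lam) * y2
    return x, y
```

Training on blended labels forces the model to behave linearly between examples, which smooths its decision boundaries.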
Google used reinforcement learning to search for the best augmentation policy for each dataset. The discovered policies often included unexpected combinations of transformations that outperformed human-designed augmentation.
A simpler alternative: randomly apply $N$ transformations, each with magnitude $M$. Just two hyperparameters to tune, yet competitive with AutoAugment.
Monitor the validation loss during training and stop when it starts increasing.
Early stopping is simple, effective, and nearly free. It’s one of the most underrated regularization techniques.
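In practice it is a few lines of bookkeeping around the training loop. A minimal sketch (function and parameter names are ours):

```python
def train_with_early_stopping(train_step, val_loss, max_epochs=100, patience=5):
    """Run training epochs while tracking the best validation loss; stop once
    `patience` epochs pass without improvement. Returns the best epoch."""
    best_loss, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_step(epoch)
        loss = val_loss(epoch)
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            # in practice: also checkpoint the model weights here
        elif epoch - best_epoch >= patience:
            break  # validation loss has stopped improving
    return best_epoch
```

The `patience` window guards against stopping on a noisy single-epoch uptick in validation loss.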
Bishop (1995) showed that early stopping is approximately equivalent to L2 regularization. Stopping training early constrains the weights to remain near their initialization, similar to the effect of weight decay.
Instead of training with hard targets (one-hot vectors), use “soft” targets:
\[y_{\text{smooth}} = (1 - \epsilon) \cdot y_{\text{hard}} + \frac{\epsilon}{K}\]For example, with $\epsilon = 0.1$ and $K = 1000$ classes, the target for the correct class is $0.9001$ instead of $1.0$, and each incorrect class gets $0.0001$ instead of $0$.
Hard one-hot targets encourage the model to produce overconfident predictions with very large logits. Label smoothing:

- caps the target confidence, so the logits are not pushed toward infinity;
- improves calibration, making predicted probabilities track actual accuracy more closely;
- typically yields a small but consistent gain in test accuracy.
Label smoothing is standard in modern image classification (used in Inception v3, EfficientNet) and in training language models.
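In PyTorch (1.10 and later), label smoothing is built into the cross-entropy loss as a single argument:

```python
import torch

# Soft targets with epsilon = 0.1, computed inside the loss function
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
logits = torch.randn(8, 1000)
targets = torch.randint(0, 1000, (8,))
loss = criterion(logits, targets)
```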
A regularization technique for networks with residual connections: during training, randomly drop entire residual blocks, making the network effectively shallower:
\[H_l = \begin{cases} f_l(H_{l-1}) + H_{l-1} & \text{with probability } p_l \\ H_{l-1} & \text{with probability } 1 - p_l \end{cases}\]The survival probability $p_l$ decreases linearly with depth (early layers are dropped less frequently). This reduces training time by ~25% and improves generalization by acting as an implicit ensemble over networks of different depths.
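The per-block rule above can be sketched as follows (a minimal illustration; the function name is ours). At test time the residual branch is always applied but scaled by its survival probability, matching the training-time expectation:

```python
import torch

def stochastic_depth_block(f, h, survival_p, training=True):
    """Residual block with stochastic depth: keep the branch f with
    probability survival_p during training; scale it by survival_p at test."""
    if not training:
        return h + survival_p * f(h)
    if torch.rand(()).item() < survival_p:
        return h + f(h)
    return h  # block skipped: pure identity shortcut
```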
Constrains the spectral norm (largest singular value) of weight matrices to be at most 1:
\[\bar{W} = \frac{W}{\sigma(W)}\]where $\sigma(W)$ is the largest singular value. This is especially important for training GANs (Chapter 9), where it stabilizes the discriminator.
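PyTorch provides this as a wrapper that renormalizes the weight on every forward pass, estimating the largest singular value by power iteration:

```python
import torch

# Wrap a layer (e.g. in a GAN discriminator) so its effective weight
# is W / sigma(W), keeping its spectral norm at (approximately) 1
disc_layer = torch.nn.utils.spectral_norm(torch.nn.Linear(64, 64))
x = torch.randn(8, 64)
y = disc_layer(x)
```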
| Technique | Where Used | Still Common? |
|---|---|---|
| Weight decay (AdamW) | Everywhere | ✅ Yes |
| Dropout | MLPs, some attention | ⚠️ Less in Transformers |
| Data augmentation | Vision | ✅ Yes, critical |
| Label smoothing | Classification | ✅ Yes |
| Early stopping | All | ✅ Yes |
| Stochastic depth | Deep ResNets | ✅ Yes (ViT, ConvNeXt) |
| Gradient noise | Training | ⚠️ Sometimes |
| Mixup / CutMix | Vision | ✅ Yes |
The modern trend is toward implicit regularization through:

- massive datasets and large-scale pretraining, which make memorization far harder;
- the noise of stochastic mini-batch gradients, which biases optimization toward flatter, better-generalizing minima;
- normalization layers and architectural choices that stabilize training without an explicit penalty term.
```python
# See code/ch06_regularization.py for full experiment
import torch.nn as nn
from torchvision import transforms

# Data augmentation pipeline
augmentation = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225]),
])

# Model with dropout
model = nn.Sequential(
    nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(128, 10),
)
```