If deeper networks are more powerful, why can’t we just keep adding layers? The answer plagued neural network research for over two decades: the vanishing gradient problem.
During backpropagation, gradients must travel from the loss function all the way back to the earliest layers. At each layer, the gradient is multiplied by the local derivative of the activation function and the layer’s weights. In a deep network, these multiplications compound — and depending on the magnitudes, the gradient either vanishes (shrinks to near-zero) or explodes (grows to infinity).
Consider a simple $L$-layer network where each layer computes $h^{(l)} = \sigma(W^{(l)} h^{(l-1)})$. The gradient of the loss with respect to the parameters of layer $l$ involves the product:
\[\frac{\partial L}{\partial W^{(l)}} \propto \prod_{k=l+1}^{L} \frac{\partial h^{(k)}}{\partial h^{(k-1)}} = \prod_{k=l+1}^{L} \text{diag}(\sigma'(z^{(k)})) \cdot W^{(k)}\]

where $z^{(k)} = W^{(k)} h^{(k-1)}$ is the pre-activation. This is a product of $L - l$ matrices. If the spectral norm of each factor is less than 1, the product shrinks exponentially with depth. If greater than 1, it grows exponentially.
With sigmoid activations, whose derivative never exceeds $\sigma'_{\max} = 0.25$, the gradient magnitude scales roughly as:

\[\|\text{gradient}\| \sim 0.25^L \cdot \|W\|^L\]

For $L = 20$ layers with $\|W\| \approx 1$: gradients in the earliest layers are $\sim 10^{-12}$ times smaller than in the later layers. The first layers effectively stop learning.
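These magnitudes are easy to verify numerically. A quick sketch (the `gradient_scale` helper below is illustrative, not from any library):

```python
import math

# Rough scale of the gradient reaching layer 1 of an L-layer sigmoid
# network: each layer contributes at most sigma'_max = 0.25 times the
# weight norm ||W|| (taken as ~1 here for small random weights).
def gradient_scale(L, sigma_prime_max=0.25, weight_norm=1.0):
    return (sigma_prime_max * weight_norm) ** L

for L in (5, 10, 20):
    print(f"L={L:2d}: gradient scale ~ {gradient_scale(L):.1e}")
# At L = 20 the scale is 0.25**20, roughly 9e-13, i.e. about 10^-12.
```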
Conversely, if the weights are too large, gradients grow exponentially instead: the exploding gradient problem. This manifests as:

- Loss values that spike suddenly or become NaN
- Weight updates so large they wipe out previously learned features
- Numerical overflow during training
Hochreiter first analyzed this problem rigorously in his 1991 diploma thesis, and it was one of the key motivations for LSTM networks (Chapter 5).
Xavier Glorot and Yoshua Bengio analyzed the variance of activations and gradients through a network and showed that to keep them stable, weights should be initialized from:
\[W^{(l)} \sim \mathcal{N}\left(0, \frac{2}{n_{in} + n_{out}}\right)\]

or uniformly from $\left[-\sqrt{\frac{6}{n_{in} + n_{out}}}, \sqrt{\frac{6}{n_{in} + n_{out}}}\right]$,
where $n_{in}$ and $n_{out}$ are the number of input and output units. The key insight: the variance should depend on the fan-in and fan-out of each layer.
This was derived under the assumption of linear activations or tanh (which is approximately linear near zero).
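As a minimal sketch of what this scheme computes, here is a hand-rolled version for a PyTorch linear layer (PyTorch ships the same scheme built in as `nn.init.xavier_normal_` and `nn.init.xavier_uniform_`):

```python
import torch
import torch.nn as nn

# Hand-rolled Glorot/Xavier (normal) initialization, matching the
# variance 2 / (n_in + n_out) from the text.
def xavier_normal_(weight: torch.Tensor) -> torch.Tensor:
    n_out, n_in = weight.shape  # nn.Linear stores weights as (out, in)
    std = (2.0 / (n_in + n_out)) ** 0.5
    with torch.no_grad():
        return weight.normal_(0.0, std)

torch.manual_seed(0)
layer = nn.Linear(256, 128)
xavier_normal_(layer.weight)
# Empirical std is close to sqrt(2 / (256 + 128)), about 0.072
print(layer.weight.std().item())
```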
When using ReLU, Xavier initialization is suboptimal because ReLU zeros out half the outputs, effectively halving the variance. He et al. derived the correct initialization for ReLU networks:
\[W^{(l)} \sim \mathcal{N}\left(0, \frac{2}{n_{in}}\right)\]

This simple change enabled training much deeper networks (up to ~30 layers without residual connections).
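The same idea as a sketch in PyTorch (the hand-rolled `he_normal_` helper mirrors the built-in `nn.init.kaiming_normal_`):

```python
import torch
import torch.nn as nn

# He/Kaiming normal initialization: variance 2 / n_in, compensating
# for ReLU zeroing out half of the activations.
def he_normal_(weight: torch.Tensor) -> torch.Tensor:
    _, n_in = weight.shape
    std = (2.0 / n_in) ** 0.5
    with torch.no_grad():
        return weight.normal_(0.0, std)

torch.manual_seed(0)
layer = nn.Linear(256, 256)
he_normal_(layer.weight)
# Empirical std close to sqrt(2 / 256), about 0.088. Built-in version:
# nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
print(layer.weight.std().item())
```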
Sergey Ioffe and Christian Szegedy introduced Batch Normalization (BatchNorm), one of the most impactful techniques in deep learning. Their motivation was “internal covariate shift” — the distribution of each layer’s inputs changes during training as the parameters of the preceding layers change.
For a mini-batch $B = \{x_1, \ldots, x_m\}$, BatchNorm performs three steps at each layer:

1. Compute batch statistics: \(\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i, \quad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2\)
2. Normalize: \(\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}\)
3. Scale and shift (with learnable parameters $\gamma$ and $\beta$): \(y_i = \gamma \hat{x}_i + \beta\)
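The three steps can be checked against PyTorch's built-in layer. A minimal sketch, in training mode so that batch statistics are used:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(32, 8)  # mini-batch of m = 32 samples, 8 features

# The three steps from the text, computed per feature:
eps = 1e-5
mu = x.mean(dim=0)                  # batch mean
var = x.var(dim=0, unbiased=False)  # batch variance
x_hat = (x - mu) / torch.sqrt(var + eps)
gamma, beta = torch.ones(8), torch.zeros(8)  # learnable; identity at init
y = gamma * x_hat + beta

# Matches PyTorch's built-in layer in training mode:
bn = nn.BatchNorm1d(8)
assert torch.allclose(y, bn(x), atol=1e-5)
```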
The original “internal covariate shift” explanation has been debated. More recent analysis (notably Santurkar et al., 2018) suggests BatchNorm works because:

- It smooths the optimization landscape, making gradients more predictable
- It permits much higher learning rates without divergence
- The noise from mini-batch statistics acts as a mild regularizer
BatchNorm has limitations:

- It depends on the batch dimension, so it degrades with small batch sizes
- It behaves differently at training (batch statistics) and inference (running averages), a common source of bugs
- It is awkward for variable-length sequences and recurrent networks
- It couples the samples within a batch, which complicates distributed training
Layer Normalization (Ba, Kiros, and Hinton, 2016) normalizes across the feature dimension instead of the batch dimension:
\[\hat{x}_i = \frac{x_i - \mu_{\text{layer}}}{\sqrt{\sigma_{\text{layer}}^2 + \epsilon}}\]

where statistics are computed over all features of a single sample. This is:

- Independent of batch size (it works even with a batch of one)
- Identical at training and inference time
- A natural fit for sequence models such as RNNs and Transformers
| Normalization | Normalizes Over | Best For |
|---|---|---|
| Batch Norm | Batch dimension | CNNs |
| Layer Norm | Feature dimension | Transformers, RNNs |
| Instance Norm | Spatial dimensions per channel | Style transfer |
| Group Norm | Groups of channels | Small-batch training |
| RMS Norm | Feature dimension (no mean centering) | LLMs (LLaMA) |
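As a sketch of the Layer Norm row above, the same per-sample statistics computed by hand match `nn.LayerNorm`, and the result is independent of the rest of the batch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 16)  # 4 samples, 16 features

# Layer Norm statistics come from the feature dimension of each sample
eps = 1e-5
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
x_hat = (x - mu) / torch.sqrt(var + eps)

ln = nn.LayerNorm(16)  # gamma=1, beta=0 at initialization
assert torch.allclose(x_hat, ln(x), atol=1e-5)
# A single sample normalizes identically inside or outside a batch:
assert torch.allclose(ln(x[:1]), ln(x)[:1], atol=1e-6)
```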
RMS Normalization (Zhang and Sennrich, 2019) simplifies Layer Norm by removing the mean centering:

\[\hat{x}_i = \frac{x_i}{\text{RMS}(x)}, \quad \text{RMS}(x) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2}\]

This is cheaper to compute and is used in many modern LLMs.
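A minimal sketch (the small `eps` inside the square root is an implementation detail most libraries add for numerical stability; it is not part of the formula above):

```python
import torch

def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # No mean subtraction: just divide by the root-mean-square of
    # the features, with eps guarding against division by zero.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms

torch.manual_seed(0)
x = torch.randn(2, 8)
y = rms_norm(x)
# Each normalized sample now has RMS close to 1
print(y.pow(2).mean(dim=-1))
```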
A pragmatic solution for exploding gradients: if the gradient norm exceeds a threshold, scale it down:
\[g \leftarrow \frac{\tau}{\|g\|} g \quad \text{if } \|g\| > \tau\]

This doesn't remove the underlying cause, but it contains the damage. Gradient clipping is essential for training RNNs and is still standard in modern Transformer training.
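A hand-rolled sketch of norm clipping (in real training you would call PyTorch's built-in `torch.nn.utils.clip_grad_norm_` on `model.parameters()`):

```python
import torch

def clip_grad_norm(grads, tau: float) -> float:
    # Treat all gradients as one vector; rescale if its norm exceeds tau.
    total = torch.sqrt(sum(g.pow(2).sum() for g in grads))
    if total > tau:
        for g in grads:
            g.mul_(tau / total)
    return float(total)

g = [torch.full((4,), 10.0)]  # norm = sqrt(4 * 100) = 20
clip_grad_norm(g, tau=1.0)
print(g[0].norm().item())  # 1.0 after clipping
```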
As discussed in Chapter 2, replacing sigmoid with ReLU eliminates vanishing gradients in the positive regime. The combination of ReLU + proper initialization + BatchNorm made it possible to train networks up to ~30 layers by 2015.
But going deeper still required a more fundamental architectural innovation — skip connections (Chapter 7).
```python
# See code/ch04_gradient_flow.py for the full experiment
import torch
import torch.nn as nn

# Without BatchNorm: gradients vanish in deep networks
class DeepNet(nn.Module):
    def __init__(self, depth=20, use_batchnorm=False):
        super().__init__()
        layers = []
        for _ in range(depth):
            layers.append(nn.Linear(256, 256))
            if use_batchnorm:
                layers.append(nn.BatchNorm1d(256))
            layers.append(nn.ReLU())
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```
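A quick check of the effect (the `make_deep_net` helper below rebuilds the same stack inline so the snippet is self-contained; the exact norms depend on the random seed and PyTorch version):

```python
import torch
import torch.nn as nn

def make_deep_net(depth=20, use_batchnorm=False):
    # Same stack as DeepNet above: Linear -> (BatchNorm) -> ReLU, repeated
    layers = []
    for _ in range(depth):
        layers.append(nn.Linear(256, 256))
        if use_batchnorm:
            layers.append(nn.BatchNorm1d(256))
        layers.append(nn.ReLU())
    return nn.Sequential(*layers)

torch.manual_seed(0)
for use_bn in (False, True):
    net = make_deep_net(use_batchnorm=use_bn)
    x = torch.randn(64, 256)
    net(x).sum().backward()
    g = net[0].weight.grad.norm().item()
    print(f"use_batchnorm={use_bn}: first-layer grad norm = {g:.2e}")
# The BatchNorm variant typically keeps the first-layer gradient far
# larger than the plain stack at this depth.
```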
There’s a unifying way to think about all these solutions: each one builds a better highway for gradients to flow backward through the network.

- Careful initialization (Xavier, He) starts every layer at a scale where signals neither shrink nor grow
- Normalization (BatchNorm, LayerNorm, RMSNorm) keeps activations well-scaled throughout training
- Gradient clipping bounds the damage when gradients explode anyway
- ReLU activations avoid the saturating derivatives that shrink gradients in the first place
The vanishing gradient problem was never “solved” in one step — it was progressively conquered by a combination of techniques, each addressing a different aspect of the issue.
Previous Chapter: Convolutional Neural Networks
Next Chapter: Recurrent Networks and Gating Mechanisms — LSTM & GRU