Chapter 4: The Vanishing Gradient Problem and Its Solutions

The Central Problem of Deep Learning

If deeper networks are more powerful, why can’t we just keep adding layers? The answer plagued neural network research for over two decades: the vanishing gradient problem.

During backpropagation, gradients must travel from the loss function all the way back to the earliest layers. At each layer, the gradient is multiplied by the local derivative of the activation function and the layer’s weights. In a deep network, these multiplications compound — and depending on the magnitudes, the gradient either vanishes (shrinks to near-zero) or explodes (grows to infinity).

The Mathematics of Gradient Flow

Consider a simple $L$-layer network where each layer computes $h^{(l)} = \sigma(W^{(l)} h^{(l-1)})$. The gradient of the loss with respect to the parameters of layer $l$ involves the product:

\[\frac{\partial L}{\partial W^{(l)}} \propto \prod_{k=l+1}^{L} \frac{\partial h^{(k)}}{\partial h^{(k-1)}} = \prod_{k=l+1}^{L} \text{diag}(\sigma'(z^{(k)})) \cdot W^{(k)}\]

This is a product of $L - l$ matrices. If the spectral norm of each matrix is less than 1, the product shrinks exponentially. If greater than 1, it grows exponentially.
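This compounding is easy to see numerically. The following is an illustrative sketch (not from the chapter's code files; the matrix size and scales are arbitrary choices) that multiplies a stack of random Jacobian-like matrices and measures the product's norm:

```python
# Multiply `depth` random n x n matrices and watch the product's
# spectral norm shrink or explode depending on the per-layer scale.
import numpy as np

rng = np.random.default_rng(0)

def product_norm(scale, depth=20, n=64):
    """Spectral norm of a product of `depth` random n x n matrices."""
    P = np.eye(n)
    for _ in range(depth):
        W = rng.normal(0, scale / np.sqrt(n), size=(n, n))
        P = W @ P
    return np.linalg.norm(P, 2)

small = product_norm(scale=0.5)   # per-layer factor < 1 -> vanishes
large = product_norm(scale=2.0)   # per-layer factor > 1 -> explodes
print(f"scale 0.5: {small:.3e}")
print(f"scale 2.0: {large:.3e}")
```

With 20 layers, a per-layer factor of 0.5 already drives the product to around $10^{-6}$, and a factor of 2 drives it to around $10^{6}$ — exactly the exponential behavior described above.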

Vanishing Gradients

With sigmoid activations (whose derivative peaks at $\sigma'_{\max} = 0.25$) and weights initialized to small values:

\[\|\text{gradient}\| \sim 0.25^L \cdot \|W\|^L\]

For $L = 20$ layers, $0.25^{20} \approx 10^{-12}$: gradients in the early layers are roughly twelve orders of magnitude smaller than in later layers. The first layers effectively stop learning.

Exploding Gradients

Conversely, if weights are too large, gradients grow exponentially. This manifests as:

  - Sudden spikes in the training loss
  - Weight updates that overshoot wildly
  - Numerical overflow, with the loss becoming `NaN` or `inf`

Hochreiter first analyzed this problem rigorously in his 1991 diploma thesis, and it was one of the key motivations for LSTM networks (Chapter 5).

Solution 1: Better Weight Initialization

Xavier/Glorot Initialization (2010)

Xavier Glorot and Yoshua Bengio analyzed the variance of activations and gradients through a network and showed that to keep them stable, weights should be initialized from:

\[W^{(l)} \sim \mathcal{N}\left(0, \frac{2}{n_{in} + n_{out}}\right)\]

or uniformly from $\left[-\sqrt{\frac{6}{n_{in} + n_{out}}}, \sqrt{\frac{6}{n_{in} + n_{out}}}\right]$

where $n_{in}$ and $n_{out}$ are the number of input and output units. The key insight: the variance should depend on the fan-in and fan-out of each layer.

This was derived under the assumption of linear activations or tanh (which is approximately linear near zero).
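A quick illustrative check of the claim (not the book's own code; depth, width, and batch size are arbitrary choices): with Xavier initialization, the activation variance through a deep tanh network stays in a healthy range instead of collapsing to zero or saturating.

```python
# Push a batch through 20 tanh layers with Xavier-initialized weights
# and check that the activation variance neither vanishes nor saturates.
import torch
import torch.nn as nn

torch.manual_seed(0)
h = torch.randn(1024, 256)               # inputs with unit variance
for _ in range(20):
    lin = nn.Linear(256, 256, bias=False)
    nn.init.xavier_normal_(lin.weight)   # N(0, 2/(n_in + n_out))
    h = torch.tanh(lin(h))
print(f"activation variance after 20 tanh layers: {h.var().item():.4f}")
```

The variance decays slowly (tanh compresses its input a little at every layer) but stays far from zero; with naive large-variance initialization it would saturate, and with naive small-variance initialization it would vanish within a few layers.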

Kaiming/He Initialization (2015)

When using ReLU, Xavier initialization is suboptimal because ReLU zeros out half the outputs, effectively halving the variance. He et al. derived the correct initialization for ReLU networks:

\[W^{(l)} \sim \mathcal{N}\left(0, \frac{2}{n_{in}}\right)\]

This simple change enabled training much deeper networks (up to ~30 layers without residual connections).
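The difference between the two schemes is easy to demonstrate. This is an illustrative comparison (assumptions: 20 layers, width 256, no biases), using PyTorch's built-in `kaiming_normal_` and `xavier_normal_` initializers:

```python
# He init keeps ReLU activation variance roughly constant across depth,
# while Xavier init lets it shrink by about half per layer.
import torch
import torch.nn as nn

torch.manual_seed(0)

def final_variance(init_fn, depth=20, width=256):
    h = torch.randn(1024, width)
    for _ in range(depth):
        lin = nn.Linear(width, width, bias=False)
        init_fn(lin.weight)
        h = torch.relu(lin(h))
    return h.var().item()

he = final_variance(nn.init.kaiming_normal_)     # N(0, 2/n_in)
xavier = final_variance(nn.init.xavier_normal_)  # N(0, 2/(n_in + n_out))
print(f"He: {he:.2e}   Xavier: {xavier:.2e}")
```

With Xavier under ReLU, the variance halves at every layer ($2^{-20} \approx 10^{-6}$ after 20 layers), while He's extra factor of 2 compensates exactly for the halving.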

Solution 2: Batch Normalization (2015)

The Internal Covariate Shift Problem

Sergey Ioffe and Christian Szegedy introduced Batch Normalization (BatchNorm), one of the most impactful techniques in deep learning. Their motivation was “internal covariate shift” — the distribution of each layer’s inputs changes during training as the parameters of the preceding layers change.

How It Works

For a mini-batch $B = \{x_1, \ldots, x_m\}$ at each layer:

  1. Compute batch statistics: \(\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i, \quad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2\)

  2. Normalize: \(\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}\)

  3. Scale and shift (with learnable parameters $\gamma$ and $\beta$): \(y_i = \gamma \hat{x}_i + \beta\)
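The three steps above can be sketched in a few lines and checked against PyTorch's `nn.BatchNorm1d` (a minimal sketch, with $\gamma = 1$ and $\beta = 0$ as at initialization):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(32, 8) * 3 + 5               # mini-batch with nontrivial stats
eps = 1e-5

# Steps 1-2: per-feature batch statistics, then normalize
mu = x.mean(dim=0)
var = x.var(dim=0, unbiased=False)           # biased variance, as in the paper
x_hat = (x - mu) / torch.sqrt(var + eps)

# Step 3: scale and shift with learnable gamma and beta
gamma, beta = torch.ones(8), torch.zeros(8)
y = gamma * x_hat + beta

bn = nn.BatchNorm1d(8, eps=eps)              # fresh layer, training mode
y_torch = bn(x)
print(torch.allclose(y, y_torch, atol=1e-5))  # → True
```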

Why BatchNorm Is So Effective

The original “internal covariate shift” explanation has been debated. More recent analysis suggests BatchNorm works because:

  1. Smoother loss landscape: BatchNorm makes the optimization landscape significantly smoother, allowing larger learning rates
  2. Gradient flow: By normalizing activations, it prevents them from growing or shrinking through layers
  3. Implicit regularization: The noise from batch statistics acts as a regularizer, similar to dropout
  4. Decoupling layers: Each layer can be trained somewhat independently of the others

BatchNorm in Practice
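One practical point deserves emphasis, sketched here with PyTorch's `nn.BatchNorm1d` (an illustrative example, not the book's own code): BatchNorm behaves differently at training and evaluation time. In training mode it normalizes with the current batch's statistics and updates running averages; in eval mode it normalizes with those stored running averages instead.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)
x = torch.randn(16, 4) * 2 + 3   # features with mean ~3, std ~2

bn.train()
_ = bn(x)                        # uses batch stats; updates running_mean/var
print(bn.running_mean)           # has moved from 0 toward the batch mean

bn.eval()
y = bn(x)                        # now uses the stored running statistics
```

Forgetting to call `model.eval()` before inference is one of the most common BatchNorm bugs in practice.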

Solution 3: Layer Normalization (2016)

BatchNorm has limitations:

  - It depends on the batch: small batches give noisy statistics, and it breaks down entirely for batch size 1
  - Training and inference behave differently (batch statistics vs. running averages)
  - It is awkward for RNNs and variable-length sequences, where each time step would need its own statistics

Layer Normalization (Ba, Kiros, and Hinton, 2016) normalizes across the feature dimension instead of the batch dimension:

\[\hat{x}_i = \frac{x_i - \mu_{\text{layer}}}{\sqrt{\sigma_{\text{layer}}^2 + \epsilon}}\]

where statistics are computed over all features of a single sample. This is:

  - Independent of the batch size (it works even with a single sample)
  - Identical at training and inference time (no running averages needed)
  - Well suited to RNNs and Transformers, where batch statistics are awkward

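A minimal sketch of the formula, checked against PyTorch's `nn.LayerNorm` (with $\gamma = 1$ and $\beta = 0$ as at initialization):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 16)                       # 4 samples, 16 features
eps = 1e-5

mu = x.mean(dim=-1, keepdim=True)            # statistics per sample
var = x.var(dim=-1, unbiased=False, keepdim=True)
x_hat = (x - mu) / torch.sqrt(var + eps)

ln = nn.LayerNorm(16, eps=eps)
y_torch = ln(x)
print(torch.allclose(x_hat, y_torch, atol=1e-5))  # → True
```

Note that the normalization axis is the feature dimension (`dim=-1`), not the batch dimension as in BatchNorm.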
Other Normalization Variants

| Normalization | Normalizes Over | Best For |
|---|---|---|
| Batch Norm | Batch dimension | CNNs |
| Layer Norm | Feature dimension | Transformers, RNNs |
| Instance Norm | Spatial dimensions, per channel | Style transfer |
| Group Norm | Groups of channels | Small-batch training |
| RMS Norm | Feature dimension (no mean centering) | LLMs (LLaMA) |

RMS Normalization (2019) simplifies Layer Norm by removing the mean centering:

\[\hat{x}_i = \frac{x_i}{\text{RMS}(x)}, \quad \text{RMS}(x) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2}\]

This is cheaper to compute and used in many modern LLMs.
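A hedged sketch of RMS normalization as defined above; the learnable gain vector that implementations typically add is included here as `gain` (an assumption for illustration, not part of the formula):

```python
import torch

torch.manual_seed(0)

def rms_norm(x, gain, eps=1e-6):
    # Divide by the root mean square over the feature dimension;
    # eps guards against division by zero.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return gain * x / rms

x = torch.randn(2, 8)
y = rms_norm(x, gain=torch.ones(8))
print(y.pow(2).mean(dim=-1))   # each row now has mean square ~ 1
```

Compared with Layer Norm, there is no mean subtraction and no bias term, which saves both compute and parameters.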

Solution 4: Gradient Clipping

A pragmatic solution for exploding gradients: if the gradient norm exceeds a threshold, scale it down:

\[g \leftarrow \frac{\tau}{\|g\|} g \quad \text{if } \|g\| > \tau\]

This doesn’t prevent the problem, but contains its damage. Gradient clipping is essential for training RNNs and is still used in modern Transformer training.
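The rescaling above is what `torch.nn.utils.clip_grad_norm_` implements; here is a minimal sketch on a single layer fed deliberately huge inputs:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lin = nn.Linear(10, 1)
loss = lin(torch.randn(4, 10) * 100).pow(2).mean()  # huge inputs -> huge grads
loss.backward()

# Returns the total norm *before* clipping, then rescales grads in place
total_norm = torch.nn.utils.clip_grad_norm_(lin.parameters(), max_norm=1.0)
clipped = torch.sqrt(sum(p.grad.pow(2).sum() for p in lin.parameters()))
print(f"before: {float(total_norm):.1f}   after: {float(clipped):.4f}")
```

Clipping is applied after `backward()` and before `optimizer.step()`, so the update direction is preserved while its magnitude is capped at $\tau$.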

Solution 5: Better Activation Functions

As discussed in Chapter 2, replacing sigmoid with ReLU eliminates vanishing gradients in the positive regime. The combination of ReLU + proper initialization + BatchNorm made it possible to train networks up to ~30 layers by 2015.

But going deeper still required a more fundamental architectural innovation — skip connections (Chapter 7).

Code Example: The Impact of Initialization and Normalization

```python
# See code/ch04_gradient_flow.py for full experiment
import torch
import torch.nn as nn

# Without BatchNorm, gradients vanish in deep networks;
# with it, they stay at a healthy scale.
class DeepNet(nn.Module):
    def __init__(self, depth=20, use_batchnorm=False):
        super().__init__()
        layers = []
        for _ in range(depth):
            layers.append(nn.Linear(256, 256))
            if use_batchnorm:
                layers.append(nn.BatchNorm1d(256))
            layers.append(nn.ReLU())
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```
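A hedged sketch of the experiment such a network is built for (the full version lives in `code/ch04_gradient_flow.py`): build the same stack inline and compare the gradient norm reaching the first layer with and without BatchNorm. Exact numbers depend on the random seed.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def first_layer_grad_norm(use_batchnorm, depth=20, width=256):
    layers = []
    for _ in range(depth):
        layers.append(nn.Linear(width, width))
        if use_batchnorm:
            layers.append(nn.BatchNorm1d(width))
        layers.append(nn.ReLU())
    net = nn.Sequential(*layers)
    net(torch.randn(64, width)).sum().backward()
    return net[0].weight.grad.norm().item()

plain = first_layer_grad_norm(False)
normed = first_layer_grad_norm(True)
print(f"first-layer grad norm  plain: {plain:.3e}   batchnorm: {normed:.3e}")
```

With default initialization, the plain network's first-layer gradient is many orders of magnitude smaller than the BatchNorm version's — the vanishing gradient problem made concrete.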

The Gradient Highway

There’s a unifying way to think about all these solutions. Each one creates a better highway for gradients to flow backward through the network:

  - Careful initialization starts signals at the right scale, so the highway isn’t blocked from the first step
  - Normalization layers re-level the road at every layer, keeping activations and gradients in a healthy range
  - ReLU-style activations avoid the saturating regions where gradients die
  - Gradient clipping guards against the occasional runaway gradient

The vanishing gradient problem was never “solved” in one step — it was progressively conquered by a combination of techniques, each addressing a different aspect of the issue.

Key Takeaways

  - Gradients in deep networks are products of many per-layer Jacobians, so they shrink or grow exponentially with depth
  - Scale-aware initialization (Xavier/Glorot for tanh, Kaiming/He for ReLU) keeps activation and gradient variance stable at the start of training
  - Normalization layers (BatchNorm, LayerNorm, RMSNorm) keep activations well-scaled throughout training
  - Gradient clipping is a pragmatic guard against exploding gradients, especially in RNNs
  - Together these techniques reach roughly 30 layers; going deeper required skip connections (Chapter 7)

Previous Chapter: Convolutional Neural Networks

Next Chapter: Recurrent Networks and Gating Mechanisms — LSTM & GRU

Back to Table of Contents