By 2015, the deep learning community had identified a frustrating paradox. Theoretically, a deeper network should perform at least as well as a shallower one — the extra layers could simply learn the identity function. In practice, the opposite happened: adding more layers to a well-performing 20-layer network made it worse, not better.
This wasn’t overfitting. The training error itself was higher for the deeper network. The optimization simply couldn’t find a good solution in the higher-dimensional weight space.
He, Zhang, Ren, and Sun (from Microsoft Research) proposed an elegantly simple solution: residual learning.
Instead of learning a desired mapping $H(x)$ directly, let the network learn the residual $F(x) = H(x) - x$:
\[H(x) = F(x) + x\]

This is implemented with a skip connection (also called a shortcut connection) that bypasses one or more layers:
Input x ────────────────────────────────┐
   │                                    │  (identity)
   ▼                                    │
 Conv → BN → ReLU → Conv → BN  = F(x)   │
   │                                    │
   └─────────────→ ( + ) ←──────────────┘  = F(x) + x
                     │
                   ReLU
                     │
                Output H(x)
If the optimal transformation is close to the identity (which is true for many layers in a deep network), it’s much easier for the network to learn $F(x) \approx 0$ than to learn $H(x) \approx x$ from scratch. Pushing weights toward zero is straightforward with weight decay; learning the identity function through multiple non-linear layers is not.
The gradient through a residual block is:
\[\frac{\partial L}{\partial x} = \frac{\partial L}{\partial H} \cdot \frac{\partial H}{\partial x} = \frac{\partial L}{\partial H} \cdot \left(1 + \frac{\partial F}{\partial x}\right)\]

The key term is the 1 — it provides a direct gradient pathway that doesn’t depend on the learned transformation $F$. Even if $\frac{\partial F}{\partial x}$ is small, the gradient never vanishes because it always has the identity component.
In a network with $L$ residual blocks, the gradient from loss to input includes a sum over all possible paths:
\[\frac{\partial L}{\partial x_0} = \frac{\partial L}{\partial x_L} \cdot \prod_{k=0}^{L-1} \left(1 + \frac{\partial F_k}{\partial x_k}\right) = \frac{\partial L}{\partial x_L} \cdot \sum_{\text{paths } S} \; \prod_{k \in S} \frac{\partial F_k}{\partial x_k}\]

Expanding the product yields a sum over all $2^L$ subsets of blocks, so gradients can flow through any subset of blocks — they don’t have to traverse every single layer. This is why ResNets can be trained at depths that were previously impossible.
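To see why the identity term matters, here is a toy scalar calculation (the per-block derivative of 0.1 is an illustrative value, not from the text): treating each block's Jacobian as a number, compare the gradient factor with and without the skip connection.

```python
# Toy scalar model of gradient flow through 20 stacked blocks.
# Assume each block's learned transformation has a small local
# derivative dF/dx = 0.1 (illustrative, chosen for the example).

depth = 20
dF = 0.1

# Plain network: the gradient factor is the product of layer derivatives.
plain = dF ** depth

# Residual network: each block contributes (1 + dF/dx) instead.
residual = (1 + dF) ** depth

print(f"plain:    {plain:.3e}")   # collapses toward zero
print(f"residual: {residual:.3f}")  # stays well above zero
```

The plain product shrinks geometrically, while the residual product never drops below the contribution of the identity path — a direct numerical reading of the equation above.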
For shallower ResNets, each residual block contains two 3×3 convolutions:
\[F(x) = \text{BN}(W_2 \cdot \text{ReLU}(\text{BN}(W_1 \cdot x)))\]

For deeper networks (ResNet-50 and beyond), a bottleneck design reduces computation:
x → Conv(1×1, reduce channels) → BN → ReLU
→ Conv(3×3, same channels) → BN → ReLU
→ Conv(1×1, expand channels) → BN → (+x) → ReLU
The 1×1 convolutions reduce and then restore the channel dimension, making the expensive 3×3 convolution operate on fewer channels.
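A minimal PyTorch sketch of such a bottleneck block (the class name and the reduction factor of 4 are conventions borrowed from the original paper; treat this as an illustration, not a reference implementation):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 expand, with an identity shortcut."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction  # the "bottleneck" width
        self.block = nn.Sequential(
            nn.Conv2d(channels, mid, 1),        # reduce channels
            nn.BatchNorm2d(mid),
            nn.ReLU(),
            nn.Conv2d(mid, mid, 3, padding=1),  # cheap 3x3 on fewer channels
            nn.BatchNorm2d(mid),
            nn.ReLU(),
            nn.Conv2d(mid, channels, 1),        # expand back
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.block(x) + x)

x = torch.randn(2, 256, 8, 8)
y = Bottleneck(256)(x)
print(y.shape)  # torch.Size([2, 256, 8, 8])
```

With 256 input channels, the 3×3 convolution operates on only 64 channels, which is where the computational savings come from.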
When the input and output dimensions don’t match (due to spatial downsampling or channel changes), the shortcut uses a projection:
\[H(x) = F(x) + W_s x\]

where $W_s$ is a 1×1 convolution that matches dimensions.
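A sketch of a downsampling block with such a projection shortcut (the class name and the specific channel counts are invented for this example): the 1×1 convolution $W_s$ uses the same stride as the main branch so the two tensors align for the addition.

```python
import torch
import torch.nn as nn

class DownsampleBlock(nn.Module):
    """Residual block that halves resolution and changes channel count,
    using a 1x1 projection W_s on the shortcut to match shapes."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        # W_s: a 1x1 conv with matching stride so F(x) and W_s x align.
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride=2),
            nn.BatchNorm2d(out_ch),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.block(x) + self.proj(x))

x = torch.randn(2, 64, 16, 16)
y = DownsampleBlock(64, 128)(x)
print(y.shape)  # torch.Size([2, 128, 8, 8])
```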
ResNet won the 2015 ILSVRC classification competition with a 152-layer network; an ensemble of ResNets achieved 3.57% top-5 error — below the estimated human-level error of roughly 5.1%.
| Network | Depth | Top-5 Error | Parameters |
|---|---|---|---|
| AlexNet | 8 | 15.3% | 62M |
| VGG-19 | 19 | 7.3% | 144M |
| GoogLeNet | 22 | 6.7% | 6.8M |
| ResNet-152 | 152 | 3.57% | 60M |
ResNet achieved the best performance with fewer parameters than VGG, thanks to the bottleneck design and global average pooling.
He et al. published a follow-up showing that moving BatchNorm and ReLU before the convolution (instead of after) improves performance:
Original order: Conv → BN → ReLU → Conv → BN → Add → ReLU
Pre-activation order: BN → ReLU → Conv → BN → ReLU → Conv → Add
The pre-activation design creates a cleaner identity pathway: the addition is no longer followed by a ReLU, so the skip branch passes through the entire network completely unmodified, and nothing after the addition can block gradient flow.
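The pre-activation ordering can be sketched as follows (a minimal illustration; the class name is mine):

```python
import torch
import torch.nn as nn

class PreActBlock(nn.Module):
    """Pre-activation residual block: BN -> ReLU -> Conv, twice,
    with nothing applied after the addition (the clean identity path)."""

    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.block(x)  # note: no ReLU after the add

x = torch.randn(2, 64, 8, 8)
y = PreActBlock(64)(x)
print(y.shape)  # torch.Size([2, 64, 8, 8])
```

Stacking these blocks means the input tensor reaches any depth as a pure sum $x_0 + \sum_k F_k$, with no intervening non-linearity on the shortcut.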
DenseNet took skip connections to the extreme: each layer receives feature maps from all preceding layers:
\[x_l = F_l([x_0, x_1, ..., x_{l-1}])\]

where $[\ldots]$ denotes concatenation along the channel dimension. Benefits:

- Direct gradient flow from the loss to every earlier layer
- Feature reuse: later layers can build on low-level features directly
- Parameter efficiency: each layer adds only a small number of new channels (the “growth rate”)
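The concatenation rule above can be sketched like this (the `growth` value of 32 and the layer structure are illustrative, not DenseNet's exact configuration, which also uses bottleneck and transition layers):

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One DenseNet-style layer: consumes all prior feature maps
    (concatenated on the channel axis), emits `growth` new channels."""

    def __init__(self, in_ch, growth=32):
        super().__init__()
        self.block = nn.Sequential(
            nn.BatchNorm2d(in_ch),
            nn.ReLU(),
            nn.Conv2d(in_ch, growth, 3, padding=1),
        )

    def forward(self, x):
        return self.block(x)

growth = 32
features = [torch.randn(2, 64, 8, 8)]  # x_0
for l in range(3):
    in_ch = 64 + l * growth            # channels grow linearly with depth
    x_l = DenseLayer(in_ch, growth)(torch.cat(features, dim=1))
    features.append(x_l)

print(torch.cat(features, dim=1).shape)  # torch.Size([2, 160, 8, 8])
```

Note how each layer's input width grows by exactly the growth rate: concatenation accumulates features instead of overwriting them.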
Published slightly before ResNets, Highway Networks introduced learned gates that control how much information flows through the skip connection:
\[H(x) = T(x) \odot F(x) + (1 - T(x)) \odot x\]

where $T(x) = \sigma(W_T x + b_T)$ is a transform gate (sigmoid output). When $T(x) = 0$, the layer is a pure identity; when $T(x) = 1$, the layer is a pure transformation.
This is conceptually beautiful — it’s an LSTM-style gate for feedforward networks. But in practice, the simpler ResNet (without gates) performed better, suggesting that the network can learn when to “do nothing” without explicit gating.
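A minimal fully connected Highway layer might look like this (a sketch; the negative gate-bias initialization follows the original paper's recommendation to start layers near the identity):

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """H(x) = T(x) * F(x) + (1 - T(x)) * x for a fully connected layer."""

    def __init__(self, dim):
        super().__init__()
        self.F = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.T = nn.Linear(dim, dim)
        # Negative bias pushes sigmoid(T) toward 0, i.e. toward carrying
        # the input through unchanged at initialization.
        nn.init.constant_(self.T.bias, -2.0)

    def forward(self, x):
        t = torch.sigmoid(self.T(x))  # transform gate in (0, 1)
        return t * self.F(x) + (1 - t) * x

x = torch.randn(4, 16)
y = HighwayLayer(16)(x)
print(y.shape)  # torch.Size([4, 16])
```

Compare this with the residual block: ResNet effectively fixes both coefficients to 1 (adding $F(x)$ and $x$ without gating) and still trains better.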
U-Net introduced skip connections between encoder and decoder at matching resolutions:
Encoder: [64] → [128] → [256] → [512]
           ↓       ↓       ↓
Decoder: [64] ← [128] ← [256] ← [512]
These connections allow the decoder to use high-resolution features from the encoder for precise pixel-level predictions. U-Net became the standard for medical image segmentation and later became a key component in diffusion models (Chapter 13).
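As an illustration of one such encoder–decoder hop (shapes and channel counts here are invented for the example, not U-Net's exact configuration): the decoder upsamples, concatenates the encoder feature map at the matching resolution, then convolves the merged tensor.

```python
import torch
import torch.nn as nn

# One decoder step with a U-Net-style skip connection.
enc_feat = torch.randn(1, 64, 32, 32)    # encoder output at this resolution
bottleneck = torch.randn(1, 128, 16, 16)  # deeper, lower-resolution features

# Upsample the deep features back to the encoder's resolution.
up = nn.ConvTranspose2d(128, 64, 2, stride=2)(bottleneck)  # -> (1, 64, 32, 32)

# Skip connection: concatenate rather than add, preserving both sources.
merged = torch.cat([up, enc_feat], dim=1)                  # -> (1, 128, 32, 32)
out = nn.Conv2d(128, 64, 3, padding=1)(merged)

print(out.shape)  # torch.Size([1, 64, 32, 32])
```

Unlike ResNet's additive shortcut, U-Net concatenates, so the decoder sees the encoder's high-resolution features verbatim alongside the upsampled semantic features.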
FPN combines multi-scale features through lateral connections, enabling object detection at multiple scales. This is essentially skip connections applied to feature pyramids.
Skip connections appear everywhere in modern deep learning: around every attention and feedforward sublayer in the Transformer, in the U-Nets at the heart of diffusion models, and in virtually every modern convolutional architecture.

The reason is both mathematical and intuitive: mathematically, the identity term guarantees an unobstructed gradient path to every layer; intuitively, it is easier to learn a small refinement of an already-useful representation than to rebuild that representation from scratch at every layer.
```python
# See code/ch07_resnet.py for full architecture
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.block(x) + x)  # The skip connection
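One practical consequence of the "learn $F(x) \approx 0$" argument is the zero-gamma initialization trick: zeroing the final BatchNorm's scale makes $F(x) = 0$ exactly, so each block starts as a pure identity path (torchvision's ResNet exposes this as `zero_init_residual`; a standalone sketch, with arbitrary small shapes):

```python
import torch
import torch.nn as nn

# Zero the final BN's scale (gamma) so the residual branch outputs all
# zeros at initialization, making H(x) = ReLU(x) from the first step.
block = nn.Sequential(
    nn.Conv2d(8, 8, 3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
    nn.Conv2d(8, 8, 3, padding=1),
    nn.BatchNorm2d(8),
)
nn.init.zeros_(block[-1].weight)  # gamma = 0 -> branch output is exactly 0
block.eval()                      # use running stats, don't update them

x = torch.randn(2, 8, 4, 4)
h = torch.relu(block(x) + x)
assert torch.allclose(h, torch.relu(x))
print("block starts as the identity (up to the final ReLU)")
```

This is the "pushing weights toward zero is straightforward" observation made literal: the network begins at the identity and learns only the deviations it needs.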
The skip connection is arguably the most important architectural innovation in deep learning. It transformed neural networks from shallow approximators into arbitrarily deep function learners. And its influence would only grow — when the Transformer was introduced in 2017, residual connections were baked in from the start.
Previous Chapter: Regularization Strategies — Taming Overfitting
Next Chapter: Attention Mechanisms and the Transformer