By 2015, the deep learning community had identified a frustrating paradox. Theoretically, a deeper network should perform at least as well as a shallower one — the extra layers could simply learn the identity function. In practice, the opposite happened: adding more layers to a well-performing 20-layer network made it worse, not better.
This wasn’t overfitting. The training error itself was higher for the deeper network. The optimization simply couldn’t find a good solution in the higher-dimensional weight space.
He, Zhang, Ren, and Sun (from Microsoft Research) proposed an elegantly simple solution: residual learning.
Instead of learning a desired mapping $H(x)$ directly, let the network learn the residual $F(x) = H(x) - x$:
\[H(x) = F(x) + x\]

This is implemented with a skip connection (also called a shortcut connection) that bypasses one or more layers:
Input x ────────────────────────────────┐
   │                                    │  (identity)
   ▼                                    │
 Conv → BN → ReLU → Conv → BN  = F(x)   │
   │                                    │
   └─────────────→ ( + ) ←──────────────┘  = F(x) + x
                     │
                   ReLU
                     │
                Output H(x)
If the optimal transformation is close to the identity (which is true for many layers in a deep network), it’s much easier for the network to learn $F(x) \approx 0$ than to learn $H(x) \approx x$ from scratch. Pushing weights toward zero is straightforward with weight decay; learning the identity function through multiple non-linear layers is not.
The gradient through a residual block is:
\[\frac{\partial L}{\partial x} = \frac{\partial L}{\partial H} \cdot \frac{\partial H}{\partial x} = \frac{\partial L}{\partial H} \cdot \left(1 + \frac{\partial F}{\partial x}\right)\]

The key term is the 1 — it provides a direct gradient pathway that doesn’t depend on the learned transformation $F$. Even if $\frac{\partial F}{\partial x}$ is small, the gradient never vanishes because it always has the identity component.
In a network with $L$ residual blocks, the gradient from loss to input includes a sum over all possible paths:
\[\frac{\partial L}{\partial x_0} = \frac{\partial L}{\partial x_L} \cdot \prod_{k=0}^{L-1} \left(1 + \frac{\partial F_k}{\partial x_k}\right) = \frac{\partial L}{\partial x_L} \cdot \sum_{\text{paths } S} \; \prod_{k \in S} \frac{\partial F_k}{\partial x_k}\]

Expanding the product yields a sum over all $2^L$ subsets of blocks, so gradients can flow through any subset of blocks — they don’t have to traverse every single layer. This is why ResNets can be trained at depths that were previously impossible.
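To see why the identity term matters, here is a toy scalar calculation (the per-block derivative of 0.1 is an illustrative value, not from the text): treating each block's Jacobian as a number, compare the gradient factor with and without the skip connection.

```python
# Toy scalar model of gradient flow through 20 stacked blocks.
# Assume each block's learned transformation has a small local
# derivative dF/dx = 0.1 (illustrative, chosen for the example).

depth = 20
dF = 0.1

# Plain network: the gradient factor is the product of layer derivatives.
plain = dF ** depth

# Residual network: each block contributes (1 + dF/dx) instead.
residual = (1 + dF) ** depth

print(f"plain:    {plain:.3e}")   # collapses toward zero
print(f"residual: {residual:.3f}")  # stays well above zero
```

The plain product shrinks geometrically, while the residual product never drops below the contribution of the identity path — a direct numerical reading of the equation above.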
For shallower ResNets, each residual block contains two 3×3 convolutions:
\[F(x) = \text{BN}(W_2 \cdot \text{ReLU}(\text{BN}(W_1 \cdot x)))\]

For deeper networks (ResNet-50 and beyond), a bottleneck design reduces computation:
x → Conv(1×1, reduce channels) → BN → ReLU
→ Conv(3×3, same channels) → BN → ReLU
→ Conv(1×1, expand channels) → BN → (+x) → ReLU
The 1×1 convolutions reduce and then restore the channel dimension, making the expensive 3×3 convolution operate on fewer channels.
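A minimal PyTorch sketch of such a bottleneck block (the class name and the reduction factor of 4 are conventions borrowed from the original paper; treat this as an illustration, not a reference implementation):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 expand, with an identity shortcut."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction  # the "bottleneck" width
        self.block = nn.Sequential(
            nn.Conv2d(channels, mid, 1),        # reduce channels
            nn.BatchNorm2d(mid),
            nn.ReLU(),
            nn.Conv2d(mid, mid, 3, padding=1),  # cheap 3x3 on fewer channels
            nn.BatchNorm2d(mid),
            nn.ReLU(),
            nn.Conv2d(mid, channels, 1),        # expand back
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.block(x) + x)

x = torch.randn(2, 256, 8, 8)
y = Bottleneck(256)(x)
print(y.shape)  # torch.Size([2, 256, 8, 8])
```

With 256 input channels, the 3×3 convolution operates on only 64 channels, which is where the computational savings come from.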
When the input and output dimensions don’t match (due to spatial downsampling or channel changes), the shortcut uses a projection:
\[H(x) = F(x) + W_s x\]

where $W_s$ is a 1×1 convolution that matches dimensions.
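A sketch of a downsampling block with such a projection shortcut (the class name and the specific channel counts are invented for this example): the 1×1 convolution $W_s$ uses the same stride as the main branch so the two tensors align for the addition.

```python
import torch
import torch.nn as nn

class DownsampleBlock(nn.Module):
    """Residual block that halves resolution and changes channel count,
    using a 1x1 projection W_s on the shortcut to match shapes."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        # W_s: a 1x1 conv with matching stride so F(x) and W_s x align.
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride=2),
            nn.BatchNorm2d(out_ch),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.block(x) + self.proj(x))

x = torch.randn(2, 64, 16, 16)
y = DownsampleBlock(64, 128)(x)
print(y.shape)  # torch.Size([2, 128, 8, 8])
```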
ResNet won the 2015 ILSVRC classification competition with a 152-layer network; an ensemble of ResNets achieved 3.57% top-5 error — below the estimated human-level error of roughly 5.1%.
| Network | Depth | Top-5 Error | Parameters |
|---|---|---|---|
| AlexNet | 8 | 15.3% | 62M |
| VGG-19 | 19 | 7.3% | 144M |
| GoogLeNet | 22 | 6.7% | 6.8M |
| ResNet-152 | 152 | 3.57% | 60M |
ResNet achieved the best performance with fewer parameters than VGG, thanks to the bottleneck design and global average pooling.
He et al. published a follow-up showing that moving BatchNorm and ReLU before the convolution (instead of after) improves performance:
Original order: Conv → BN → ReLU → Conv → BN → Add → ReLU
Pre-activation order: BN → ReLU → Conv → BN → ReLU → Conv → Add
The pre-activation design creates a cleaner identity pathway: the addition is no longer followed by a ReLU, so the skip branch passes through the entire network completely unmodified, and nothing after the addition can block gradient flow.
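The pre-activation ordering can be sketched as follows (a minimal illustration; the class name is mine):

```python
import torch
import torch.nn as nn

class PreActBlock(nn.Module):
    """Pre-activation residual block: BN -> ReLU -> Conv, twice,
    with nothing applied after the addition (the clean identity path)."""

    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.block(x)  # note: no ReLU after the add

x = torch.randn(2, 64, 8, 8)
y = PreActBlock(64)(x)
print(y.shape)  # torch.Size([2, 64, 8, 8])
```

Stacking these blocks means the input tensor reaches any depth as a pure sum $x_0 + \sum_k F_k$, with no intervening non-linearity on the shortcut.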
DenseNet took skip connections to the extreme: each layer receives feature maps from all preceding layers:
\[x_l = F_l([x_0, x_1, ..., x_{l-1}])\]

where $[\ldots]$ denotes concatenation along the channel dimension. Benefits:

- Direct gradient flow from the loss to every earlier layer
- Feature reuse: later layers can build on low-level features directly
- Parameter efficiency: each layer adds only a small number of new channels (the “growth rate”)
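The concatenation rule above can be sketched like this (the `growth` value of 32 and the layer structure are illustrative, not DenseNet's exact configuration, which also uses bottleneck and transition layers):

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One DenseNet-style layer: consumes all prior feature maps
    (concatenated on the channel axis), emits `growth` new channels."""

    def __init__(self, in_ch, growth=32):
        super().__init__()
        self.block = nn.Sequential(
            nn.BatchNorm2d(in_ch),
            nn.ReLU(),
            nn.Conv2d(in_ch, growth, 3, padding=1),
        )

    def forward(self, x):
        return self.block(x)

growth = 32
features = [torch.randn(2, 64, 8, 8)]  # x_0
for l in range(3):
    in_ch = 64 + l * growth            # channels grow linearly with depth
    x_l = DenseLayer(in_ch, growth)(torch.cat(features, dim=1))
    features.append(x_l)

print(torch.cat(features, dim=1).shape)  # torch.Size([2, 160, 8, 8])
```

Note how each layer's input width grows by exactly the growth rate: concatenation accumulates features instead of overwriting them.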
Published slightly before ResNets, Highway Networks introduced learned gates that control how much information flows through the skip connection:
\[H(x) = T(x) \odot F(x) + (1 - T(x)) \odot x\]

where $T(x) = \sigma(W_T x + b_T)$ is a transform gate (sigmoid output). When $T(x) = 0$, the layer is a pure identity; when $T(x) = 1$, the layer is a pure transformation.
This is conceptually beautiful — it’s an LSTM-style gate for feedforward networks. But in practice, the simpler ResNet (without gates) performed better, suggesting that the network can learn when to “do nothing” without explicit gating.
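A minimal fully connected Highway layer might look like this (a sketch; the negative gate-bias initialization follows the original paper's recommendation to start layers near the identity):

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """H(x) = T(x) * F(x) + (1 - T(x)) * x for a fully connected layer."""

    def __init__(self, dim):
        super().__init__()
        self.F = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.T = nn.Linear(dim, dim)
        # Negative bias pushes sigmoid(T) toward 0, i.e. toward carrying
        # the input through unchanged at initialization.
        nn.init.constant_(self.T.bias, -2.0)

    def forward(self, x):
        t = torch.sigmoid(self.T(x))  # transform gate in (0, 1)
        return t * self.F(x) + (1 - t) * x

x = torch.randn(4, 16)
y = HighwayLayer(16)(x)
print(y.shape)  # torch.Size([4, 16])
```

Compare this with the residual block: ResNet effectively fixes both coefficients to 1 (adding $F(x)$ and $x$ without gating) and still trains better.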
U-Net introduced skip connections between encoder and decoder at matching resolutions:
Encoder: [64] → [128] → [256] → [512]
           ↓       ↓       ↓
Decoder: [64] ← [128] ← [256] ← [512]
These connections allow the decoder to use high-resolution features from the encoder for precise pixel-level predictions. U-Net became the standard for medical image segmentation and later became a key component in diffusion models (Chapter 13).
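As an illustration of one such encoder–decoder hop (shapes and channel counts here are invented for the example, not U-Net's exact configuration): the decoder upsamples, concatenates the encoder feature map at the matching resolution, then convolves the merged tensor.

```python
import torch
import torch.nn as nn

# One decoder step with a U-Net-style skip connection.
enc_feat = torch.randn(1, 64, 32, 32)    # encoder output at this resolution
bottleneck = torch.randn(1, 128, 16, 16)  # deeper, lower-resolution features

# Upsample the deep features back to the encoder's resolution.
up = nn.ConvTranspose2d(128, 64, 2, stride=2)(bottleneck)  # -> (1, 64, 32, 32)

# Skip connection: concatenate rather than add, preserving both sources.
merged = torch.cat([up, enc_feat], dim=1)                  # -> (1, 128, 32, 32)
out = nn.Conv2d(128, 64, 3, padding=1)(merged)

print(out.shape)  # torch.Size([1, 64, 32, 32])
```

Unlike ResNet's additive shortcut, U-Net concatenates, so the decoder sees the encoder's high-resolution features verbatim alongside the upsampled semantic features.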
FPN combines multi-scale features through lateral connections, enabling object detection at multiple scales. This is essentially skip connections applied to feature pyramids.
Skip connections appear everywhere in modern deep learning: around every attention and feedforward sublayer in the Transformer, in the U-Nets at the heart of diffusion models, and in virtually every modern convolutional architecture.

The reason is both mathematical and intuitive: mathematically, the identity term guarantees an unobstructed gradient path to every layer; intuitively, it is easier to learn a small refinement of an already-useful representation than to rebuild that representation from scratch at every layer.
```python
# See code/ch07_resnet.py for full architecture
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.block(x) + x)  # The skip connection
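One practical consequence of the "learn $F(x) \approx 0$" argument is the zero-gamma initialization trick: zeroing the final BatchNorm's scale makes $F(x) = 0$ exactly, so each block starts as a pure identity path (torchvision's ResNet exposes this as `zero_init_residual`; a standalone sketch, with arbitrary small shapes):

```python
import torch
import torch.nn as nn

# Zero the final BN's scale (gamma) so the residual branch outputs all
# zeros at initialization, making H(x) = ReLU(x) from the first step.
block = nn.Sequential(
    nn.Conv2d(8, 8, 3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
    nn.Conv2d(8, 8, 3, padding=1),
    nn.BatchNorm2d(8),
)
nn.init.zeros_(block[-1].weight)  # gamma = 0 -> branch output is exactly 0
block.eval()                      # use running stats, don't update them

x = torch.randn(2, 8, 4, 4)
h = torch.relu(block(x) + x)
assert torch.allclose(h, torch.relu(x))
print("block starts as the identity (up to the final ReLU)")
```

This is the "pushing weights toward zero is straightforward" observation made literal: the network begins at the identity and learns only the deviations it needs.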
The skip connection is arguably the most important architectural innovation in deep learning. It transformed neural networks from shallow approximators into arbitrarily deep function learners. And its influence would only grow — when the Transformer was introduced in 2017, residual connections were baked in from the start.
Previous Chapter: Regularization Strategies — Taming Overfitting
Next Chapter: Attention Mechanisms and the Transformer