Chapter 2: Activation Functions — The Gates That Shape Gradients

Why Activation Functions Matter

Without activation functions, a neural network is just a series of matrix multiplications — which collapses into a single linear transformation, no matter how many layers you stack. Activation functions introduce non-linearity, allowing networks to learn complex decision boundaries and represent intricate functions.

But activation functions do far more than add non-linearity. They act as gates that control which signals pass forward through the network and how gradients flow backward during training.

The choice of activation function has been one of the most impactful decisions in deep learning history.

The Sigmoid: The Original Gate (1960s–2012)

The logistic sigmoid was the default activation for decades:

\[\sigma(x) = \frac{1}{1 + e^{-x}}\]

Properties:

  - Output range $(0, 1)$: bounded, but not zero-centered
  - Smooth and differentiable everywhere, with $\sigma'(x) = \sigma(x)(1 - \sigma(x))$
  - Saturates for large $|x|$, where the gradient approaches zero

Why it was popular: It’s biologically inspired (neurons either fire or don’t), outputs are bounded, and it’s differentiable for backpropagation.

The fatal flaw — saturation:

The maximum value of $\sigma’(x)$ is $0.25$ (at $x=0$). For large positive or negative inputs, the gradient approaches zero. In a network with $L$ layers, gradients multiply through each layer:

\[\frac{\partial \mathcal{L}}{\partial w^{(1)}} \propto \prod_{l=1}^{L} \sigma'(z^{(l)})\]

(writing $\mathcal{L}$ for the loss to avoid confusion with the layer count $L$).

With each factor at most $0.25$, a 10-layer network shrinks gradients by a factor of $0.25^{10} \approx 10^{-6}$. This is the vanishing gradient problem, and it made training deep networks with sigmoids practically impossible.
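This shrinkage is easy to observe directly. The sketch below (illustrative, not from the chapter's code) pushes a gradient through ten stacked sigmoids, with no weights at all, which is already the best case:

```python
import torch

# Stack 10 sigmoids and backpropagate; each layer contributes a factor
# sigma'(z) <= 0.25, so the gradient reaching the input is tiny even
# in this weight-free best-case setting.
x = torch.tensor([0.0], requires_grad=True)
h = x
for _ in range(10):
    h = torch.sigmoid(h)
h.backward()

print(x.grad.item())   # on the order of 1e-7
```

Adding weight matrices with typical initializations only makes the shrinkage worse.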

The Hyperbolic Tangent (tanh)

\[\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\]

Improvements over sigmoid:

  - Zero-centered output in $(-1, 1)$, which keeps the inputs to the next layer balanced around zero
  - Stronger gradients: the maximum of $\tanh'(x)$ is $1$ (at $x = 0$), versus $0.25$ for sigmoid

Tanh became the preferred activation in the 1990s and 2000s. It still saturates for large inputs, but the zero-centered output significantly improves training dynamics compared to sigmoid.
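The derivative advantage can be checked in a couple of lines (an illustrative sketch): tanh's maximum slope is 1 at the origin, four times sigmoid's maximum of 0.25, so stacked tanh layers shrink gradients more slowly.

```python
import torch

# Compare the maximum slopes of tanh and sigmoid via autograd.
x_tanh = torch.tensor([0.0], requires_grad=True)
torch.tanh(x_tanh).backward()      # tanh'(0) = 1 - tanh(0)^2 = 1

x_sig = torch.tensor([0.0], requires_grad=True)
torch.sigmoid(x_sig).backward()    # sigma'(0) = 0.5 * (1 - 0.5) = 0.25

print(x_tanh.grad.item(), x_sig.grad.item())   # 1.0 and 0.25
```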

ReLU: The Breakthrough That Changed Everything (2010–2012)

The Rectified Linear Unit is disarmingly simple:

\[\text{ReLU}(x) = \max(0, x)\]

This function was known for decades but was considered too simple — it’s not even differentiable at $x=0$! Yet it turned out to be perhaps the single most important activation function in deep learning history.

Why ReLU Works So Well

  1. No saturation for positive inputs: The gradient is exactly $1$ for $x > 0$, eliminating the vanishing gradient problem in the positive regime
  2. Sparsity: With roughly zero-mean pre-activations, about half the neurons output zero, creating sparse representations that are computationally efficient and often more meaningful
  3. Computational efficiency: Just a comparison against zero, with no exponentials to evaluate, so it is orders of magnitude cheaper than sigmoid or tanh
  4. Biological plausibility: Real neurons have a firing threshold; ReLU mimics this
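The sparsity claim is simple to verify empirically (a small illustrative check, not from the chapter's code): feed zero-mean Gaussian inputs through ReLU and count the zeros.

```python
import torch

# For zero-mean inputs, ReLU zeroes out roughly half the activations.
torch.manual_seed(0)
x = torch.randn(100_000)
out = torch.relu(x)

sparsity = (out == 0).float().mean().item()
print(f"fraction of zeros: {sparsity:.3f}")   # close to 0.5
```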

The ReLU Revolution in Practice

AlexNet (2012) used ReLU and reported training 6× faster than with tanh on the same architecture. This speed advantage compounded: faster training meant more experiments, more experiments meant faster progress.

The Dying ReLU Problem

ReLU has its own weakness: if a neuron’s input is always negative (perhaps due to a large negative bias or an unfortunate weight initialization), it outputs zero for all inputs and receives zero gradient. It is effectively “dead” and will never recover.

In practice, this means a trained network can have 10–50% of its neurons permanently dead, wasting capacity.
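A dead neuron can be constructed deliberately (the values below are hypothetical, chosen to force the pathology): give it a large negative bias so the pre-activation is negative for every input, and the weight receives exactly zero gradient.

```python
import torch

# A "dead" ReLU neuron: x*w + b < 0 for all typical inputs, so the
# output is always zero and no gradient flows back to the weight.
torch.manual_seed(0)
w = torch.tensor([0.1], requires_grad=True)
b = torch.tensor(-10.0)              # large negative bias

x = torch.randn(1000)                # typical inputs in roughly [-4, 4]
out = torch.relu(x * w + b)          # always exactly zero
out.sum().backward()

print(out.max().item(), w.grad.item())   # 0.0 and 0.0: the neuron is stuck
```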

ReLU Variants: Fixing the Dead Neuron Problem

Leaky ReLU (2013)

\[\text{LeakyReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \leq 0 \end{cases}\]

Typically $\alpha = 0.01$. The small negative slope ensures gradients never fully vanish, preventing dead neurons.
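The rescue mechanism is visible in a tiny example (illustrative values): even with a pre-activation deep in the negative regime, the weight still receives a gradient scaled by $\alpha$.

```python
import torch
import torch.nn.functional as F

# A neuron whose pre-activation is far below zero still gets a small,
# nonzero gradient through Leaky ReLU, so it can recover in training.
w = torch.tensor([0.1], requires_grad=True)
x = torch.tensor([1.0])

z = x * w - 10.0                              # deep in the negative regime
out = F.leaky_relu(z, negative_slope=0.01)
out.backward()

print(w.grad.item())   # 0.01 = alpha * dz/dw, small but nonzero
```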

Parametric ReLU (PReLU, 2015)

Same as Leaky ReLU, but $\alpha$ is a learnable parameter for each channel. He et al. showed this improved ImageNet accuracy at negligible computational cost.

Exponential Linear Unit (ELU, 2015)

\[\text{ELU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(e^x - 1) & \text{if } x \leq 0 \end{cases}\]

ELU has a smooth curve for negative inputs, producing negative outputs that push the mean activation closer to zero — similar to batch normalization’s effect.
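The mean-centering effect can be checked empirically (an illustrative sketch): for zero-mean Gaussian inputs, ELU's negative outputs pull the mean activation noticeably closer to zero than ReLU's.

```python
import torch
import torch.nn.functional as F

# Compare mean activations on zero-mean Gaussian inputs.
torch.manual_seed(0)
x = torch.randn(100_000)

relu_mean = F.relu(x).mean().item()   # about 0.40 (always non-negative)
elu_mean = F.elu(x).mean().item()     # closer to zero
print(relu_mean, elu_mean)
```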

Scaled Exponential Linear Unit (SELU, 2017)

SELU is carefully designed with specific values of $\alpha$ and a scaling factor $\lambda$ so that activations self-normalize — maintaining mean 0 and variance 1 through the network without batch normalization:

\[\text{SELU}(x) = \lambda \begin{cases} x & \text{if } x > 0 \\ \alpha(e^x - 1) & \text{if } x \leq 0 \end{cases}\]
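The self-normalizing property assumes LeCun-normal weight initialization (std $= 1/\sqrt{\text{fan-in}}$). A quick empirical check (illustrative; depth and width are arbitrary choices):

```python
import torch

# Pass standardized inputs through many linear + SELU layers with
# LeCun-normal weights and watch the statistics stay near (0, 1).
torch.manual_seed(0)
dim, depth = 512, 20

h = torch.randn(4096, dim)
for _ in range(depth):
    w = torch.randn(dim, dim) / dim ** 0.5   # LeCun initialization
    h = torch.selu(h @ w)

print(h.mean().item(), h.var().item())       # both stay near 0 and 1
```

Repeating the experiment with ReLU instead of SELU lets the variance drift, which is exactly what batch normalization would otherwise have to correct.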

GELU: The Modern Standard (2016/2020)

The Gaussian Error Linear Unit has become the default activation in Transformers:

\[\text{GELU}(x) = x \cdot \Phi(x)\]

where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution.

Intuitively, GELU is a “soft” version of ReLU: instead of a hard threshold at zero, it smoothly transitions from blocking to passing signals. The stochastic interpretation is that each input is multiplied by a Bernoulli random variable with success probability $\Phi(x)$, so larger inputs are more likely to pass through unchanged.
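The definition translates directly into code (a sketch): computing $\Phi(x)$ with the error function reproduces PyTorch's exact GELU.

```python
import torch
import torch.nn.functional as F

# GELU(x) = x * Phi(x), with Phi the standard normal CDF via erf.
x = torch.linspace(-3, 3, 7)

phi = 0.5 * (1.0 + torch.erf(x / 2 ** 0.5))   # standard normal CDF
gelu_manual = x * phi

print(torch.allclose(gelu_manual, F.gelu(x), atol=1e-6))   # True
```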

GELU is used in:

  - BERT and most BERT-derived encoders
  - The GPT family (GPT-2, GPT-3)
  - Vision Transformers (ViT)

SiLU / Swish (2017)

\[\text{SiLU}(x) = x \cdot \sigma(x)\]

Discovered by Ramachandran et al. through automated search over activation function spaces. SiLU (also called Swish) is smooth, non-monotonic, and slightly outperforms ReLU on deep networks. It’s used in EfficientNet and many modern architectures.
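The formula is a one-liner, which makes it easy to sanity-check against the library implementation (illustrative):

```python
import torch
import torch.nn.functional as F

# SiLU really is just x * sigmoid(x).
x = torch.linspace(-5, 5, 11)
silu_manual = x * torch.sigmoid(x)

print(torch.allclose(silu_manual, F.silu(x)))   # True
```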

GLU: Gated Linear Units (2016)

The Gated Linear Unit takes the gating concept to another level:

\[\text{GLU}(x, W, V, b, c) = (xW + b) \otimes \sigma(xV + c)\]

Here, $\otimes$ denotes element-wise multiplication. One linear transformation produces the “content,” and another produces a sigmoid “gate” that controls how much content passes through. This is a direct parallel to LSTM gates (Chapter 5) applied to feedforward layers.

Variants like SwiGLU (using SiLU instead of sigmoid for the gate) and GeGLU (using GELU) are now standard in large language models like LLaMA and PaLM.
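A SwiGLU feedforward block can be sketched in a few lines. This is a minimal illustration in the style of LLaMA-like models; the class, layer names, and sizes are my own choices, not taken from any particular codebase. Note that SwiGLU applies SiLU to one branch and uses the other as the content, so the sigmoid in the GLU formula above is absorbed into the SiLU:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated feedforward block: SiLU(x W) elementwise-times (x V)."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w = nn.Linear(dim, hidden, bias=False)    # gated branch
        self.v = nn.Linear(dim, hidden, bias=False)    # content branch
        self.out = nn.Linear(hidden, dim, bias=False)  # project back down

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.out(F.silu(self.w(x)) * self.v(x))

block = SwiGLU(dim=64, hidden=256)
y = block(torch.randn(2, 10, 64))
print(y.shape)   # torch.Size([2, 10, 64])
```

Swapping `F.silu` for `F.gelu` turns the same block into GeGLU.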

Softmax: The Classification Gate

While not a hidden-layer activation, softmax is the critical gate for classification:

\[\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}\]

Softmax converts a vector of raw scores into a probability distribution. It also plays a central role in the attention mechanism (Chapter 8), where it determines how much weight to give each element.
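Naively exponentiating large scores overflows in floating point, so implementations subtract the row maximum first, which leaves the result unchanged because softmax is shift-invariant. A standard sketch:

```python
import torch

def stable_softmax(x: torch.Tensor) -> torch.Tensor:
    """Softmax with the usual max-subtraction trick for stability."""
    z = x - x.max(dim=-1, keepdim=True).values   # shift-invariant
    e = torch.exp(z)
    return e / e.sum(dim=-1, keepdim=True)

scores = torch.tensor([[1000.0, 1001.0, 1002.0]])  # naive exp() overflows
print(stable_softmax(scores))                      # finite, sums to 1
```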

The Evolution at a Glance

| Year | Function | Key Innovation | Where Used |
|-------|------------|----------------------------------|------------------|
| 1960s | Sigmoid | First smooth, differentiable gate | Early networks |
| 1990s | Tanh | Zero-centered outputs | RNNs, MLPs |
| 2010 | ReLU | Non-saturating, sparse | CNNs (AlexNet+) |
| 2013 | Leaky ReLU | Fixes dying neurons | Various |
| 2015 | PReLU | Learnable negative slope | ResNets |
| 2016 | GELU | Smooth probabilistic gating | Transformers |
| 2017 | SiLU/Swish | Automated discovery | EfficientNet |
| 2020 | SwiGLU | Gated linear units with SiLU | LLaMA, PaLM |

Code Example: Comparing Activation Functions

# See code/ch02_activations.py for full visualization
import torch
import torch.nn.functional as F

x = torch.linspace(-5, 5, 1000)

activations = {
    "Sigmoid": torch.sigmoid(x),
    "Tanh": torch.tanh(x),
    "ReLU": F.relu(x),
    "GELU": F.gelu(x),
    "SiLU": F.silu(x),
}

Key Takeaways

The evolution of activation functions shows a recurring pattern in deep learning: simple ideas (like $\max(0, x)$) often beat complex ones, and the right activation function can make the difference between a network that trains and one that doesn’t.


Previous Chapter: The Perceptron, Backpropagation, and Early Neural Networks

Next Chapter: Convolutional Neural Networks — From LeNet to AlexNet

Back to Table of Contents