Chapter 2: Activation Functions — The Gates That Shape Gradients

Why Activation Functions Matter

Without activation functions, a neural network is just a series of matrix multiplications — which collapses into a single linear transformation, no matter how many layers you stack. Activation functions introduce non-linearity, allowing networks to learn complex decision boundaries and represent intricate functions.

But activation functions do far more than add non-linearity. They act as gates that control which signals pass forward through the network and how gradients flow backward during training.

The choice of activation function has been one of the most impactful decisions in deep learning history.

The Sigmoid: The Original Gate (1960s–2012)

The logistic sigmoid was the default activation for decades:

\[\sigma(x) = \frac{1}{1 + e^{-x}}\]

Properties:

  - Output range $(0, 1)$: bounded, but not zero-centered
  - Smooth and differentiable everywhere, with $\sigma'(x) = \sigma(x)(1 - \sigma(x))$
  - Saturates for large $|x|$, where the gradient approaches zero

Why it was popular: It’s biologically inspired (neurons either fire or don’t), outputs are bounded, and it’s differentiable for backpropagation.

The fatal flaw — saturation:

The maximum value of $\sigma’(x)$ is $0.25$ (at $x=0$). For large positive or negative inputs, the gradient approaches zero. In a network with $L$ layers, gradients multiply through each layer:

\[\frac{\partial \mathcal{L}}{\partial w^{(1)}} \propto \prod_{l=1}^{L} \sigma'(z^{(l)})\]

(writing $\mathcal{L}$ for the loss to avoid confusion with the layer count $L$).

With each factor at most $0.25$, a 10-layer network shrinks gradients by a factor of $0.25^{10} \approx 10^{-6}$. This is the vanishing gradient problem, and it made training deep networks with sigmoids practically impossible.
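This shrinkage is easy to observe directly. The sketch below (illustrative, not from the chapter's code) pushes a gradient through ten stacked sigmoids, with no weights at all, which is already the best case:

```python
import torch

# Stack 10 sigmoids and backpropagate; each layer contributes a factor
# sigma'(z) <= 0.25, so the gradient reaching the input is tiny even
# in this weight-free best-case setting.
x = torch.tensor([0.0], requires_grad=True)
h = x
for _ in range(10):
    h = torch.sigmoid(h)
h.backward()

print(x.grad.item())   # on the order of 1e-7
```

Adding weight matrices with typical initializations only makes the shrinkage worse.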

The Hyperbolic Tangent (tanh)

\[\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\]

Improvements over sigmoid:

  - Zero-centered output in $(-1, 1)$, which keeps the inputs to the next layer balanced around zero
  - Stronger gradients: the maximum of $\tanh'(x)$ is $1$ (at $x = 0$), versus $0.25$ for sigmoid

Tanh became the preferred activation in the 1990s and 2000s. It still saturates for large inputs, but the zero-centered output significantly improves training dynamics compared to sigmoid.
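The derivative advantage can be checked in a couple of lines (an illustrative sketch): tanh's maximum slope is 1 at the origin, four times sigmoid's maximum of 0.25, so stacked tanh layers shrink gradients more slowly.

```python
import torch

# Compare the maximum slopes of tanh and sigmoid via autograd.
x_tanh = torch.tensor([0.0], requires_grad=True)
torch.tanh(x_tanh).backward()      # tanh'(0) = 1 - tanh(0)^2 = 1

x_sig = torch.tensor([0.0], requires_grad=True)
torch.sigmoid(x_sig).backward()    # sigma'(0) = 0.5 * (1 - 0.5) = 0.25

print(x_tanh.grad.item(), x_sig.grad.item())   # 1.0 and 0.25
```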

ReLU: The Breakthrough That Changed Everything (2010–2012)

The Rectified Linear Unit is disarmingly simple:

\[\text{ReLU}(x) = \max(0, x)\]

This function was known for decades but was considered too simple — it’s not even differentiable at $x=0$! Yet it turned out to be perhaps the single most important activation function in deep learning history.

Why ReLU Works So Well

  1. No saturation for positive inputs: The gradient is exactly $1$ for $x > 0$, eliminating the vanishing gradient problem in the positive regime
  2. Sparsity: With roughly zero-mean pre-activations, about half the neurons output zero, creating sparse representations that are computationally efficient and often more meaningful
  3. Computational efficiency: Just a comparison against zero, with no exponentials to evaluate, so it is orders of magnitude cheaper than sigmoid or tanh
  4. Biological plausibility: Real neurons have a firing threshold; ReLU mimics this
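The sparsity claim is simple to verify empirically (a small illustrative check, not from the chapter's code): feed zero-mean Gaussian inputs through ReLU and count the zeros.

```python
import torch

# For zero-mean inputs, ReLU zeroes out roughly half the activations.
torch.manual_seed(0)
x = torch.randn(100_000)
out = torch.relu(x)

sparsity = (out == 0).float().mean().item()
print(f"fraction of zeros: {sparsity:.3f}")   # close to 0.5
```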

The ReLU Revolution in Practice

AlexNet (2012) used ReLU and reported training 6× faster than with tanh on the same architecture. This speed advantage compounded: faster training meant more experiments, more experiments meant faster progress.

The Dying ReLU Problem

ReLU has its own weakness: if a neuron’s input is always negative (perhaps due to a large negative bias or an unfortunate weight initialization), it outputs zero for all inputs and receives zero gradient. It is effectively “dead” and will never recover.

In practice, this means a trained network can have 10–50% of its neurons permanently dead, wasting capacity.
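A dead neuron can be constructed deliberately (the values below are hypothetical, chosen to force the pathology): give it a large negative bias so the pre-activation is negative for every input, and the weight receives exactly zero gradient.

```python
import torch

# A "dead" ReLU neuron: x*w + b < 0 for all typical inputs, so the
# output is always zero and no gradient flows back to the weight.
torch.manual_seed(0)
w = torch.tensor([0.1], requires_grad=True)
b = torch.tensor(-10.0)              # large negative bias

x = torch.randn(1000)                # typical inputs in roughly [-4, 4]
out = torch.relu(x * w + b)          # always exactly zero
out.sum().backward()

print(out.max().item(), w.grad.item())   # 0.0 and 0.0: the neuron is stuck
```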

ReLU Variants: Fixing the Dead Neuron Problem

Leaky ReLU (2013)

\[\text{LeakyReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \leq 0 \end{cases}\]

Typically $\alpha = 0.01$. The small negative slope ensures gradients never fully vanish, preventing dead neurons.
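The rescue mechanism is visible in a tiny example (illustrative values): even with a pre-activation deep in the negative regime, the weight still receives a gradient scaled by $\alpha$.

```python
import torch
import torch.nn.functional as F

# A neuron whose pre-activation is far below zero still gets a small,
# nonzero gradient through Leaky ReLU, so it can recover in training.
w = torch.tensor([0.1], requires_grad=True)
x = torch.tensor([1.0])

z = x * w - 10.0                              # deep in the negative regime
out = F.leaky_relu(z, negative_slope=0.01)
out.backward()

print(w.grad.item())   # 0.01 = alpha * dz/dw, small but nonzero
```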

Parametric ReLU (PReLU, 2015)

Same as Leaky ReLU, but $\alpha$ is a learnable parameter for each channel. He et al. showed this improved ImageNet accuracy at negligible computational cost.

Exponential Linear Unit (ELU, 2015)

\[\text{ELU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(e^x - 1) & \text{if } x \leq 0 \end{cases}\]

ELU has a smooth curve for negative inputs, producing negative outputs that push the mean activation closer to zero — similar to batch normalization’s effect.
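The mean-centering effect can be checked empirically (an illustrative sketch): for zero-mean Gaussian inputs, ELU's negative outputs pull the mean activation noticeably closer to zero than ReLU's.

```python
import torch
import torch.nn.functional as F

# Compare mean activations on zero-mean Gaussian inputs.
torch.manual_seed(0)
x = torch.randn(100_000)

relu_mean = F.relu(x).mean().item()   # about 0.40 (always non-negative)
elu_mean = F.elu(x).mean().item()     # closer to zero
print(relu_mean, elu_mean)
```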

Scaled Exponential Linear Unit (SELU, 2017)

SELU is carefully designed with specific values of $\alpha$ and a scaling factor $\lambda$ so that activations self-normalize — maintaining mean 0 and variance 1 through the network without batch normalization:

\[\text{SELU}(x) = \lambda \begin{cases} x & \text{if } x > 0 \\ \alpha(e^x - 1) & \text{if } x \leq 0 \end{cases}\]
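The self-normalizing property assumes LeCun-normal weight initialization (std $= 1/\sqrt{\text{fan-in}}$). A quick empirical check (illustrative; depth and width are arbitrary choices):

```python
import torch

# Pass standardized inputs through many linear + SELU layers with
# LeCun-normal weights and watch the statistics stay near (0, 1).
torch.manual_seed(0)
dim, depth = 512, 20

h = torch.randn(4096, dim)
for _ in range(depth):
    w = torch.randn(dim, dim) / dim ** 0.5   # LeCun initialization
    h = torch.selu(h @ w)

print(h.mean().item(), h.var().item())       # both stay near 0 and 1
```

Repeating the experiment with ReLU instead of SELU lets the variance drift, which is exactly what batch normalization would otherwise have to correct.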

GELU: The Modern Standard (2016/2020)

The Gaussian Error Linear Unit has become the default activation in Transformers:

\[\text{GELU}(x) = x \cdot \Phi(x)\]

where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution.

Intuitively, GELU is a “soft” version of ReLU: instead of a hard threshold at zero, it smoothly transitions from blocking to passing signals. The stochastic interpretation is that each input is multiplied by a Bernoulli random variable with success probability $\Phi(x)$, so larger inputs are more likely to pass through unchanged.
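The definition translates directly into code (a sketch): computing $\Phi(x)$ with the error function reproduces PyTorch's exact GELU.

```python
import torch
import torch.nn.functional as F

# GELU(x) = x * Phi(x), with Phi the standard normal CDF via erf.
x = torch.linspace(-3, 3, 7)

phi = 0.5 * (1.0 + torch.erf(x / 2 ** 0.5))   # standard normal CDF
gelu_manual = x * phi

print(torch.allclose(gelu_manual, F.gelu(x), atol=1e-6))   # True
```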

GELU is used in:

  - BERT and most BERT-derived encoders
  - The GPT family (GPT-2, GPT-3)
  - Vision Transformers (ViT)

SiLU / Swish (2017)

\[\text{SiLU}(x) = x \cdot \sigma(x)\]

Discovered by Ramachandran et al. through automated search over activation function spaces. SiLU (also called Swish) is smooth, non-monotonic, and slightly outperforms ReLU on deep networks. It’s used in EfficientNet and many modern architectures.
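The formula is a one-liner, which makes it easy to sanity-check against the library implementation (illustrative):

```python
import torch
import torch.nn.functional as F

# SiLU really is just x * sigmoid(x).
x = torch.linspace(-5, 5, 11)
silu_manual = x * torch.sigmoid(x)

print(torch.allclose(silu_manual, F.silu(x)))   # True
```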

GLU: Gated Linear Units (2016)

The Gated Linear Unit takes the gating concept to another level:

\[\text{GLU}(x, W, V, b, c) = (xW + b) \otimes \sigma(xV + c)\]

Here, $\otimes$ denotes element-wise multiplication. One linear transformation produces the “content,” and another produces a sigmoid “gate” that controls how much content passes through. This is a direct parallel to LSTM gates (Chapter 5) applied to feedforward layers.

Variants like SwiGLU (using SiLU instead of sigmoid for the gate) and GeGLU (using GELU) are now standard in large language models like LLaMA and PaLM.
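A SwiGLU feedforward block can be sketched in a few lines. This is a minimal illustration in the style of LLaMA-like models; the class, layer names, and sizes are my own choices, not taken from any particular codebase. Note that SwiGLU applies SiLU to one branch and uses the other as the content, so the sigmoid in the GLU formula above is absorbed into the SiLU:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated feedforward block: SiLU(x W) elementwise-times (x V)."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w = nn.Linear(dim, hidden, bias=False)    # gated branch
        self.v = nn.Linear(dim, hidden, bias=False)    # content branch
        self.out = nn.Linear(hidden, dim, bias=False)  # project back down

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.out(F.silu(self.w(x)) * self.v(x))

block = SwiGLU(dim=64, hidden=256)
y = block(torch.randn(2, 10, 64))
print(y.shape)   # torch.Size([2, 10, 64])
```

Swapping `F.silu` for `F.gelu` turns the same block into GeGLU.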

Softmax: The Classification Gate

While not a hidden-layer activation, softmax is the critical gate for classification:

\[\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}\]

Softmax converts a vector of raw scores into a probability distribution. It also plays a central role in the attention mechanism (Chapter 8), where it determines how much weight to give each element.
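Naively exponentiating large scores overflows in floating point, so implementations subtract the row maximum first, which leaves the result unchanged because softmax is shift-invariant. A standard sketch:

```python
import torch

def stable_softmax(x: torch.Tensor) -> torch.Tensor:
    """Softmax with the usual max-subtraction trick for stability."""
    z = x - x.max(dim=-1, keepdim=True).values   # shift-invariant
    e = torch.exp(z)
    return e / e.sum(dim=-1, keepdim=True)

scores = torch.tensor([[1000.0, 1001.0, 1002.0]])  # naive exp() overflows
print(stable_softmax(scores))                      # finite, sums to 1
```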

The Evolution at a Glance

| Year | Function | Key Innovation | Where Used |
|-------|------------|----------------------------------|------------------|
| 1960s | Sigmoid | First smooth, differentiable gate | Early networks |
| 1990s | Tanh | Zero-centered outputs | RNNs, MLPs |
| 2010 | ReLU | Non-saturating, sparse | CNNs (AlexNet+) |
| 2013 | Leaky ReLU | Fixes dying neurons | Various |
| 2015 | PReLU | Learnable negative slope | ResNets |
| 2016 | GELU | Smooth probabilistic gating | Transformers |
| 2017 | SiLU/Swish | Automated discovery | EfficientNet |
| 2020 | SwiGLU | Gated linear units with SiLU | LLaMA, PaLM |

Code Example: Comparing Activation Functions

# See code/ch02_activations.py for full visualization
import torch
import torch.nn.functional as F

x = torch.linspace(-5, 5, 1000)

activations = {
    "Sigmoid": torch.sigmoid(x),
    "Tanh": torch.tanh(x),
    "ReLU": F.relu(x),
    "GELU": F.gelu(x),
    "SiLU": F.silu(x),
}

Key Takeaways

The evolution of activation functions shows a recurring pattern in deep learning: simple ideas (like $\max(0, x)$) often beat complex ones, and the right activation function can make the difference between a network that trains and one that doesn’t.


Previous Chapter: The Perceptron, Backpropagation, and Early Neural Networks

Next Chapter: Convolutional Neural Networks — From LeNet to AlexNet

Back to Table of Contents