Without activation functions, a neural network is just a series of matrix multiplications — which collapses into a single linear transformation, no matter how many layers you stack. Activation functions introduce non-linearity, allowing networks to learn complex decision boundaries and represent intricate functions.
But activation functions do far more than add non-linearity. They act as gates, controlling which signals pass through the network and how strongly.
The choice of activation function has been one of the most impactful decisions in deep learning history.
The logistic sigmoid was the default activation for decades:
\[\sigma(x) = \frac{1}{1 + e^{-x}}\]

Properties: the output is bounded in $(0, 1)$, the function is monotonic, and its derivative has the convenient closed form $\sigma'(x) = \sigma(x)(1 - \sigma(x))$.
Why it was popular: It’s biologically inspired (neurons either fire or don’t), outputs are bounded, and it’s differentiable for backpropagation.
The fatal flaw — saturation:
The maximum value of $\sigma'(x)$ is $0.25$ (at $x = 0$). For large positive or negative inputs, the gradient approaches zero. In a network with $L$ layers, gradients multiply through each layer:
\[\frac{\partial \mathcal{L}}{\partial w^{(1)}} \propto \prod_{l=1}^{L} \sigma'(z^{(l)})\]

With each factor at most $0.25$, a 10-layer network shrinks gradients by a factor of at most $0.25^{10} \approx 10^{-6}$. This is the vanishing gradient problem, and it made training deep networks with sigmoids practically impossible.
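This shrinkage is easy to observe directly. The sketch below (illustrative depth of 10, scalar input) chains sigmoids and checks the gradient that reaches the input:

```python
import torch

# Push a value through a stack of sigmoids and measure the gradient
# arriving back at the input after backpropagation.
x = torch.tensor(0.5, requires_grad=True)
h = x
for _ in range(10):
    h = torch.sigmoid(h)
h.backward()

# Each layer contributes a factor sigma'(z) <= 0.25, so even at
# depth 10 the gradient at the input is vanishingly small.
print(x.grad)  # well below 1e-6
```

Stacking bare sigmoids like this exaggerates the effect slightly (there are no weights to rescale the signal), but the multiplicative decay is exactly the mechanism described above.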
The hyperbolic tangent, $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = 2\sigma(2x) - 1$, offers two improvements over sigmoid: its output is zero-centered in $(-1, 1)$, and its maximum derivative is $1$ rather than $0.25$, so gradients decay more slowly.
Tanh became the preferred activation in the 1990s and 2000s. It still saturates for large inputs, but the zero-centered output significantly improves training dynamics compared to sigmoid.
The Rectified Linear Unit is disarmingly simple:
\[\text{ReLU}(x) = \max(0, x)\]

This function was known for decades but was considered too simple — it’s not even differentiable at $x = 0$! Yet it turned out to be perhaps the single most important activation function in deep learning history.
AlexNet (2012) used ReLU and reported training 6× faster than with tanh on the same architecture. This speed advantage compounded: faster training meant more experiments, more experiments meant faster progress.
ReLU has its own weakness, known as the dying ReLU problem: if a neuron’s pre-activation is negative for every input (perhaps due to a large negative bias or an unfortunate weight initialization), it outputs zero everywhere and receives zero gradient. It is effectively “dead” and will never recover.
In practice, this means a trained network can have 10–50% of its neurons permanently dead, wasting capacity.
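A dead neuron is simple to construct deliberately. In this sketch, a pathologically large negative bias silences one ReLU unit for an entire batch, and the zero gradient shows why it cannot recover:

```python
import torch
import torch.nn.functional as F

# A single ReLU neuron with a bias so negative that every
# pre-activation in the batch is below zero.
torch.manual_seed(0)
w = torch.randn(4, requires_grad=True)
b = torch.tensor(-100.0, requires_grad=True)  # pathological bias
x = torch.randn(32, 4)  # batch of 32 inputs

out = F.relu(x @ w + b)  # every pre-activation is negative
out.sum().backward()

print(out.max())  # 0: the neuron is silent on the whole batch
print(w.grad)     # all zeros: no gradient signal ever reaches w
```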
\[\text{LeakyReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \leq 0 \end{cases}\]

with $\alpha = 0.01$ typically. The small negative slope ensures gradients never fully vanish, preventing dead neurons.
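A quick check that Leaky ReLU keeps a gradient alive on negative inputs (using the typical slope of 0.01):

```python
import torch
import torch.nn.functional as F

# The gradient of Leaky ReLU on a negative input is alpha, not zero.
x = torch.tensor(-3.0, requires_grad=True)
F.leaky_relu(x, negative_slope=0.01).backward()
print(x.grad)  # 0.01: small, but enough for the neuron to keep learning
```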
Same as Leaky ReLU, but $\alpha$ is a learnable parameter for each channel. He et al. showed this improved ImageNet accuracy at negligible computational cost.
\[\text{ELU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(e^x - 1) & \text{if } x \leq 0 \end{cases}\]

ELU has a smooth curve for negative inputs, producing negative outputs that push the mean activation closer to zero — similar to batch normalization’s effect.
SELU is carefully designed with specific values of $\alpha$ and a scaling factor $\lambda$ so that activations self-normalize — maintaining mean 0 and variance 1 through the network without batch normalization:
\[\text{SELU}(x) = \lambda \begin{cases} x & \text{if } x > 0 \\ \alpha(e^x - 1) & \text{if } x \leq 0 \end{cases}\]

with $\alpha \approx 1.6733$ and $\lambda \approx 1.0507$.

The Gaussian Error Linear Unit has become the default activation in Transformers:
\[\text{GELU}(x) = x \cdot \Phi(x)\]

where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution.
Intuitively, GELU is a “soft” version of ReLU: instead of a hard threshold at zero, it smoothly transitions from blocking to passing signals. The stochastic interpretation is that each neuron is multiplied by a Bernoulli random variable whose probability of being one is $\Phi(x)$, so larger inputs are more likely to pass through.
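The definition $x \cdot \Phi(x)$ can be verified against the built-in implementation, computing $\Phi$ via the error function:

```python
import torch

# GELU equals x * Phi(x), where Phi is the standard normal CDF:
# Phi(x) = 0.5 * (1 + erf(x / sqrt(2))).
x = torch.linspace(-5, 5, 101)
phi = 0.5 * (1 + torch.erf(x / 2 ** 0.5))
manual = x * phi
builtin = torch.nn.functional.gelu(x)

print(torch.allclose(manual, builtin, atol=1e-5))  # True
```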
GELU is used in BERT, the GPT family, Vision Transformers, and most other Transformer-based architectures.
Defined as $\text{SiLU}(x) = x \cdot \sigma(x)$, it was discovered by Ramachandran et al. through automated search over activation function spaces. SiLU (also called Swish) is smooth, non-monotonic, and slightly outperforms ReLU on deep networks. It’s used in EfficientNet and many modern architectures.
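A small sketch confirming the definition and the non-monotonic dip below zero that distinguishes SiLU from ReLU:

```python
import torch
import torch.nn.functional as F

# SiLU/Swish is x * sigmoid(x); unlike ReLU it is smooth and dips
# below zero for moderately negative inputs.
x = torch.linspace(-5, 5, 101)
print(torch.allclose(F.silu(x), x * torch.sigmoid(x), atol=1e-6))  # True

print(F.silu(torch.tensor(-1.0)))  # about -0.269: negative, unlike ReLU
```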
The Gated Linear Unit takes the gating concept to another level:
\[\text{GLU}(x, W, V, b, c) = (xW + b) \otimes \sigma(xV + c)\]

Here, $\otimes$ denotes element-wise multiplication. One linear transformation produces the “content,” and another produces a sigmoid “gate” that controls how much content passes through. This is a direct parallel to LSTM gates (Chapter 5) applied to feedforward layers.
Variants like SwiGLU (using SiLU instead of sigmoid for the gate) and GeGLU (using GELU) are now standard in large language models like LLaMA and PaLM.
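A SwiGLU feed-forward block can be sketched as follows. This is a minimal illustration, not any particular model's implementation; the class name, dimensions, and bias-free projections are chosen here for clarity (bias-free linears are common in LLM feed-forward blocks):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Feed-forward block gated by SiLU(xW) * xV (illustrative sketch)."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w = nn.Linear(d_model, d_hidden, bias=False)    # content path
        self.v = nn.Linear(d_model, d_hidden, bias=False)    # gate path
        self.out = nn.Linear(d_hidden, d_model, bias=False)  # projection back

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU replaces the plain GLU's sigmoid as the gating nonlinearity.
        return self.out(F.silu(self.w(x)) * self.v(x))

block = SwiGLU(d_model=64, d_hidden=256)
y = block(torch.randn(2, 10, 64))  # (batch, sequence, d_model)
print(y.shape)  # torch.Size([2, 10, 64])
```

Swapping `F.silu` for `F.gelu` in the gate path yields the GeGLU variant.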
While not a hidden-layer activation, softmax is the critical gate for classification:
\[\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}\]

Softmax converts a vector of raw scores into a probability distribution. It also plays a central role in the attention mechanism (Chapter 8), where it determines how much weight to give each element.
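One practical detail worth seeing in code: the textbook formula overflows on large logits, so implementations subtract the maximum first (which leaves the result unchanged, since softmax is shift-invariant):

```python
import torch

# Naive softmax overflows in float32 for large logits.
logits = torch.tensor([1000.0, 1001.0, 1002.0])

naive = torch.exp(logits) / torch.exp(logits).sum()
print(naive)  # nan everywhere: exp(1000) overflows to inf

# Standard stable form: subtract the max before exponentiating.
stable = torch.exp(logits - logits.max())
stable = stable / stable.sum()
print(stable)        # a valid probability distribution
print(stable.sum())  # 1.0
```

Built-in `torch.softmax` applies this max-subtraction internally, so the naive form is shown only to illustrate the failure mode.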
| Year | Function | Key Innovation | Where Used |
|---|---|---|---|
| 1960s | Sigmoid | First smooth, differentiable gate | Early networks |
| 1990s | Tanh | Zero-centered outputs | RNNs, MLPs |
| 2010 | ReLU | Non-saturating, sparse | CNNs (AlexNet+) |
| 2013 | Leaky ReLU | Fixes dying neurons | Various |
| 2015 | PReLU | Learnable negative slope | ResNets |
| 2016 | GELU | Smooth probabilistic gating | Transformers |
| 2017 | SiLU/Swish | Automated discovery | EfficientNet |
| 2020 | SwiGLU | Gated linear units with SiLU | LLaMA, PaLM |
```python
# See code/ch02_activations.py for full visualization
import torch
import torch.nn.functional as F

x = torch.linspace(-5, 5, 1000)
activations = {
    "Sigmoid": torch.sigmoid(x),
    "Tanh": torch.tanh(x),
    "ReLU": F.relu(x),
    "GELU": F.gelu(x),
    "SiLU": F.silu(x),
}
```
The evolution of activation functions shows a recurring pattern in deep learning: simple ideas (like max(0, x)) often beat complex ones, and the right activation function can make the difference between a network that trains and one that doesn’t.
Previous Chapter: The Perceptron, Backpropagation, and Early Neural Networks
Next Chapter: Convolutional Neural Networks — From LeNet to AlexNet