The story of deep learning begins with a simple question: can a machine learn from examples? In 1943, Warren McCulloch and Walter Pitts published a paper showing that networks of simplified neurons could, in principle, compute any logical function. Their model was binary — neurons either fired or didn’t — but the core insight was revolutionary: intelligence could emerge from the collective behavior of simple connected units.
In 1958, Frank Rosenblatt turned this theoretical idea into a practical learning machine. His perceptron was a single-layer neural network that could learn to classify inputs by adjusting its weights:
\[y = \text{sign}\left(\sum_{i=1}^{n} w_i x_i + b\right)\]

The perceptron learning rule was beautifully simple:

\[w_i \leftarrow w_i + \eta \,(y_{\text{target}} - y_{\text{predicted}})\, x_i\]
This was the first time a machine could learn from data rather than being explicitly programmed. Rosenblatt famously predicted that perceptrons would eventually be able to “walk, talk, see, write, reproduce itself and be conscious of its existence.”
An important theoretical result: if the data is linearly separable, the perceptron learning algorithm is guaranteed to converge to a solution in a finite number of steps. This gave the approach mathematical credibility.
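A minimal sketch makes the convergence theorem concrete (this is an illustration, not Rosenblatt's original implementation): train a perceptron with the rule above on AND, which is linearly separable, and the mistake-driven loop is guaranteed to terminate.

```python
import numpy as np

# AND is linearly separable, so the perceptron convergence
# theorem guarantees this training loop terminates.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])           # AND targets

w = np.zeros(2)
b = 0.0
eta = 0.1                            # learning rate

converged = False
while not converged:
    converged = True
    for xi, target in zip(X, y):
        pred = 1 if xi @ w + b > 0 else 0
        if pred != target:           # update only on mistakes
            w += eta * (target - pred) * xi
            b += eta * (target - pred)
            converged = False

print(w, b)                          # a separating hyperplane for AND
```

The loop exits only when every example is classified correctly, which is exactly what the convergence theorem promises for separable data.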
In 1969, Marvin Minsky and Seymour Papert published Perceptrons, a rigorous mathematical analysis showing that single-layer perceptrons could not learn the XOR function — or any non-linearly-separable problem.
The XOR problem is deceptively simple:
| x₁ | x₂ | XOR |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
No single hyperplane can separate the 1s from the 0s. Minsky and Papert knew that multi-layer networks could solve XOR, but they argued (correctly, at the time) that no one knew how to train them.
The impact was devastating. Funding dried up, researchers moved to other fields, and neural networks entered their first winter.
The solution to XOR was known conceptually: stack multiple layers. A two-layer network can solve XOR by building intermediate representations in its hidden layer.
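To see how intermediate representations crack the problem, here is one hand-constructed solution (the specific weights are illustrative, one of many valid choices): one hidden unit computes OR, another computes NAND, and the output unit ANDs them together, since XOR(x₁, x₂) = OR(x₁, x₂) AND NAND(x₁, x₂).

```python
import numpy as np

def step(z):
    """Threshold activation: fire if the weighted sum is positive."""
    return (z > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: column 1 computes OR, column 2 computes NAND.
W1 = np.array([[1, -1],
               [1, -1]])
b1 = np.array([-0.5, 1.5])

# Output layer: AND of the two hidden units.
W2 = np.array([[1], [1]])
b2 = np.array([-1.5])

h = step(X @ W1 + b1)     # intermediate representations
out = step(h @ W2 + b2)
print(out.ravel())        # [0 1 1 0]
```

The hidden layer re-maps the four inputs into a space where a single hyperplane does separate the classes, which is precisely what a single perceptron could not do.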
But the critical question remained: how do you train the hidden layers? The perceptron learning rule only works when you know the desired output of each neuron — and for hidden neurons, the desired output is not specified.
The answer is backpropagation — the chain rule of calculus applied systematically to compute how each weight in the network contributes to the error.
While the mathematical foundations existed earlier (Linnainmaa 1970, Werbos 1974), the 1986 paper by Rumelhart, Hinton, and Williams made backpropagation practical and demonstrated its power on real problems.
For a network with loss function $L$, the gradient of the loss with respect to any weight $w$ in any layer can be computed by applying the chain rule backward through the network:
\[\frac{\partial L}{\partial w_{ij}^{(l)}} = \frac{\partial L}{\partial a_j^{(l)}} \cdot \frac{\partial a_j^{(l)}}{\partial z_j^{(l)}} \cdot \frac{\partial z_j^{(l)}}{\partial w_{ij}^{(l)}}\]

Where:

- $z_j^{(l)} = \sum_i w_{ij}^{(l)} a_i^{(l-1)} + b_j^{(l)}$ is the pre-activation of neuron $j$ in layer $l$,
- $a_j^{(l)} = \sigma(z_j^{(l)})$ is its activation, and
- $w_{ij}^{(l)}$ is the weight connecting neuron $i$ in layer $l-1$ to neuron $j$ in layer $l$.
The key insight is that you don’t need to know the “correct” output for hidden neurons. Instead, you compute how much each hidden neuron contributed to the error at the output, and adjust it proportionally. The chain rule lets you decompose a complex credit assignment problem into a series of local computations.
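A minimal numeric sketch of this decomposition, assuming a single sigmoid neuron with squared-error loss for simplicity: assemble the gradient factor by factor with the chain rule and confirm it matches what autograd computes.

```python
import torch

# One sigmoid neuron, squared-error loss: L = (sigmoid(w*x + b) - t)^2
x, t = torch.tensor(0.5), torch.tensor(1.0)
w = torch.tensor(0.8, requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)

z = w * x + b                # pre-activation
a = torch.sigmoid(z)         # activation
L = (a - t) ** 2
L.backward()                 # autograd applies the chain rule for us

# The same gradient, assembled factor by factor by hand:
dL_da = 2 * (a - t)          # ∂L/∂a
da_dz = a * (1 - a)          # ∂a/∂z for the sigmoid
dz_dw = x                    # ∂z/∂w
manual = dL_da * da_dz * dz_dw

print(manual.item(), w.grad.item())   # the two values agree
```

Backpropagation is nothing more than this bookkeeping carried out layer by layer, reusing the upstream factors so each weight's gradient costs only a local multiplication.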
In 1989, George Cybenko proved that a feedforward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of $\mathbb{R}^n$, given enough neurons.
This was a powerful theoretical result: neural networks are universal function approximators. However, it says nothing about:

- how many hidden neurons are required (the number can be astronomically large),
- whether gradient descent can actually find the approximating weights, or
- how well the learned function generalizes beyond the training data.
These practical limitations would drive decades of subsequent research.
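The approximation power (though not the training caveats) is easy to observe empirically. A sketch, where the hyperparameters (64 hidden units, 2,000 Adam steps) are arbitrary choices for illustration: fit $\sin(x)$ with a single hidden tanh layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Target: sin(x) on [-pi, pi]
x = torch.linspace(-torch.pi, torch.pi, 200).unsqueeze(1)
y = torch.sin(x)

# One hidden layer -- all the theorem requires, given enough units.
net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=0.01)

initial_loss = nn.functional.mse_loss(net(x), y).item()
for _ in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(x), y)
    loss.backward()
    opt.step()

print(f"MSE: {initial_loss:.4f} -> {loss.item():.6f}")
```

Note that this success relies on gradient descent happening to find good weights here; the theorem itself offers no such guarantee.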
With backpropagation in hand, neural networks saw a revival in the late 1980s and early 1990s: Sejnowski and Rosenberg's NETtalk learned to pronounce English text, and Yann LeCun's convolutional networks read handwritten ZIP code digits for the U.S. Postal Service.
But serious problems remained: gradients vanished as networks grew deeper, training was painfully slow on the hardware of the day, and labeled datasets were far too small to exploit the models' capacity.
These factors led to the second AI winter for neural networks in the late 1990s, as support vector machines (SVMs) and, later, random forests dominated machine learning conferences.
The code below, excerpted from code/ch01_perceptron.py, trains a two-layer network on XOR, the task a single perceptron provably cannot learn:

```python
# See code/ch01_perceptron.py for the full implementation.
# A single perceptron cannot solve XOR, but a two-layer
# network trained with backpropagation can.
import torch
import torch.nn as nn

torch.manual_seed(0)

# XOR dataset
X = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=torch.float32)
y = torch.tensor([[0], [1], [1], [0]], dtype=torch.float32)

model = nn.Sequential(
    nn.Linear(2, 4),
    nn.Sigmoid(),
    nn.Linear(4, 1),
    nn.Sigmoid(),
)

# Train with backpropagation
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
loss_fn = nn.BCELoss()
for _ in range(2000):
    optimizer.zero_grad()
    loss_fn(model(X), y).backward()
    optimizer.step()

print(model(X).round())   # the network learns the XOR targets
```
The foundations were laid. But it would take new activation functions, new architectures, and vastly more compute to unleash their true potential.