The story of deep learning begins with a simple question: can a machine learn from examples? In 1943, Warren McCulloch and Walter Pitts published a paper showing that networks of simplified neurons could, in principle, compute any logical function. Their model was binary — neurons either fired or didn’t — but the core insight was revolutionary: intelligence could emerge from the collective behavior of simple connected units.
In 1958, Frank Rosenblatt turned this theoretical idea into a practical learning machine. His perceptron was a single-layer neural network that could learn to classify inputs by adjusting its weights:
\[y = \text{sign}\left(\sum_{i=1}^{n} w_i x_i + b\right)\]

The perceptron learning rule was beautifully simple:

\[w_i \leftarrow w_i + \eta \,(y_{\text{target}} - y_{\text{predicted}})\, x_i\]
This was the first time a machine could learn from data rather than being explicitly programmed. Rosenblatt famously predicted that perceptrons would eventually be able to “walk, talk, see, write, reproduce itself and be conscious of its existence.”
An important theoretical result: if the data is linearly separable, the perceptron learning algorithm is guaranteed to converge to a solution in a finite number of steps. This gave the approach mathematical credibility.
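A minimal sketch makes the convergence theorem concrete (this is an illustration, not Rosenblatt's original implementation): train a perceptron with the rule above on AND, which is linearly separable, and the mistake-driven loop is guaranteed to terminate.

```python
import numpy as np

# AND is linearly separable, so the perceptron convergence
# theorem guarantees this training loop terminates.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])           # AND targets

w = np.zeros(2)
b = 0.0
eta = 0.1                            # learning rate

converged = False
while not converged:
    converged = True
    for xi, target in zip(X, y):
        pred = 1 if xi @ w + b > 0 else 0
        if pred != target:           # update only on mistakes
            w += eta * (target - pred) * xi
            b += eta * (target - pred)
            converged = False

print(w, b)                          # a separating hyperplane for AND
```

The loop exits only when every example is classified correctly, which is exactly what the convergence theorem promises for separable data.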
In 1969, Marvin Minsky and Seymour Papert published Perceptrons, a rigorous mathematical analysis showing that single-layer perceptrons could not learn the XOR function — or any non-linearly-separable problem.
The XOR problem is deceptively simple:
| x₁ | x₂ | XOR |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
No single hyperplane can separate the 1s from the 0s. Minsky and Papert knew that multi-layer networks could solve XOR, but they argued (correctly, at the time) that no one knew how to train them.
The impact was devastating. Funding dried up, researchers moved to other fields, and neural networks entered their first winter.
The solution to XOR was known conceptually: stack multiple layers. A two-layer network can solve XOR by building intermediate representations in its hidden layer.
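To see how intermediate representations crack the problem, here is one hand-constructed solution (the specific weights are illustrative, one of many valid choices): one hidden unit computes OR, another computes NAND, and the output unit ANDs them together, since XOR(x₁, x₂) = OR(x₁, x₂) AND NAND(x₁, x₂).

```python
import numpy as np

def step(z):
    """Threshold activation: fire if the weighted sum is positive."""
    return (z > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: column 1 computes OR, column 2 computes NAND.
W1 = np.array([[1, -1],
               [1, -1]])
b1 = np.array([-0.5, 1.5])

# Output layer: AND of the two hidden units.
W2 = np.array([[1], [1]])
b2 = np.array([-1.5])

h = step(X @ W1 + b1)     # intermediate representations
out = step(h @ W2 + b2)
print(out.ravel())        # [0 1 1 0]
```

The hidden layer re-maps the four inputs into a space where a single hyperplane does separate the classes, which is precisely what a single perceptron could not do.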
But the critical question remained: how do you train the hidden layers? The perceptron learning rule only works when you know the desired output of each neuron — and for hidden neurons, the desired output is not specified.
The answer is backpropagation — the chain rule of calculus applied systematically to compute how each weight in the network contributes to the error.
While the mathematical foundations existed earlier (Linnainmaa 1970, Werbos 1974), the 1986 paper by Rumelhart, Hinton, and Williams made backpropagation practical and demonstrated its power on real problems.
For a network with loss function $L$, the gradient of the loss with respect to any weight $w$ in any layer can be computed by applying the chain rule backward through the network:
\[\frac{\partial L}{\partial w_{ij}^{(l)}} = \frac{\partial L}{\partial a_j^{(l)}} \cdot \frac{\partial a_j^{(l)}}{\partial z_j^{(l)}} \cdot \frac{\partial z_j^{(l)}}{\partial w_{ij}^{(l)}}\]

Where:

- $z_j^{(l)} = \sum_i w_{ij}^{(l)} a_i^{(l-1)} + b_j^{(l)}$ is the pre-activation of neuron $j$ in layer $l$,
- $a_j^{(l)} = \sigma(z_j^{(l)})$ is its activation, and
- $w_{ij}^{(l)}$ is the weight connecting neuron $i$ in layer $l-1$ to neuron $j$ in layer $l$.
The key insight is that you don’t need to know the “correct” output for hidden neurons. Instead, you compute how much each hidden neuron contributed to the error at the output, and adjust it proportionally. The chain rule lets you decompose a complex credit assignment problem into a series of local computations.
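A minimal numeric sketch of this decomposition, assuming a single sigmoid neuron with squared-error loss for simplicity: assemble the gradient factor by factor with the chain rule and confirm it matches what autograd computes.

```python
import torch

# One sigmoid neuron, squared-error loss: L = (sigmoid(w*x + b) - t)^2
x, t = torch.tensor(0.5), torch.tensor(1.0)
w = torch.tensor(0.8, requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)

z = w * x + b                # pre-activation
a = torch.sigmoid(z)         # activation
L = (a - t) ** 2
L.backward()                 # autograd applies the chain rule for us

# The same gradient, assembled factor by factor by hand:
dL_da = 2 * (a - t)          # ∂L/∂a
da_dz = a * (1 - a)          # ∂a/∂z for the sigmoid
dz_dw = x                    # ∂z/∂w
manual = dL_da * da_dz * dz_dw

print(manual.item(), w.grad.item())   # the two values agree
```

Backpropagation is nothing more than this bookkeeping carried out layer by layer, reusing the upstream factors so each weight's gradient costs only a local multiplication.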
In 1989, George Cybenko proved that a feedforward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of $\mathbb{R}^n$, given enough neurons.
This was a powerful theoretical result: neural networks are universal function approximators. However, it says nothing about:

- how many hidden neurons are required (the number can be astronomically large),
- whether gradient descent can actually find the approximating weights, or
- how well the learned function generalizes beyond the training data.
These practical limitations would drive decades of subsequent research.
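The approximation power (though not the training caveats) is easy to observe empirically. A sketch, where the hyperparameters (64 hidden units, 2,000 Adam steps) are arbitrary choices for illustration: fit $\sin(x)$ with a single hidden tanh layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Target: sin(x) on [-pi, pi]
x = torch.linspace(-torch.pi, torch.pi, 200).unsqueeze(1)
y = torch.sin(x)

# One hidden layer -- all the theorem requires, given enough units.
net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=0.01)

initial_loss = nn.functional.mse_loss(net(x), y).item()
for _ in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(x), y)
    loss.backward()
    opt.step()

print(f"MSE: {initial_loss:.4f} -> {loss.item():.6f}")
```

Note that this success relies on gradient descent happening to find good weights here; the theorem itself offers no such guarantee.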
With backpropagation in hand, neural networks saw a revival in the late 1980s and early 1990s: Sejnowski and Rosenberg's NETtalk learned to pronounce English text, and Yann LeCun's convolutional networks read handwritten ZIP code digits for the U.S. Postal Service.
But serious problems remained: gradients vanished as networks grew deeper, training was painfully slow on the hardware of the day, and labeled datasets were far too small to exploit the models' capacity.
These factors led to the second AI winter for neural networks in the late 1990s, as support vector machines (SVMs) and, later, random forests dominated machine learning conferences.
The code below, excerpted from code/ch01_perceptron.py, trains a two-layer network on XOR, the task a single perceptron provably cannot learn:

```python
# See code/ch01_perceptron.py for the full implementation.
# A single perceptron cannot solve XOR, but a two-layer
# network trained with backpropagation can.
import torch
import torch.nn as nn

torch.manual_seed(0)

# XOR dataset
X = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=torch.float32)
y = torch.tensor([[0], [1], [1], [0]], dtype=torch.float32)

model = nn.Sequential(
    nn.Linear(2, 4),
    nn.Sigmoid(),
    nn.Linear(4, 1),
    nn.Sigmoid(),
)

# Train with backpropagation
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
loss_fn = nn.BCELoss()
for _ in range(2000):
    optimizer.zero_grad()
    loss_fn(model(X), y).backward()
    optimizer.step()

print(model(X).round())   # the network learns the XOR targets
```
The foundations were laid. But it would take new activation functions, new architectures, and vastly more compute to unleash their true potential.