Deep learning did not emerge overnight. Its history spans more than seven decades — a winding path of brilliant insights, crushing disappointments, forgotten papers, and sudden breakthroughs. To understand where we are today, with models that write code, generate photorealistic images, and carry on conversations, we need to understand the chain of ideas that made it all possible.
This book tells that story chronologically, focusing on the key gates and strategies that unlocked each new era of capability.
Throughout this book, the word “gate” appears in multiple contexts, and that is intentional. Gates are the central metaphor of deep learning progress:
At every stage of deep learning’s evolution, progress came from finding better ways to control the flow of information and gradients through neural networks.
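The gate metaphor can be made concrete with a toy PyTorch snippet. This is an illustrative sketch only: the values are hand-picked, whereas real gates (in an LSTM, for example) learn their weights during training.

```python
import torch

# A sigmoid squashes any real number into (0, 1), producing a "gate" value
# that scales how much of a signal is allowed to pass through.
signal = torch.tensor([1.0, -2.0, 3.0])          # information trying to flow
gate_logits = torch.tensor([10.0, 0.0, -10.0])   # large -> open, negative -> shut
gate = torch.sigmoid(gate_logits)                # approx [1.0, 0.5, 0.0]
gated = gate * signal                            # approx [1.0, -1.0, 0.0]
```

The first component passes almost untouched, the second is halved, and the third is blocked: the gate decides, element by element, what flows forward.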
The story begins with McCulloch and Pitts’ mathematical model of a neuron in 1943, continues through Rosenblatt’s perceptron in 1958, and nearly dies during the “AI winters” of the 1970s and 1990s.
Everything changed in 2012, when AlexNet won the ImageNet competition by a massive margin, proving that deep neural networks trained on GPUs could outperform traditional computer-vision methods. That victory triggered an explosion of innovation.
The Transformer unlocked a new paradigm: scale the model, scale the data, and capabilities emerge. This era is defined by foundation models, self-supervised learning, and the surprising power of simply making things bigger.
Each chapter focuses on one major breakthrough or family of related innovations:
| Chapter | Topic | Key Period |
|---|---|---|
| 1 | Perceptrons and backpropagation | 1950s–1980s |
| 2 | Activation functions | 1960s–2015 |
| 3 | Convolutional neural networks | 1989–2012 |
| 4 | The vanishing gradient problem | 1991–2015 |
| 5 | Recurrent networks and gating (LSTM/GRU) | 1997–2014 |
| 6 | Regularization strategies | 2012–2016 |
| 7 | Residual networks and skip connections | 2015 |
| 8 | Attention and Transformers | 2014–2017 |
| 9 | Generative adversarial networks | 2014–2020 |
| 10 | Transfer learning and foundation models | 2018–present |
| 11 | Self-supervised learning | 2018–present |
| 12 | Scaling laws and large language models | 2020–present |
| 13 | Diffusion models | 2020–present |
| 14 | Optimization advances | Throughout |
| 15 | Future directions | 2024+ |
The chapters are designed to be read in order, since later breakthroughs build on earlier ones. However, each chapter is self-contained enough that you can jump to a topic of interest if you already have the background.
Code examples throughout the book use Python and PyTorch, kept short and focused on illustrating the core idea rather than providing production-ready implementations.
If there is one unifying thread in deep learning’s history, it is this:
Progress comes from finding better ways to move information and gradients through networks.
Every major breakthrough — from activation functions to skip connections to attention mechanisms — is fundamentally about solving an information flow problem. Keep this lens in mind as you read, and the entire arc of deep learning will make more sense.
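The information-flow lens can be sketched with a toy PyTorch example. The “block” here is a deliberately saturating stand-in, not any particular architecture; the point is only to contrast the gradient with and without an identity path.

```python
import torch

# Without a skip connection: a block whose output is constant in its input
# passes no gradient back -- earlier layers receive nothing.
x1 = torch.ones(3, requires_grad=True)
torch.tanh(x1 * 0.0).sum().backward()
# x1.grad is now tensor([0., 0., 0.])

# With a skip connection (y = x + F(x)): the identity path carries the
# gradient around the dead block, so it reaches earlier layers intact.
x2 = torch.ones(3, requires_grad=True)
(x2 + torch.tanh(x2 * 0.0)).sum().backward()
# x2.grad is now tensor([1., 1., 1.])
```

The same pattern, in different guises, recurs throughout the book: gates, skip connections, and attention all exist to keep signals like these from dying on the way through.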
Let’s begin.
Next Chapter: The Perceptron, Backpropagation, and Early Neural Networks