By 2020, GANs were the undisputed champions of image generation. StyleGAN2 could produce photorealistic faces, and BigGAN generated impressive ImageNet samples. But GANs had persistent problems: mode collapse, training instability, and difficulty with diverse, multi-modal distributions.
Diffusion models offered a radically different approach that solved all of these problems — at the cost of slower sampling. Within two years, they surpassed GANs in both image quality and diversity, and powered the generative AI revolution of 2022–2024.
The intuition behind diffusion models is elegant:
Gradually corrupt an image with Gaussian noise until nothing but static remains. If a network learns to reverse this noising process, you can start from pure random noise and generate new images by applying the reverse process iteratively.
Starting from a data sample $x_0$, the forward process adds Gaussian noise over $T$ steps according to a fixed schedule $\beta_1, …, \beta_T$:
\[q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} \, x_{t-1}, \beta_t I)\]As $T \to \infty$, $x_T$ approaches a standard Gaussian $\mathcal{N}(0, I)$.
A key property: you can sample $x_t$ at any time step directly from $x_0$ (no need to iterate):
\[x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)\]where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.
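The closed form can be sketched in a few lines of PyTorch. This assumes the linear $\beta$ schedule from the DDPM paper ($\beta_1 = 10^{-4}$ to $\beta_T = 0.02$ over $T = 1000$ steps); the helper name `q_sample` is illustrative:

```python
import torch

# Linear beta schedule (DDPM's values: beta_1 = 1e-4, beta_T = 0.02, T = 1000)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)   # \bar{alpha}_t for t = 1..T

def q_sample(x_0, t, noise):
    """Sample x_t directly from x_0 using the closed form (no iteration)."""
    ab = alpha_bars[t].view(-1, 1, 1, 1)    # broadcast over (B, C, H, W)
    return torch.sqrt(ab) * x_0 + torch.sqrt(1 - ab) * noise

x_0 = torch.randn(4, 3, 32, 32)             # a batch of "images"
t = torch.randint(0, T, (4,))
x_t = q_sample(x_0, t, torch.randn_like(x_0))
```

Note that `alpha_bars` decreases monotonically toward zero, so by the final step the signal term has all but vanished and $x_T$ is effectively pure Gaussian noise.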
The reverse process learns to denoise:
\[p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)\]The neural network $\mu_\theta$ predicts the mean of the denoised distribution at each step.
Ho et al. (2020) showed that instead of predicting the mean $\mu_\theta$, the network can predict the noise $\epsilon_\theta(x_t, t)$ that was added. The training objective simplifies to:
\[L = \mathbb{E}_{t, x_0, \epsilon} \left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right]\]This is remarkably simple: sample a random image, add noise at a random time step, and train the network to predict what noise was added.
Ho, Jain, and Abbeel published the paper that made diffusion models practical. Their key contributions:
- Parameterizing the network to predict the added noise $\epsilon$ rather than the mean $\mu_\theta$
- A simplified training objective: plain MSE between true and predicted noise
- Fixing the reverse-process variances $\sigma_t^2$ by the schedule rather than learning them
- Sample quality competitive with GANs on CIFAR-10 and LSUN
The noise prediction network is typically a U-Net (first introduced for medical image segmentation, Chapter 7):
x_t, t ──→ [DownBlock] → [DownBlock] → [MiddleBlock] → [UpBlock] → [UpBlock] → ε_θ
               │                                           ↑
               └─────────────── skip connections ──────────┘
Key components:
- Residual blocks with group normalization at each resolution
- Self-attention layers at the lower resolutions
- A sinusoidal embedding of the timestep $t$, injected into every block
- Skip connections from each down block to its matching up block
To generate an image, start from pure noise and iterate the reverse process:
x_T ~ N(0, I) → x_{T-1} → ... → x_1 → x_0 (generated image)
At each step: \(x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right) + \sigma_t z\)
where $z \sim \mathcal{N}(0, I)$ is fresh noise (stochasticity).
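Putting the update rule into a loop gives the full sampler. This is a sketch that assumes a trained `model(x, t)` predicting $\epsilon$ and the linear $\beta$ schedule defined earlier; it uses the simple choice $\sigma_t^2 = \beta_t$:

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, betas):
    """Generate a sample by iterating the reverse process from pure noise."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                               # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        eps = model(x, torch.full((shape[0],), t))
        mean = (x - (1 - alphas[t]) / torch.sqrt(1 - alpha_bars[t]) * eps) \
               / torch.sqrt(alphas[t])
        if t > 0:
            x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)  # sigma_t z
        else:
            x = mean                                     # no noise at the final step
    return x
```

With $T = 1000$, this means 1,000 full forward passes of the network per image, which is why DDPM sampling is slow.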
Song et al. introduced Denoising Diffusion Implicit Models (DDIM), which allow:
- Deterministic sampling (no fresh noise added at each step)
- Skipping steps: generating with 50–100 steps instead of 1,000
- Reusing the same trained network, with no retraining
- Meaningful latent interpolation, since the noise-to-image mapping is deterministic
This reduced sampling time from minutes to seconds, making diffusion models practical.
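A single deterministic DDIM update ($\eta = 0$) can be sketched as follows: first invert the closed-form forward equation to estimate $x_0$, then re-noise that estimate to the earlier timestep. The helper name `ddim_step` is illustrative:

```python
import torch

def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0)."""
    # Invert x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps to estimate x_0
    x0_pred = (x_t - torch.sqrt(1 - alpha_bar_t) * eps) / torch.sqrt(alpha_bar_t)
    # Jump directly to the earlier timestep using the same predicted noise
    return torch.sqrt(alpha_bar_prev) * x0_pred + torch.sqrt(1 - alpha_bar_prev) * eps
```

Because the update is deterministic and only needs $\bar{\alpha}$ at the two endpoints, the "previous" step need not be adjacent: sampling can jump from step 1000 to 980 to 960, which is how few-step sampling works.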
Ho and Salimans introduced classifier-free guidance, the key technique that made conditional generation work spectacularly well.
Train the same diffusion model both conditionally (with a text prompt) and unconditionally (no prompt, using a null embedding). At sampling time, amplify the difference:
\[\hat{\epsilon}_\theta(x_t, t, c) = \epsilon_\theta(x_t, t, \emptyset) + w \cdot (\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset))\]where $w > 1$ is the guidance scale. Higher $w$ produces images that more closely match the prompt but with less diversity.
The guidance scale $w$ acts as a “creativity dial”: low values (near 1) give diverse samples that follow the prompt loosely, typical values (around 7–8 for Stable Diffusion) balance fidelity and diversity, and very high values yield prompt-faithful but oversaturated, less varied images.
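The guidance formula translates directly into code. This sketch assumes a model that accepts a conditioning embedding as its third argument, with a null embedding standing in for "no prompt"; the helper name `cfg_eps` is illustrative:

```python
import torch

def cfg_eps(model, x_t, t, cond, null_cond, w):
    """Classifier-free guidance: amplify the conditional direction by scale w."""
    eps_uncond = model(x_t, t, null_cond)   # epsilon_theta(x_t, t, null)
    eps_cond = model(x_t, t, cond)          # epsilon_theta(x_t, t, c)
    return eps_uncond + w * (eps_cond - eps_uncond)
```

In practice the two predictions are usually computed in one batched forward pass, so guidance roughly doubles the cost of each sampling step.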
Running diffusion in pixel space is expensive: a 512×512 image has 786,432 dimensions. Rombach et al. proposed running diffusion in a compressed latent space:
This reduces the dimensionality by ~48×, making training and sampling dramatically faster.
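The arithmetic behind the ~48× figure is quick to check, assuming Stable Diffusion's latent shape (8× spatial downsampling, 4 latent channels):

```python
pixel_dims = 512 * 512 * 3    # RGB image in pixel space
latent_dims = 64 * 64 * 4     # latent: 8x downsampled spatially, 4 channels
print(pixel_dims, latent_dims, pixel_dims / latent_dims)  # 786432 16384 48.0
```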
Text Prompt → CLIP Text Encoder
                     │ (cross-attention)
                     ↓
Random Noise z_T → U-Net (in latent space) → z_0 → VAE Decoder → Image
Stable Diffusion (2022) combined:
- Latent diffusion: a VAE compresses images, and diffusion runs in the latent space
- A CLIP text encoder feeding the U-Net through cross-attention
- Classifier-free guidance for prompt adherence
- Openly released weights
The result was the first high-quality, open-source text-to-image model, igniting the generative AI revolution.
OpenAI’s system used a two-stage approach: a prior maps the CLIP text embedding to a CLIP image embedding, and a diffusion decoder then generates the image conditioned on that embedding.
Google’s Imagen used a text-to-image diffusion model with a frozen T5 text encoder (a large language model). Key finding: scaling the text encoder matters more than scaling the image model.
Peebles and Xie replaced the U-Net with a Transformer as the diffusion backbone:
\[\text{DiT}(z_t, t, c) = \text{Transformer}(\text{patchify}(z_t) + \text{pos\_emb}, \text{AdaLN}(t, c))\]Key innovations:
- Operating on a sequence of latent patches rather than a spatial feature map
- Conditioning on $t$ and $c$ via adaptive layer norm (adaLN-Zero) instead of extra tokens
- Predictable scaling: sample quality improves smoothly with transformer compute
DiT is the architecture behind DALL·E 3, Sora, and Stable Diffusion 3.
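The patchify step above is a plain reshape. A minimal sketch, assuming a latent of shape (B, C, H, W) and DiT's common patch size of 2:

```python
import torch

def patchify(z, p=2):
    """Split a latent (B, C, H, W) into flattened p x p patches: (B, N, C*p*p)."""
    B, C, H, W = z.shape
    z = z.reshape(B, C, H // p, p, W // p, p)   # split H and W into patch grids
    z = z.permute(0, 2, 4, 1, 3, 5)             # (B, H/p, W/p, C, p, p)
    return z.reshape(B, (H // p) * (W // p), C * p * p)

tokens = patchify(torch.randn(1, 4, 32, 32), p=2)   # 256 tokens of dimension 16
```

Each token is then linearly projected to the transformer width, exactly as words are embedded in a language model.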
Flow Matching (Lipman et al., 2022) simplifies diffusion by learning a velocity field that transports noise to data along straight paths:
\[\frac{dx_t}{dt} = v_\theta(x_t, t)\]Instead of the complex noise schedule of diffusion models, flow matching uses a simple linear interpolation:
\[x_t = (1 - t) \cdot x_0 + t \cdot \epsilon\]This is simpler to implement, faster to train, and produces comparable results. Stable Diffusion 3 and FLUX use flow matching.
# See code/ch13_diffusion.py for full implementation
import torch
import torch.nn.functional as F

def training_step(model, x_0, noise_schedule):
    """One training step of a diffusion model.

    noise_schedule holds the cumulative products \bar{alpha}_t, indexed by timestep.
    """
    # Sample a random timestep for each image in the batch
    t = torch.randint(0, len(noise_schedule), (x_0.shape[0],))
    # Sample noise
    noise = torch.randn_like(x_0)
    # Create the noisy version x_t in closed form
    alpha_bar = noise_schedule[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(alpha_bar) * x_0 + torch.sqrt(1 - alpha_bar) * noise
    # Predict the noise that was added
    predicted_noise = model(x_t, t)
    # Simple MSE loss
    return F.mse_loss(predicted_noise, noise)
Previous Chapter: Scaling Laws and Large Language Models
Next Chapter: Optimization Advances — Making Training Practical