Chapter 13: Diffusion Models — A New Paradigm for Generation (2020–present)

From GANs to Diffusion

By 2020, GANs were the undisputed champions of image generation. StyleGAN2 could produce photorealistic faces, and BigGAN generated impressive ImageNet samples. But GANs had persistent problems: mode collapse, training instability, and difficulty with diverse, multi-modal distributions.

Diffusion models offered a radically different approach that sidestepped all three problems, at the cost of much slower sampling. Within two years, they surpassed GANs in both image quality and diversity, and powered the generative AI revolution of 2022–2024.

The Core Idea: Destroy and Reconstruct

The intuition behind diffusion models is elegant:

  1. Forward process (easy): Gradually add Gaussian noise to a real image until it becomes pure noise
  2. Reverse process (learned): Train a neural network to gradually remove noise, reconstructing the image step by step

If the network learns to reverse the noise process, you can start from pure random noise and generate new images by applying the reverse process iteratively.

The Mathematical Framework

Forward Process (Adding Noise)

Starting from a data sample $x_0$, the forward process adds Gaussian noise over $T$ steps according to a fixed schedule $\beta_1, …, \beta_T$:

\[q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} \, x_{t-1}, \beta_t I)\]

As $T \to \infty$, $x_T$ approaches a standard Gaussian $\mathcal{N}(0, I)$.

A key property: you can sample $x_t$ at any time step directly from $x_0$ (no need to iterate):

\[x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)\]

where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.
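The closed-form property can be made concrete with a short PyTorch sketch. The linear schedule values and function names here are illustrative, not from any particular library:

```python
import torch

def linear_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t for a linear beta schedule."""
    alphas = 1.0 - torch.linspace(beta_start, beta_end, T)
    return torch.cumprod(alphas, dim=0)  # alpha_bar[t] = prod of alpha_s up to t

def q_sample(x_0, t, alpha_bar):
    """Sample x_t directly from x_0 -- no iteration over earlier steps."""
    noise = torch.randn_like(x_0)
    ab = alpha_bar[t].view(-1, 1, 1, 1)  # broadcast per-image alpha_bar_t
    return ab.sqrt() * x_0 + (1 - ab).sqrt() * noise, noise
```

Note that $\bar{\alpha}_T$ is tiny after 1000 steps, so $x_T$ is statistically indistinguishable from pure Gaussian noise.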

Reverse Process (Removing Noise)

The reverse process learns to denoise:

\[p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)\]

The neural network $\mu_\theta$ predicts the mean of the denoised distribution at each step.

The Noise Prediction Reformulation

Ho et al. (2020) showed that instead of predicting the mean $\mu_\theta$, the network can predict the noise $\epsilon_\theta(x_t, t)$ that was added. The training objective simplifies to:

\[L = \mathbb{E}_{t, x_0, \epsilon} \left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right]\]

This is remarkably simple: sample a random image, add noise at a random time step, and train the network to predict what noise was added.

DDPM: Denoising Diffusion Probabilistic Models (2020)

Ho, Jain, and Abbeel published the paper that made diffusion models practical. Their key contributions:

  1. Simplified training objective: The noise prediction loss above
  2. Architecture: A U-Net with residual blocks, attention layers, and time-step conditioning
  3. Results: Generated images rivaling GANs in quality, with superior diversity

The U-Net Architecture for Diffusion

The noise prediction network is typically a U-Net (first introduced for medical image segmentation, Chapter 7):

x_t + t ──→ [DownBlock] → [DownBlock] → [MiddleBlock] → [UpBlock] → [UpBlock] → ε_θ
              ↓                                              ↑
              └──────── Skip Connections ────────────────────┘

Key components:

  1. Residual blocks at each resolution of the encoder and decoder
  2. Self-attention layers at the lower resolutions
  3. Time-step conditioning: $t$ is encoded with sinusoidal embeddings and injected into every block
  4. Skip connections carrying fine spatial detail from the downsampling path to the upsampling path

Sampling Process

To generate an image, start from pure noise and iterate the reverse process:

x_T ~ N(0, I)  →  x_{T-1}  →  ...  →  x_1  →  x_0 (generated image)

At each step: \(x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right) + \sigma_t z\)

where $z \sim \mathcal{N}(0, I)$ is fresh noise (stochasticity).
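The full reverse loop can be sketched as follows. This is a minimal version assuming a trained noise-prediction model, and it uses $\sigma_t^2 = \beta_t$, one common choice for the reverse-process variance:

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, betas):
    """Generate a sample by iterating the learned reverse process."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                       # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch)                  # predicted noise
        coef = (1 - alphas[t]) / (1 - alpha_bar[t]).sqrt()
        mean = (x - coef * eps) / alphas[t].sqrt()
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)  # fresh noise z
        else:
            x = mean                             # no noise at the final step
    return x
```

Note that the final step omits the noise term: $x_0$ is taken to be the predicted mean.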

Improved Sampling: DDIM (2020)

Song et al. introduced Denoising Diffusion Implicit Models (DDIM), which allow:

  1. Deterministic sampling: Remove the stochastic noise term, making generation reproducible
  2. Fewer steps: Skip steps in the reverse process (e.g., 50 steps instead of 1000) with minimal quality loss
  3. Latent space interpolation: Smoothly interpolate between generated images

This reduced sampling time from minutes to seconds, making diffusion models practical.
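A single deterministic DDIM update (the $\eta = 0$ case) can be sketched like this, where `ab_t` and `ab_prev` are $\bar{\alpha}$ at the current and target steps; because `ab_prev` may come from a much earlier step in a coarse subsequence, this is what enables skipping steps:

```python
import torch

@torch.no_grad()
def ddim_step(x_t, eps, ab_t, ab_prev):
    """One deterministic DDIM update from step t to an earlier step."""
    # Estimate x_0 from the current sample and the predicted noise
    x0_pred = (x_t - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()
    # Jump to the earlier step along the deterministic trajectory
    return ab_prev.sqrt() * x0_pred + (1 - ab_prev).sqrt() * eps
```

Because no fresh noise is injected, running the same initial $x_T$ through the same model twice yields the same image.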

Classifier-Free Guidance (2022)

Ho and Salimans introduced classifier-free guidance, the key technique that made conditional generation work spectacularly well.

The Idea

Train the same diffusion model both conditionally (with a text prompt) and unconditionally (no prompt, using a null embedding). At sampling time, amplify the difference:

\[\hat{\epsilon}_\theta(x_t, t, c) = \epsilon_\theta(x_t, t, \emptyset) + w \cdot (\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset))\]

where $w > 1$ is the guidance scale. Higher $w$ produces images that more closely match the prompt but with less diversity.

The guidance scale $w$ is the “creativity dial”: low values yield diverse samples that follow the prompt only loosely, while high values yield prompt-faithful but less varied (and often over-saturated) images.
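At sampling time, the guided prediction costs two forward passes per step. A sketch, where `null_cond` stands in for the null embedding:

```python
import torch

def guided_eps(model, x_t, t, cond, null_cond, w):
    """Classifier-free guidance: amplify the conditional direction."""
    eps_uncond = model(x_t, t, null_cond)  # prediction with the null embedding
    eps_cond = model(x_t, t, cond)         # prediction with the text condition
    return eps_uncond + w * (eps_cond - eps_uncond)
```

Setting $w = 1$ recovers the plain conditional prediction; $w > 1$ extrapolates past it, away from the unconditional one.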

Latent Diffusion Models and Stable Diffusion (2021–2022)

The Efficiency Problem

Running diffusion in pixel space is expensive: a 512×512 image has 786,432 dimensions. Rombach et al. proposed running diffusion in a compressed latent space:

  1. Encode: Use a pretrained autoencoder (VAE) to compress the image: $z = E(x)$, typically 64×64×4
  2. Diffuse: Run the diffusion process in latent space (much smaller)
  3. Decode: Use the VAE decoder to convert back to pixels: $\hat{x} = D(\hat{z})$

This reduces the dimensionality by ~48×, making training and sampling dramatically faster.
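The ~48× figure follows directly from the dimensions above:

```python
pixel_dims = 512 * 512 * 3     # RGB image in pixel space
latent_dims = 64 * 64 * 4      # typical Stable Diffusion latent tensor
print(pixel_dims / latent_dims)  # → 48.0
```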

Stable Diffusion Architecture

Text Prompt → CLIP Text Encoder → Cross-Attention
                                       ↓
Random Noise z_T → U-Net (in latent space) → z_0 → VAE Decoder → Image

Stable Diffusion (2022) combined:

  1. A VAE that compresses images into a 64×64×4 latent space
  2. A U-Net diffusion model operating entirely in that latent space
  3. A CLIP text encoder feeding the U-Net through cross-attention
  4. Classifier-free guidance for prompt adherence
  5. Openly released weights that anyone could download and fine-tune

The result was the first high-quality, open-source text-to-image model, igniting the generative AI revolution.

DALL·E 2 and Imagen

DALL·E 2 (2022)

OpenAI’s system used a two-stage approach:

  1. A prior model generates CLIP image embeddings from text
  2. A diffusion decoder generates images from the CLIP embeddings

Imagen (2022)

Google’s Imagen used a text-to-image diffusion model with a frozen T5 text encoder (a large language model). Key finding: scaling the text encoder matters more than scaling the image model.

Beyond Images: Diffusion for Other Modalities

Video Generation

Diffusion over spatiotemporal latents powers modern video generators such as OpenAI's Sora, which treats video as sequences of spacetime patches.

Audio Generation

Latent diffusion also works for sound: models such as AudioLDM generate audio and music from text descriptions.

3D Generation

DreamFusion (2022) distills a 2D text-to-image diffusion prior into a 3D representation via score distillation, requiring no 3D training data.

Molecule and Protein Design

RFdiffusion generates novel protein backbones by denoising in structure space, extending diffusion well beyond perceptual data.

Diffusion Transformers (DiT, 2022)

Peebles and Xie replaced the U-Net with a Transformer as the diffusion backbone:

\[\text{DiT}(z_t, t, c) = \text{Transformer}(\text{patchify}(z_t) + \text{pos\_emb}, \text{AdaLN}(t, c))\]

Key innovations:

  1. Patchify: the latent $z_t$ is split into patches and processed as a token sequence
  2. AdaLN conditioning: adaptive layer normalization injects the timestep $t$ and condition $c$ into every block
  3. Scalability: sample quality improves predictably with model size, mirroring language-model scaling

DiT is the architecture behind DALL·E 3, Sora, and Stable Diffusion 3.
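The patchify step in the DiT equation above is a pure reshape. A minimal sketch (the patch size and tensor shapes are illustrative):

```python
import torch

def patchify(z, patch=2):
    """Split a latent map (B, C, H, W) into a sequence of flattened patches."""
    B, C, H, W = z.shape
    z = z.reshape(B, C, H // patch, patch, W // patch, patch)
    z = z.permute(0, 2, 4, 1, 3, 5)  # (B, H/p, W/p, C, p, p)
    return z.reshape(B, (H // patch) * (W // patch), C * patch * patch)
```

For a 64×64×4 latent with patch size 2, this yields 1,024 tokens of dimension 16, which are then linearly embedded and fed to the Transformer.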

Flow Matching (2022–2023)

Flow Matching (Lipman et al., 2022) simplifies diffusion by learning a velocity field that transports noise to data along straight paths:

\[\frac{dx_t}{dt} = v_\theta(x_t, t)\]

Instead of the complex noise schedule of diffusion models, flow matching uses a simple linear interpolation:

\[x_t = (1 - t) \cdot x_0 + t \cdot \epsilon\]

This is simpler to implement, faster to train, and produces comparable results. Stable Diffusion 3 and FLUX use flow matching.
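A minimal flow-matching training step under the linear path above; the velocity target $\epsilon - x_0$ follows from differentiating $x_t$ with respect to $t$. A sketch, assuming a model that takes `(x_t, t)`:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x_0):
    """Regress the constant velocity eps - x_0 along the straight path
    x_t = (1 - t) * x_0 + t * eps."""
    eps = torch.randn_like(x_0)
    t = torch.rand(x_0.shape[0]).view(-1, 1, 1, 1)  # t ~ Uniform(0, 1)
    x_t = (1 - t) * x_0 + t * eps
    v_target = eps - x_0                             # d x_t / d t
    return F.mse_loss(model(x_t, t.flatten()), v_target)
```

Compared with the DDPM objective, there is no beta schedule and no $\bar{\alpha}$ bookkeeping; sampling integrates the learned ODE from $t = 1$ back to $t = 0$.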

Code Example: Simplified Diffusion Training Step

# See code/ch13_diffusion.py for full implementation
import torch
import torch.nn.functional as F

def training_step(model, x_0, noise_schedule):
    """One training step; noise_schedule holds the cumulative products alpha_bar_t."""
    # Sample a random timestep for each image in the batch
    t = torch.randint(0, len(noise_schedule), (x_0.shape[0],), device=x_0.device)
    # Sample Gaussian noise
    noise = torch.randn_like(x_0)
    # Create the noisy version x_t in closed form
    alpha_bar = noise_schedule[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(alpha_bar) * x_0 + torch.sqrt(1 - alpha_bar) * noise
    # Predict the added noise
    predicted_noise = model(x_t, t)
    # Simple MSE loss between true and predicted noise
    return F.mse_loss(predicted_noise, noise)

Key Takeaways

  1. Diffusion models generate by learning to reverse a gradual noising process
  2. Training reduces to a simple MSE loss on predicted noise (DDPM)
  3. DDIM-style samplers cut generation from ~1000 steps to ~50
  4. Classifier-free guidance trades diversity for prompt fidelity
  5. Latent diffusion made high-resolution text-to-image generation affordable and open
  6. Transformer backbones (DiT) and flow matching define the current state of the art


Previous Chapter: Scaling Laws and Large Language Models

Next Chapter: Optimization Advances — Making Training Practical

Back to Table of Contents