By 2020, GANs were the undisputed champions of image generation. StyleGAN2 could produce photorealistic faces, and BigGAN generated impressive ImageNet samples. But GANs had persistent problems: mode collapse, training instability, and difficulty with diverse, multi-modal distributions.
Diffusion models offered a radically different approach that solved all of these problems — at the cost of slower sampling. Within two years, they surpassed GANs in both image quality and diversity, and powered the generative AI revolution of 2022–2024.
The intuition behind diffusion models is elegant:
Gradually corrupt an image with Gaussian noise until nothing but static remains. If a network learns to reverse this noising process, you can start from pure random noise and generate new images by applying the reverse process iteratively.
Starting from a data sample $x_0$, the forward process adds Gaussian noise over $T$ steps according to a fixed schedule $\beta_1, …, \beta_T$:
\[q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} \, x_{t-1}, \beta_t I)\]As $T \to \infty$, $x_T$ approaches a standard Gaussian $\mathcal{N}(0, I)$.
A key property: you can sample $x_t$ at any time step directly from $x_0$ (no need to iterate):
\[x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)\]where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.
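The closed form can be sketched in a few lines of PyTorch. This assumes the linear $\beta$ schedule from the DDPM paper ($\beta_1 = 10^{-4}$ to $\beta_T = 0.02$ over $T = 1000$ steps); the helper name `q_sample` is illustrative:

```python
import torch

# Linear beta schedule (DDPM's values: beta_1 = 1e-4, beta_T = 0.02, T = 1000)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)   # \bar{alpha}_t for t = 1..T

def q_sample(x_0, t, noise):
    """Sample x_t directly from x_0 using the closed form (no iteration)."""
    ab = alpha_bars[t].view(-1, 1, 1, 1)    # broadcast over (B, C, H, W)
    return torch.sqrt(ab) * x_0 + torch.sqrt(1 - ab) * noise

x_0 = torch.randn(4, 3, 32, 32)             # a batch of "images"
t = torch.randint(0, T, (4,))
x_t = q_sample(x_0, t, torch.randn_like(x_0))
```

Note that `alpha_bars` decreases monotonically toward zero, so by the final step the signal term has all but vanished and $x_T$ is effectively pure Gaussian noise.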
The reverse process learns to denoise:
\[p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)\]The neural network $\mu_\theta$ predicts the mean of the denoised distribution at each step.
Ho et al. (2020) showed that instead of predicting the mean $\mu_\theta$, the network can predict the noise $\epsilon_\theta(x_t, t)$ that was added. The training objective simplifies to:
\[L = \mathbb{E}_{t, x_0, \epsilon} \left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right]\]This is remarkably simple: sample a random image, add noise at a random time step, and train the network to predict what noise was added.
Ho, Jain, and Abbeel published the paper that made diffusion models practical. Their key contributions:
- Parameterizing the network to predict the added noise $\epsilon$ rather than the mean $\mu_\theta$
- A simplified training objective: plain MSE between true and predicted noise
- Fixing the reverse-process variances $\sigma_t^2$ by the schedule rather than learning them
- Sample quality competitive with GANs on CIFAR-10 and LSUN
The noise prediction network is typically a U-Net (first introduced for medical image segmentation, Chapter 7):
x_t, t ──→ [DownBlock] → [DownBlock] → [MiddleBlock] → [UpBlock] → [UpBlock] → ε_θ
               │                                           ↑
               └─────────────── skip connections ──────────┘
Key components:
- Residual blocks with group normalization at each resolution
- Self-attention layers at the lower resolutions
- A sinusoidal embedding of the timestep $t$, injected into every block
- Skip connections from each down block to its matching up block
To generate an image, start from pure noise and iterate the reverse process:
x_T ~ N(0, I) → x_{T-1} → ... → x_1 → x_0 (generated image)
At each step: \(x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right) + \sigma_t z\)
where $z \sim \mathcal{N}(0, I)$ is fresh noise (stochasticity).
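Putting the update rule into a loop gives the full sampler. This is a sketch that assumes a trained `model(x, t)` predicting $\epsilon$ and the linear $\beta$ schedule defined earlier; it uses the simple choice $\sigma_t^2 = \beta_t$:

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, betas):
    """Generate a sample by iterating the reverse process from pure noise."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                               # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        eps = model(x, torch.full((shape[0],), t))
        mean = (x - (1 - alphas[t]) / torch.sqrt(1 - alpha_bars[t]) * eps) \
               / torch.sqrt(alphas[t])
        if t > 0:
            x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)  # sigma_t z
        else:
            x = mean                                     # no noise at the final step
    return x
```

With $T = 1000$, this means 1,000 full forward passes of the network per image, which is why DDPM sampling is slow.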
Song et al. introduced Denoising Diffusion Implicit Models (DDIM), which allow:
- Deterministic sampling (no fresh noise added at each step)
- Skipping steps: generating with 50–100 steps instead of 1,000
- Reusing the same trained network, with no retraining
- Meaningful latent interpolation, since the noise-to-image mapping is deterministic
This reduced sampling time from minutes to seconds, making diffusion models practical.
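A single deterministic DDIM update ($\eta = 0$) can be sketched as follows: first invert the closed-form forward equation to estimate $x_0$, then re-noise that estimate to the earlier timestep. The helper name `ddim_step` is illustrative:

```python
import torch

def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0)."""
    # Invert x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps to estimate x_0
    x0_pred = (x_t - torch.sqrt(1 - alpha_bar_t) * eps) / torch.sqrt(alpha_bar_t)
    # Jump directly to the earlier timestep using the same predicted noise
    return torch.sqrt(alpha_bar_prev) * x0_pred + torch.sqrt(1 - alpha_bar_prev) * eps
```

Because the update is deterministic and only needs $\bar{\alpha}$ at the two endpoints, the "previous" step need not be adjacent: sampling can jump from step 1000 to 980 to 960, which is how few-step sampling works.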
Ho and Salimans introduced classifier-free guidance, the key technique that made conditional generation work spectacularly well.
Train the same diffusion model both conditionally (with a text prompt) and unconditionally (no prompt, using a null embedding). At sampling time, amplify the difference:
\[\hat{\epsilon}_\theta(x_t, t, c) = \epsilon_\theta(x_t, t, \emptyset) + w \cdot (\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset))\]where $w > 1$ is the guidance scale. Higher $w$ produces images that more closely match the prompt but with less diversity.
The guidance scale $w$ acts as a “creativity dial”: low values (near 1) give diverse samples that follow the prompt loosely, typical values (around 7–8 for Stable Diffusion) balance fidelity and diversity, and very high values yield prompt-faithful but oversaturated, less varied images.
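The guidance formula translates directly into code. This sketch assumes a model that accepts a conditioning embedding as its third argument, with a null embedding standing in for "no prompt"; the helper name `cfg_eps` is illustrative:

```python
import torch

def cfg_eps(model, x_t, t, cond, null_cond, w):
    """Classifier-free guidance: amplify the conditional direction by scale w."""
    eps_uncond = model(x_t, t, null_cond)   # epsilon_theta(x_t, t, null)
    eps_cond = model(x_t, t, cond)          # epsilon_theta(x_t, t, c)
    return eps_uncond + w * (eps_cond - eps_uncond)
```

In practice the two predictions are usually computed in one batched forward pass, so guidance roughly doubles the cost of each sampling step.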
Running diffusion in pixel space is expensive: a 512×512 image has 786,432 dimensions. Rombach et al. proposed running diffusion in a compressed latent space:
This reduces the dimensionality by ~48×, making training and sampling dramatically faster.
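The arithmetic behind the ~48× figure is quick to check, assuming Stable Diffusion's latent shape (8× spatial downsampling, 4 latent channels):

```python
pixel_dims = 512 * 512 * 3    # RGB image in pixel space
latent_dims = 64 * 64 * 4     # latent: 8x downsampled spatially, 4 channels
print(pixel_dims, latent_dims, pixel_dims / latent_dims)  # 786432 16384 48.0
```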
Text Prompt → CLIP Text Encoder
                     │ (cross-attention)
                     ↓
Random Noise z_T → U-Net (in latent space) → z_0 → VAE Decoder → Image
Stable Diffusion (2022) combined:
- Latent diffusion: a VAE compresses images, and diffusion runs in the latent space
- A CLIP text encoder feeding the U-Net through cross-attention
- Classifier-free guidance for prompt adherence
- Openly released weights
The result was the first high-quality, open-source text-to-image model, igniting the generative AI revolution.
OpenAI’s system used a two-stage approach: a prior maps the CLIP text embedding to a CLIP image embedding, and a diffusion decoder then generates the image conditioned on that embedding.
Google’s Imagen used a text-to-image diffusion model with a frozen T5 text encoder (a large language model). Key finding: scaling the text encoder matters more than scaling the image model.
Peebles and Xie replaced the U-Net with a Transformer as the diffusion backbone:
\[\text{DiT}(z_t, t, c) = \text{Transformer}(\text{patchify}(z_t) + \text{pos\_emb}, \text{AdaLN}(t, c))\]Key innovations:
- Operating on a sequence of latent patches rather than a spatial feature map
- Conditioning on $t$ and $c$ via adaptive layer norm (adaLN-Zero) instead of extra tokens
- Predictable scaling: sample quality improves smoothly with transformer compute
DiT is the architecture behind DALL·E 3, Sora, and Stable Diffusion 3.
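The patchify step above is a plain reshape. A minimal sketch, assuming a latent of shape (B, C, H, W) and DiT's common patch size of 2:

```python
import torch

def patchify(z, p=2):
    """Split a latent (B, C, H, W) into flattened p x p patches: (B, N, C*p*p)."""
    B, C, H, W = z.shape
    z = z.reshape(B, C, H // p, p, W // p, p)   # split H and W into patch grids
    z = z.permute(0, 2, 4, 1, 3, 5)             # (B, H/p, W/p, C, p, p)
    return z.reshape(B, (H // p) * (W // p), C * p * p)

tokens = patchify(torch.randn(1, 4, 32, 32), p=2)   # 256 tokens of dimension 16
```

Each token is then linearly projected to the transformer width, exactly as words are embedded in a language model.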
Flow Matching (Lipman et al., 2022) simplifies diffusion by learning a velocity field that transports noise to data along straight paths:
\[\frac{dx_t}{dt} = v_\theta(x_t, t)\]Instead of the complex noise schedule of diffusion models, flow matching uses a simple linear interpolation:
\[x_t = (1 - t) \cdot x_0 + t \cdot \epsilon\]This is simpler to implement, faster to train, and produces comparable results. Stable Diffusion 3 and FLUX use flow matching.
# See code/ch13_diffusion.py for full implementation
import torch
import torch.nn.functional as F

def training_step(model, x_0, noise_schedule):
    """One training step of a diffusion model.

    noise_schedule holds the cumulative products \bar{alpha}_t, indexed by timestep.
    """
    # Sample a random timestep for each image in the batch
    t = torch.randint(0, len(noise_schedule), (x_0.shape[0],))
    # Sample noise
    noise = torch.randn_like(x_0)
    # Create the noisy version x_t in closed form
    alpha_bar = noise_schedule[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(alpha_bar) * x_0 + torch.sqrt(1 - alpha_bar) * noise
    # Predict the noise that was added
    predicted_noise = model(x_t, t)
    # Simple MSE loss
    return F.mse_loss(predicted_noise, noise)
Previous Chapter: Scaling Laws and Large Language Models
Next Chapter: Optimization Advances — Making Training Practical