Chapter 9: Generative Adversarial Networks (2014–2020)

A New Paradigm: Learning by Competition

In 2014, Ian Goodfellow introduced an idea that was radically different from anything before: instead of training a single network to minimize a loss function, train two networks that compete against each other.

The Generator tries to create fake data that looks real. The Discriminator tries to distinguish real data from fake. As each improves, the other must improve too, driving both toward excellence.

This adversarial framework — the Generative Adversarial Network (GAN) — launched an entirely new approach to generative modeling and produced some of the most visually stunning results in deep learning history.

The GAN Framework

Setup

Two networks are trained jointly: a generator $G$ that maps noise vectors $z \sim p_z$ (typically Gaussian) to samples in data space, and a discriminator $D$ that outputs the probability $D(x) \in (0, 1)$ that an input $x$ came from the real data distribution $p_{\text{data}}$ rather than from $G$.

The Minimax Game

\[\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]\]

Intuition: $D$ maximizes the objective by assigning high probability to real samples and low probability to fakes; $G$ minimizes it by producing samples that $D$ scores as real. At the theoretical equilibrium, $D$ outputs $1/2$ everywhere because real and generated samples are indistinguishable.

Training Algorithm

At each iteration:

  1. Sample a batch of real data $\{x^{(1)}, \dots, x^{(m)}\}$
  2. Sample a batch of noise $\{z^{(1)}, \dots, z^{(m)}\}$
  3. Update D: Maximize $\log D(x) + \log(1 - D(G(z)))$ (standard gradient ascent)
  4. Update G: Minimize $\log(1 - D(G(z)))$ (or equivalently, maximize $\log D(G(z))$)
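The four steps above can be sketched as a single training iteration. This is a minimal toy version: the tiny MLPs, dimensions, and learning rates are illustrative assumptions, not the architecture from any specific paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
latent_dim, data_dim, batch = 8, 2, 64

G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.LeakyReLU(0.2), nn.Linear(32, 1))
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def gan_step(real):
    m = real.size(0)
    # Steps 1-2: real batch is given; sample a noise batch
    z = torch.randn(m, latent_dim)
    # Step 3: update D — maximize log D(x) + log(1 - D(G(z))),
    # equivalently minimize BCE with labels real=1, fake=0
    fake = G(z).detach()                      # block gradients into G
    loss_D = bce(D(real), torch.ones(m, 1)) + bce(D(fake), torch.zeros(m, 1))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()
    # Step 4: update G with the non-saturating loss — maximize log D(G(z))
    loss_G = bce(D(G(z)), torch.ones(m, 1))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()

real = torch.randn(batch, data_dim) + 3.0     # stand-in "real" data
d_loss, g_loss = gan_step(real)
```

Note that `detach()` in step 3 is what separates the two updates: the discriminator's loss must not backpropagate into the generator's parameters.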

Theoretical Result

Goodfellow proved that under ideal conditions (infinite capacity, infinite data, optimal discriminator), the generator converges to the true data distribution:

\[p_G = p_{\text{data}}\]

In practice, reaching this equilibrium is notoriously difficult.

The Training Challenges

GANs are famously hard to train. Three major failure modes:

1. Mode Collapse

The generator finds a few outputs that fool the discriminator and keeps producing only those. Instead of generating diverse images of dogs, it might produce the same dog over and over.

The discriminator eventually catches on, but the generator simply switches to a different small set of outputs — leading to oscillation rather than convergence.

2. Training Instability

The generator and discriminator are locked in a dynamic game. If one becomes much stronger than the other, training stalls: an overpowering discriminator rejects everything and passes almost no useful gradient to the generator, while an overpowering generator exploits the discriminator's blind spots and receives misleading feedback.

Balancing the training of both networks requires careful hyperparameter tuning.

3. Vanishing Gradients for G

In the original formulation, when $D$ is near-optimal it confidently rejects fakes: $D(G(z)) \approx 0$, so $\log(1 - D(G(z)))$ saturates near $0$ and passes almost no gradient to $G$. The practical fix: maximize $\log D(G(z))$ instead of minimizing $\log(1-D(G(z)))$, which provides strong gradients exactly when the generator is doing poorly.
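The effect is easy to verify numerically. Treating the discriminator's raw score (pre-sigmoid logit) for a fake sample as a scalar, the saturating loss yields a vanishing gradient while the non-saturating loss does not; the logit value $-6$ is an illustrative stand-in for "D confidently rejects the fake".

```python
import torch

# D(G(z)) = sigmoid(logit); logit = -6 means D(G(z)) ≈ 0.0025
logit = torch.tensor(-6.0, requires_grad=True)

# Original (saturating) generator loss: minimize log(1 - D(G(z)))
loss_sat = torch.log(1 - torch.sigmoid(logit))
grad_sat, = torch.autograd.grad(loss_sat, logit)    # d/dx = -sigmoid(x) ≈ -0.0025

# Non-saturating fix: maximize log D(G(z)), i.e. minimize -log D(G(z))
logit2 = torch.tensor(-6.0, requires_grad=True)
loss_ns = -torch.log(torch.sigmoid(logit2))
grad_ns, = torch.autograd.grad(loss_ns, logit2)     # d/dx = sigmoid(x) - 1 ≈ -1

print(abs(grad_sat.item()), abs(grad_ns.item()))    # tiny vs. near 1
```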

Key GAN Variants and Innovations

DCGAN: Deep Convolutional GAN (2015)

Radford et al. established architectural guidelines that stabilized GAN training:

  - Replace pooling with strided convolutions (discriminator) and transposed convolutions (generator)
  - Use batch normalization in both networks
  - Remove fully connected hidden layers
  - Use ReLU activations in the generator (Tanh for the output layer) and LeakyReLU in the discriminator

These “DCGAN rules” became the standard recipe for CNN-based GANs.

Conditional GAN (cGAN, 2014)

Provide both G and D with additional information $y$ (class label, text description, etc.):

\[G(z, y), \quad D(x, y)\]

This allows controlling what the generator produces. For example: generate an image of a “red sports car” rather than a random image.
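A minimal sketch of how $G(z, y)$ can be implemented: embed the label $y$ and concatenate it with the noise vector before the first layer (the discriminator conditions on $x$ and $y$ the same way). The network sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

n_classes, latent_dim, data_dim, emb_dim = 10, 16, 32, 8

class CondGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(n_classes, emb_dim)   # label y → dense vector
        self.net = nn.Sequential(
            nn.Linear(latent_dim + emb_dim, 64), nn.ReLU(),
            nn.Linear(64, data_dim),
        )

    def forward(self, z, y):
        # G(z, y): condition by concatenating the label embedding onto z
        return self.net(torch.cat([z, self.embed(y)], dim=1))

G = CondGenerator()
z = torch.randn(4, latent_dim)
y = torch.tensor([0, 3, 3, 7])   # desired classes for each sample
x_fake = G(z, y)                 # shape (4, data_dim)
```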

Pix2Pix: Image-to-Image Translation (2016)

Isola et al. combined conditional GANs with a U-Net generator and an L1 reconstruction loss for paired image translation: the adversarial term pushes outputs toward sharp, realistic high-frequency detail, while the L1 term keeps them close to the ground-truth target.

The discriminator operates on patches (PatchGAN) rather than the full image, judging whether local regions look real.
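A PatchGAN discriminator falls out naturally from an all-convolutional design: with no global pooling or fully connected head, the output is a grid of logits, one per receptive-field patch. The channel widths below are illustrative assumptions, not the exact Pix2Pix configuration.

```python
import torch
import torch.nn as nn

patch_D = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(128, 1, 4, stride=1, padding=1),   # one logit per local patch
)

x = torch.randn(1, 3, 64, 64)
patch_logits = patch_D(x)
print(patch_logits.shape)   # (1, 1, 15, 15): a grid of real/fake decisions
```

Each logit judges only the local region inside its receptive field, so the loss is averaged over the grid rather than computed from a single whole-image score.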

CycleGAN: Unpaired Image Translation (2017)

Zhu et al. solved an even harder problem: translating between image domains without paired examples. The key innovation is cycle consistency loss:

\[L_{\text{cycle}} = \|G_{B \to A}(G_{A \to B}(x_A)) - x_A\|_1 + \|G_{A \to B}(G_{B \to A}(x_B)) - x_B\|_1\]

If you translate a horse image to a zebra and back, you should get the original horse. This constraint, applied in both directions, prevents the generators from producing arbitrary outputs.
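The cycle loss above translates directly into code. The two generators here ($G_{A \to B}$ and $G_{B \to A}$) are stood in by small linear layers purely for illustration; in CycleGAN they are full image-to-image networks.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim = 16
G_AB = nn.Linear(dim, dim)   # stand-in for G_{A→B}
G_BA = nn.Linear(dim, dim)   # stand-in for G_{B→A}
l1 = nn.L1Loss()

x_A = torch.randn(8, dim)
x_B = torch.randn(8, dim)

# L_cycle = ||G_BA(G_AB(x_A)) - x_A||_1 + ||G_AB(G_BA(x_B)) - x_B||_1
loss_cycle = l1(G_BA(G_AB(x_A)), x_A) + l1(G_AB(G_BA(x_B)), x_B)
```

In training, this term is added (with a weighting coefficient) to the adversarial losses of both discriminators.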

Famous applications: horse ↔ zebra translation, photo ↔ Monet-style painting, and summer ↔ winter scene conversion.

Wasserstein GAN (WGAN, 2017)

Arjovsky et al. replaced the traditional GAN objective with the Wasserstein distance (Earth Mover’s distance):

\[\min_G \max_{D \in \mathcal{D}} \; \mathbb{E}_{x \sim p_{\text{data}}}[D(x)] - \mathbb{E}_{z \sim p_z}[D(G(z))]\]

Key advantages:

  - The loss correlates with sample quality, giving a meaningful quantity to monitor during training
  - Training is more stable, with useful gradients even when the critic is well-trained
  - Mode collapse is substantially reduced

The constraint $D \in \mathcal{D}$ (Lipschitz constraint) is enforced through gradient penalty (WGAN-GP):

\[\lambda \, \mathbb{E}_{\hat{x}}[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2]\]

where $\hat{x}$ is sampled along straight lines between real and generated data points.
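The gradient penalty term above can be implemented as follows; the small critic network is a stand-in for illustration, and $\lambda = 10$ follows the value used in the WGAN-GP paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def gradient_penalty(critic, real, fake, lam=10.0):
    # x_hat: random points on straight lines between real and fake samples
    eps = torch.rand(real.size(0), 1)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    # Gradient of the critic's output w.r.t. its input (kept in the graph
    # so the penalty itself can be backpropagated)
    grads, = torch.autograd.grad(critic(x_hat).sum(), x_hat, create_graph=True)
    # Penalize deviation of the gradient norm from 1
    return lam * ((grads.norm(2, dim=1) - 1) ** 2).mean()

critic = nn.Sequential(nn.Linear(8, 32), nn.LeakyReLU(0.2), nn.Linear(32, 1))
real, fake = torch.randn(16, 8), torch.randn(16, 8)
gp = gradient_penalty(critic, real, fake)   # added to the critic's loss
```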

Progressive GAN (ProGAN, 2017)

Karras et al. from NVIDIA introduced progressive growing: start with 4×4 resolution and gradually add layers during training to reach higher resolutions (up to 1024×1024).

This training strategy was crucial for generating high-resolution images — training a GAN directly at high resolution was unstable.
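The mechanism that makes growing smooth is a fade-in: while a new higher-resolution block is being added, its output is blended with the upsampled output of the previous stage, and the blend weight $\alpha$ ramps from 0 to 1 over training. A minimal sketch, with illustrative shapes:

```python
import torch
import torch.nn.functional as F

def fade_in(prev_rgb, new_rgb, alpha):
    # prev_rgb: RGB output of the old low-res stage, upsampled 2x to match
    up = F.interpolate(prev_rgb, scale_factor=2, mode="nearest")
    # alpha = 0: pure old stage; alpha = 1: pure new high-res block
    return alpha * new_rgb + (1 - alpha) * up

prev = torch.randn(1, 3, 8, 8)    # output before adding the new block
new = torch.randn(1, 3, 16, 16)   # output of the newly added block
out = fade_in(prev, new, alpha=0.3)   # shape (1, 3, 16, 16)
```

At $\alpha = 1$ the old pathway contributes nothing and the next resolution stage can begin.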

StyleGAN and StyleGAN2 (2018–2019)

Karras et al. continued with StyleGAN, which became the state of the art for face generation:

  - A mapping network transforms the latent $z$ into an intermediate latent $w$ that is better disentangled
  - Styles derived from $w$ modulate each layer of the synthesis network (via adaptive instance normalization), controlling attributes at different scales
  - Per-pixel noise inputs add stochastic detail such as hair and skin texture
  - Style mixing combines latents from different samples at different layers

StyleGAN2 removed characteristic artifacts (notably the “droplet” blobs) and improved quality further, producing faces often indistinguishable from real photographs.

Spectral Normalization (2018)

Miyato et al. constrained the discriminator’s Lipschitz constant by normalizing weight matrices by their spectral norm (largest singular value). This simple technique stabilized training across many GAN variants.
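PyTorch ships this technique as a built-in wrapper, `nn.utils.spectral_norm`: before each forward pass, the layer's weight is divided by a power-iteration estimate of its largest singular value. A small demonstration (the loop is only there to let the estimate converge):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.utils.spectral_norm(nn.Linear(64, 64))
x = torch.randn(4, 64)

# Each forward pass in training mode runs one power-iteration step,
# refining the estimate of the weight's top singular value
for _ in range(100):
    _ = layer(x)

# After normalization, the effective weight has spectral norm ≈ 1
sigma = torch.linalg.matrix_norm(layer.weight.detach(), ord=2).item()
print(sigma)
```

Applying this wrapper to every layer of the discriminator bounds its Lipschitz constant, which is why it stabilized training across so many GAN variants.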

The GAN Zoo: A Proliferation of Variants

The GAN framework inspired hundreds of variants:

| GAN Variant | Innovation | Application |
| --- | --- | --- |
| DCGAN | CNN architecture guidelines | General image generation |
| cGAN | Conditional generation | Controlled generation |
| Pix2Pix | Paired image translation | Image-to-image |
| CycleGAN | Unpaired translation with cycle loss | Style transfer |
| WGAN-GP | Wasserstein distance + gradient penalty | Stable training |
| ProGAN | Progressive growing | High-resolution images |
| StyleGAN | Style-based generation | Face synthesis |
| BigGAN | Large-scale class-conditional training | ImageNet generation |
| StarGAN | Multi-domain image translation | Attribute transfer |

GANs vs. Other Generative Models

| Model | Training | Sample Quality | Diversity | Likelihood |
| --- | --- | --- | --- | --- |
| GAN | Adversarial | Excellent (sharp) | Mode collapse risk | Not computed |
| VAE | Reconstruction + KL | Good (blurry) | Good | Approximate |
| Flow | Maximum likelihood | Good | Good | Exact |
| Diffusion | Denoising | Excellent | Excellent | Approximate |

GANs produce the sharpest images but suffer from training instability and mode collapse. Diffusion models (Chapter 13) eventually surpassed GANs in both quality and diversity, though GANs remain faster at inference time.

Code Example: A Simple DCGAN Generator

# See code/ch09_gan.py for full training pipeline
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=100, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            # Input: (latent_dim, 1, 1) → (512, 4, 4)
            nn.ConvTranspose2d(latent_dim, 512, 4, 1, 0),
            nn.BatchNorm2d(512), nn.ReLU(),
            # 4×4 → 8×8
            nn.ConvTranspose2d(512, 256, 4, 2, 1),
            nn.BatchNorm2d(256), nn.ReLU(),
            # 8×8 → 16×16
            nn.ConvTranspose2d(256, 128, 4, 2, 1),
            nn.BatchNorm2d(128), nn.ReLU(),
            # 16×16 → 32×32; Tanh maps pixel values into [-1, 1]
            nn.ConvTranspose2d(128, channels, 4, 2, 1),
            nn.Tanh(),
        )

    def forward(self, z):
        # z: (batch, latent_dim, 1, 1) → image: (batch, channels, 32, 32)
        return self.net(z)

The Legacy of GANs

While diffusion models had surpassed GANs on most generative tasks by 2023, GANs left a lasting impact:

  1. Adversarial training as a general principle is used in many contexts (domain adaptation, robust training, RLHF)
  2. Image-to-image translation techniques developed for GANs are still widely used
  3. Latent space manipulation from StyleGAN influenced how we think about learned representations
  4. Perceptual and adversarial losses are used in many non-GAN systems (super-resolution, style transfer)

Key Takeaways

  - GANs train a generator and a discriminator in a minimax game; at the ideal equilibrium, $p_G = p_{\text{data}}$
  - Training is fragile: mode collapse, instability, and vanishing generator gradients are the core failure modes
  - Practical stabilizers include the non-saturating generator loss, DCGAN architecture rules, Wasserstein objectives with gradient penalty, spectral normalization, and progressive growing
  - Conditional and cycle-consistent variants extended GANs to controlled generation and image-to-image translation
  - Diffusion models later surpassed GANs in quality and diversity, but adversarial training survives as a general principle well beyond image synthesis


Previous Chapter: Attention Mechanisms and the Transformer

Next Chapter: Transfer Learning and Foundation Models

Back to Table of Contents