In 2014, Ian Goodfellow introduced an idea that was radically different from anything before: instead of training a single network to minimize a loss function, train two networks that compete against each other.
The Generator tries to create fake data that looks real. The Discriminator tries to distinguish real data from fake. As each improves, the other must improve too, driving both toward excellence.
This adversarial framework — the Generative Adversarial Network (GAN) — launched an entirely new approach to generative modeling and produced some of the most visually stunning results in deep learning history.
Intuition: the generator is a counterfeiter learning to produce convincing forgeries, while the discriminator is a detective learning to spot them. Neither can afford to stop improving.

Formally, the two networks play a minimax game:

\[\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]\]

At each iteration:

1. Update $D$ to better separate real samples from the generator's fakes.
2. Update $G$ so that its fakes are more likely to fool the current $D$.
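These alternating updates can be sketched in PyTorch. The tiny linear `G` and `D` below are illustrative stand-ins, not the chapter's architectures:

```python
import torch
import torch.nn as nn

# Toy stand-ins: G maps a 100-dim latent to a 2-dim "sample",
# D maps a sample to a probability of being real.
G = nn.Sequential(nn.Linear(100, 2))
D = nn.Sequential(nn.Linear(2, 1), nn.Sigmoid())

opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real):
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # 1) Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    fake = G(torch.randn(batch, 100)).detach()  # detach: no gradient into G here
    loss_D = bce(D(real), ones) + bce(D(fake), zeros)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # 2) Generator step: push D(G(z)) toward 1 (fool the current D).
    loss_G = bce(D(G(torch.randn(batch, 100))), ones)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```

Note the `detach()` in the discriminator step: each network is updated while the other's parameters are held fixed.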
Goodfellow proved that under ideal conditions (infinite capacity, infinite data, optimal discriminator), the generator converges to the true data distribution:
\[p_G = p_{\text{data}}\]

In practice, reaching this equilibrium is notoriously difficult.
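The mechanism behind this result: Goodfellow's analysis shows that for a fixed generator, the optimal discriminator is $D^*(x) = p_{\text{data}}(x) / (p_{\text{data}}(x) + p_G(x))$, which collapses to $1/2$ everywhere once $p_G = p_{\text{data}}$. A numeric sketch with one-dimensional Gaussians (densities chosen purely for illustration):

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def d_star(x, mu_g):
    """Optimal discriminator for fixed densities: p_data / (p_data + p_G)."""
    p_data = normal_pdf(x, 0.0, 1.0)  # "true" data density (standard normal, assumed)
    p_g = normal_pdf(x, mu_g, 1.0)    # generator density
    return p_data / (p_data + p_g)

# Far from equilibrium: D* is nearly certain where the real data lives.
print(round(d_star(0.0, mu_g=4.0), 3))  # 1.0

# At equilibrium (p_G = p_data): D* = 1/2 everywhere; D cannot tell them apart.
print(round(d_star(0.7, mu_g=0.0), 3))  # 0.5
```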
GANs are famously hard to train. Three major failure modes:
Mode collapse: The generator finds a few outputs that fool the discriminator and keeps producing only those. Instead of generating diverse images of dogs, it might produce the same dog over and over. The discriminator eventually catches on, but the generator simply switches to a different small set of outputs — leading to oscillation rather than convergence.
Training imbalance: The generator and discriminator are locked in a dynamic game. If one becomes much stronger than the other, training stalls: a discriminator that wins decisively passes almost no useful gradient to the generator, while a generator facing a weak discriminator learns to exploit its blind spots rather than the data distribution. Balancing the training of both networks requires careful hyperparameter tuning.
Vanishing gradients: In the original formulation, when $D$ confidently rejects fakes ($D(G(z)) \approx 0$), the generator's loss term $\log(1 - D(G(z))) \approx 0$ and the gradient flowing back to $G$ all but vanishes. The practical fix: train $G$ to maximize $\log D(G(z))$ instead of minimizing $\log(1 - D(G(z)))$. Because the slope of $\log x$ grows without bound as $x \to 0$, this "non-saturating" loss provides strong gradients early in training, precisely when the generator is doing worst.
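The difference is easy to verify numerically: differentiate both losses with respect to the discriminator's score at a point where $D$ rejects the fake ($d = 0.01$):

```python
import torch

# d stands for the discriminator's score on a fake sample, D(G(z)).
d = torch.tensor(0.01, requires_grad=True)

# Saturating loss from the original minimax objective: minimize log(1 - d).
torch.log(1 - d).backward()
print(d.grad)  # ≈ -1.01: weak signal

d.grad = None
# Non-saturating alternative: maximize log d, i.e. minimize -log d.
(-torch.log(d)).backward()
print(d.grad)  # ≈ -100: strong signal exactly when D rejects the fake
```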
Radford et al. established architectural guidelines that stabilized GAN training:

- Replace pooling layers with strided convolutions (discriminator) and transposed convolutions (generator).
- Use batch normalization in both networks.
- Remove fully connected hidden layers.
- Use ReLU activations in the generator (Tanh for the output layer) and LeakyReLU in the discriminator.

These "DCGAN rules" became the standard recipe for CNN-based GANs.
Provide both G and D with additional information $y$ (class label, text description, etc.):
\[G(z, y), \quad D(x, y)\]

This allows controlling what the generator produces. For example: generate an image of a "red sports car" rather than a random image.
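One common way to implement $G(z, y)$ is to concatenate a learned label embedding onto the latent code. A minimal sketch; the embedding size, MLP widths, and output dimension here are arbitrary illustrative choices:

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """cGAN generator sketch: condition on a class label y by concatenating
    a learned embedding of y onto the latent code z."""
    def __init__(self, latent_dim=100, n_classes=10, embed_dim=16, out_dim=784):
        super().__init__()
        self.embed = nn.Embedding(n_classes, embed_dim)
        self.net = nn.Sequential(
            nn.Linear(latent_dim + embed_dim, 256), nn.ReLU(),
            nn.Linear(256, out_dim), nn.Tanh(),
        )

    def forward(self, z, y):
        # G(z, y): the label steers which mode of the data the sample comes from.
        return self.net(torch.cat([z, self.embed(y)], dim=1))

G = ConditionalGenerator()
z = torch.randn(4, 100)
y = torch.tensor([0, 1, 2, 3])  # class labels to condition on
print(G(z, y).shape)            # torch.Size([4, 784])
```

The discriminator $D(x, y)$ is conditioned the same way, so it can penalize samples that are realistic but mismatched to their label.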
Isola et al. combined conditional GANs with U-Net and L1 loss for paired image translation:
The discriminator operates on patches (PatchGAN) rather than the full image, judging whether local regions look real.
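A minimal sketch of the PatchGAN idea, with layer sizes chosen for illustration (the Pix2Pix paper's 70×70 PatchGAN is deeper):

```python
import torch
import torch.nn as nn

# Instead of one scalar per image, a PatchGAN-style discriminator emits a grid
# of scores, each judging the realism of one local receptive field.
patch_d = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(128, 1, 4, stride=1, padding=1),  # one logit per patch
)

x = torch.randn(1, 3, 64, 64)
print(patch_d(x).shape)  # torch.Size([1, 1, 15, 15]): a 15×15 grid of patch scores
```

Because the network is fully convolutional, the same weights apply to images of any size, and the loss averages over all patch scores.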
Zhu et al. solved an even harder problem: translating between image domains without paired examples. The key innovation is cycle consistency loss:
\[L_{\text{cycle}} = \|G_{B \to A}(G_{A \to B}(x_A)) - x_A\|_1 + \|G_{A \to B}(G_{B \to A}(x_B)) - x_B\|_1\]

If you translate a horse image to a zebra and back, you should get the original horse. This constraint, applied in both directions, prevents the generators from producing arbitrary outputs.
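The loss is just a pair of L1 terms. A minimal sketch, with identity maps standing in for the two translation networks:

```python
import torch
import torch.nn as nn

l1 = nn.L1Loss()

def cycle_consistency(G_ab, G_ba, x_a, x_b):
    # A -> B -> A should recover x_a; B -> A -> B should recover x_b.
    return l1(G_ba(G_ab(x_a)), x_a) + l1(G_ab(G_ba(x_b)), x_b)

# Toy "generators": identity maps are trivially cycle-consistent.
G_ab, G_ba = nn.Identity(), nn.Identity()
x_a, x_b = torch.randn(2, 3, 8, 8), torch.randn(2, 3, 8, 8)
print(cycle_consistency(G_ab, G_ba, x_a, x_b))  # tensor(0.)
```

In the full CycleGAN objective this term is added, with a large weight, to the adversarial losses of both generators.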
Famous applications: horses ↔ zebras, photographs ↔ Monet-style paintings, summer ↔ winter scenes.
Arjovsky et al. replaced the traditional GAN objective with the Wasserstein distance (Earth Mover’s distance):
\[\min_G \max_{D \in \mathcal{D}} \; \mathbb{E}_{x \sim p_{\text{data}}}[D(x)] - \mathbb{E}_{z \sim p_z}[D(G(z))]\]

Key advantages:

- The critic's loss correlates with sample quality, giving a meaningful signal for monitoring training.
- Training is markedly more stable and less sensitive to architecture and hyperparameter choices.
- Mode collapse is substantially reduced.
The constraint $D \in \mathcal{D}$ (Lipschitz constraint) is enforced through gradient penalty (WGAN-GP):
\[\lambda \, \mathbb{E}_{\hat{x}}[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2]\]

where $\hat{x}$ is sampled along straight lines between real and generated data points.
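A sketch of the penalty for vector-valued data (for images the interpolation coefficient would need to broadcast over channel and spatial dimensions); the linear critic is a toy stand-in:

```python
import torch
import torch.nn as nn

def gradient_penalty(D, real, fake, lam=10.0):
    """WGAN-GP penalty: push ||grad_x D(x_hat)||_2 toward 1 at points x_hat
    sampled on straight lines between real and generated samples."""
    eps = torch.rand(real.size(0), 1)                       # per-sample mixing coefficient
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    d_out = D(x_hat)
    # create_graph=True so the penalty itself can be backpropagated.
    grads, = torch.autograd.grad(d_out.sum(), x_hat, create_graph=True)
    return lam * ((grads.norm(2, dim=1) - 1) ** 2).mean()

D = nn.Linear(2, 1)  # toy critic: no sigmoid, WGAN critics output raw scores
real, fake = torch.randn(8, 2), torch.randn(8, 2)
gp = gradient_penalty(D, real, fake)
gp.backward()        # this term is added to the critic's loss each step
```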
Karras et al. from NVIDIA introduced progressive growing: start with 4×4 resolution and gradually add layers during training to reach higher resolutions (up to 1024×1024).
This training strategy was crucial for generating high-resolution images — training a GAN directly at high resolution was unstable.
Karras et al. continued with StyleGAN, which became the state of the art for face generation:

- A mapping network first transforms $z$ into an intermediate latent space $\mathcal{W}$, which disentangles factors of variation.
- "Styles" derived from the latent code are injected at every resolution via adaptive instance normalization (AdaIN).
- Per-pixel noise inputs add stochastic detail such as hair placement and freckles.
- Style mixing lets coarse attributes (pose, face shape) and fine attributes (color, texture) be controlled separately.
StyleGAN2 removed artifacts and improved quality further, producing faces indistinguishable from real photographs.
Miyato et al. constrained the discriminator’s Lipschitz constant by normalizing weight matrices by their spectral norm (largest singular value). This simple technique stabilized training across many GAN variants.
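PyTorch ships this as `torch.nn.utils.spectral_norm`, which runs one power-iteration step per forward pass to estimate the largest singular value. A quick check (layer size arbitrary) that the normalized weight's spectral norm is driven toward 1:

```python
import torch
import torch.nn as nn

# Wrap a layer with spectral normalization, as in SN-GAN: the weight is divided
# by an estimate of its largest singular value, bounding the layer's Lipschitz
# constant by roughly 1.
layer = nn.utils.spectral_norm(nn.Linear(64, 64))

# Each forward pass in training mode runs one power-iteration update,
# refining the singular-value estimate.
for _ in range(50):
    _ = layer(torch.randn(5, 64))

sigma = torch.linalg.svdvals(layer.weight.detach())[0]
print(round(sigma.item(), 2))  # close to 1.0
```

In practice spectral normalization is applied to every layer of the discriminator, and unlike WGAN-GP it adds no extra loss term.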
The GAN framework inspired hundreds of variants:
| GAN Variant | Innovation | Application |
|---|---|---|
| DCGAN | CNN architecture guidelines | General image generation |
| cGAN | Conditional generation | Controlled generation |
| Pix2Pix | Paired image translation | Image-to-image |
| CycleGAN | Unpaired translation with cycle loss | Style transfer |
| WGAN-GP | Wasserstein distance + gradient penalty | Stable training |
| ProGAN | Progressive growing | High-resolution images |
| StyleGAN | Style-based generation | Face synthesis |
| BigGAN | Large-scale class-conditional | ImageNet generation |
| StarGAN | Multi-domain image translation | Attribute transfer |
| Model | Training | Sample Quality | Diversity | Likelihood |
|---|---|---|---|---|
| GAN | Adversarial | Excellent (sharp) | Mode collapse risk | Not computed |
| VAE | Reconstruction + KL | Good (blurry) | Good | Approximate |
| Flow | Maximum likelihood | Good | Good | Exact |
| Diffusion | Denoising | Excellent | Excellent | Approximate |
GANs produce the sharpest images but suffer from training instability and mode collapse. Diffusion models (Chapter 13) eventually surpassed GANs in both quality and diversity, though GANs remain faster at inference time.
A minimal DCGAN-style generator, mapping a latent vector to a 32×32 image:

```python
# See code/ch09_gan.py for full training pipeline
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=100, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            # Input: latent_dim → 4×4×512
            nn.ConvTranspose2d(latent_dim, 512, 4, 1, 0),
            nn.BatchNorm2d(512), nn.ReLU(),
            # 4×4 → 8×8
            nn.ConvTranspose2d(512, 256, 4, 2, 1),
            nn.BatchNorm2d(256), nn.ReLU(),
            # 8×8 → 16×16
            nn.ConvTranspose2d(256, 128, 4, 2, 1),
            nn.BatchNorm2d(128), nn.ReLU(),
            # 16×16 → 32×32
            nn.ConvTranspose2d(128, channels, 4, 2, 1),
            nn.Tanh(),
        )

    def forward(self, z):
        # z: (batch, latent_dim) → reshape to (batch, latent_dim, 1, 1)
        return self.net(z.view(z.size(0), -1, 1, 1))
```
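For completeness, a matching DCGAN-style discriminator sketch that mirrors the generator, downsampling a 32×32 image back to a single real/fake probability (channel widths chosen to mirror the generator above):

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """DCGAN-style discriminator: strided convolutions, LeakyReLU,
    batch norm everywhere except the first layer."""
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            # 32×32 → 16×16
            nn.Conv2d(channels, 128, 4, 2, 1), nn.LeakyReLU(0.2),
            # 16×16 → 8×8
            nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2),
            # 8×8 → 4×4
            nn.Conv2d(256, 512, 4, 2, 1), nn.BatchNorm2d(512), nn.LeakyReLU(0.2),
            # 4×4 → 1×1 score
            nn.Conv2d(512, 1, 4, 1, 0), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x).view(-1, 1)

D = Discriminator()
print(D(torch.randn(2, 3, 32, 32)).shape)  # torch.Size([2, 1])
```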
By 2023, diffusion models had surpassed GANs for most generative tasks, but GANs left a lasting impact: adversarial losses survive as components of other systems (for example, the discriminator term used when training VQGAN-style autoencoders and neural vocoders), GANs remain attractive wherever fast single-pass sampling matters, and the adversarial-training idea spread well beyond generation into domain adaptation and robustness research.
Previous Chapter: Attention Mechanisms and the Transformer
Next Chapter: Transfer Learning and Foundation Models