In 2014, Ian Goodfellow introduced an idea that was radically different from anything before: instead of training a single network to minimize a loss function, train two networks that compete against each other.
The Generator tries to create fake data that looks real. The Discriminator tries to distinguish real data from fake. As each improves, the other must improve too, driving both toward excellence.
This adversarial framework — the Generative Adversarial Network (GAN) — launched an entirely new approach to generative modeling and produced some of the most visually stunning results in deep learning history.
Intuition: the generator is a counterfeiter learning to produce convincing forgeries, while the discriminator is a detective learning to spot them. Neither can afford to stop improving.

Formally, the two networks play a minimax game:

\[\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]\]

At each iteration:

1. Update $D$ to better separate real samples from the generator's fakes.
2. Update $G$ so that its fakes are more likely to fool the current $D$.
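These alternating updates can be sketched in PyTorch. The tiny linear `G` and `D` below are illustrative stand-ins, not the chapter's architectures:

```python
import torch
import torch.nn as nn

# Toy stand-ins: G maps a 100-dim latent to a 2-dim "sample",
# D maps a sample to a probability of being real.
G = nn.Sequential(nn.Linear(100, 2))
D = nn.Sequential(nn.Linear(2, 1), nn.Sigmoid())

opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real):
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # 1) Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    fake = G(torch.randn(batch, 100)).detach()  # detach: no gradient into G here
    loss_D = bce(D(real), ones) + bce(D(fake), zeros)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # 2) Generator step: push D(G(z)) toward 1 (fool the current D).
    loss_G = bce(D(G(torch.randn(batch, 100))), ones)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```

Note the `detach()` in the discriminator step: each network is updated while the other's parameters are held fixed.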
Goodfellow proved that under ideal conditions (infinite capacity, infinite data, optimal discriminator), the generator converges to the true data distribution:
\[p_G = p_{\text{data}}\]

In practice, reaching this equilibrium is notoriously difficult.
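The mechanism behind this result: Goodfellow's analysis shows that for a fixed generator, the optimal discriminator is $D^*(x) = p_{\text{data}}(x) / (p_{\text{data}}(x) + p_G(x))$, which collapses to $1/2$ everywhere once $p_G = p_{\text{data}}$. A numeric sketch with one-dimensional Gaussians (densities chosen purely for illustration):

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def d_star(x, mu_g):
    """Optimal discriminator for fixed densities: p_data / (p_data + p_G)."""
    p_data = normal_pdf(x, 0.0, 1.0)  # "true" data density (standard normal, assumed)
    p_g = normal_pdf(x, mu_g, 1.0)    # generator density
    return p_data / (p_data + p_g)

# Far from equilibrium: D* is nearly certain where the real data lives.
print(round(d_star(0.0, mu_g=4.0), 3))  # 1.0

# At equilibrium (p_G = p_data): D* = 1/2 everywhere; D cannot tell them apart.
print(round(d_star(0.7, mu_g=0.0), 3))  # 0.5
```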
GANs are famously hard to train. Three major failure modes:
Mode collapse: The generator finds a few outputs that fool the discriminator and keeps producing only those. Instead of generating diverse images of dogs, it might produce the same dog over and over. The discriminator eventually catches on, but the generator simply switches to a different small set of outputs — leading to oscillation rather than convergence.
Training imbalance: The generator and discriminator are locked in a dynamic game. If one becomes much stronger than the other, training stalls: a discriminator that wins decisively passes almost no useful gradient to the generator, while a generator facing a weak discriminator learns to exploit its blind spots rather than the data distribution. Balancing the training of both networks requires careful hyperparameter tuning.
Vanishing gradients: In the original formulation, when $D$ confidently rejects fakes ($D(G(z)) \approx 0$), the generator's loss term $\log(1 - D(G(z))) \approx 0$ and the gradient flowing back to $G$ all but vanishes. The practical fix: train $G$ to maximize $\log D(G(z))$ instead of minimizing $\log(1 - D(G(z)))$. Because the slope of $\log x$ grows without bound as $x \to 0$, this "non-saturating" loss provides strong gradients early in training, precisely when the generator is doing worst.
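The difference is easy to verify numerically: differentiate both losses with respect to the discriminator's score at a point where $D$ rejects the fake ($d = 0.01$):

```python
import torch

# d stands for the discriminator's score on a fake sample, D(G(z)).
d = torch.tensor(0.01, requires_grad=True)

# Saturating loss from the original minimax objective: minimize log(1 - d).
torch.log(1 - d).backward()
print(d.grad)  # ≈ -1.01: weak signal

d.grad = None
# Non-saturating alternative: maximize log d, i.e. minimize -log d.
(-torch.log(d)).backward()
print(d.grad)  # ≈ -100: strong signal exactly when D rejects the fake
```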
Radford et al. established architectural guidelines that stabilized GAN training:

- Replace pooling layers with strided convolutions (discriminator) and transposed convolutions (generator).
- Use batch normalization in both networks.
- Remove fully connected hidden layers.
- Use ReLU activations in the generator (Tanh for the output layer) and LeakyReLU in the discriminator.

These "DCGAN rules" became the standard recipe for CNN-based GANs.
Provide both G and D with additional information $y$ (class label, text description, etc.):
\[G(z, y), \quad D(x, y)\]

This allows controlling what the generator produces. For example: generate an image of a "red sports car" rather than a random image.
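One common way to implement $G(z, y)$ is to concatenate a learned label embedding onto the latent code. A minimal sketch; the embedding size, MLP widths, and output dimension here are arbitrary illustrative choices:

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """cGAN generator sketch: condition on a class label y by concatenating
    a learned embedding of y onto the latent code z."""
    def __init__(self, latent_dim=100, n_classes=10, embed_dim=16, out_dim=784):
        super().__init__()
        self.embed = nn.Embedding(n_classes, embed_dim)
        self.net = nn.Sequential(
            nn.Linear(latent_dim + embed_dim, 256), nn.ReLU(),
            nn.Linear(256, out_dim), nn.Tanh(),
        )

    def forward(self, z, y):
        # G(z, y): the label steers which mode of the data the sample comes from.
        return self.net(torch.cat([z, self.embed(y)], dim=1))

G = ConditionalGenerator()
z = torch.randn(4, 100)
y = torch.tensor([0, 1, 2, 3])  # class labels to condition on
print(G(z, y).shape)            # torch.Size([4, 784])
```

The discriminator $D(x, y)$ is conditioned the same way, so it can penalize samples that are realistic but mismatched to their label.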
Isola et al. combined conditional GANs with U-Net and L1 loss for paired image translation:
The discriminator operates on patches (PatchGAN) rather than the full image, judging whether local regions look real.
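A minimal sketch of the PatchGAN idea, with layer sizes chosen for illustration (the Pix2Pix paper's 70×70 PatchGAN is deeper):

```python
import torch
import torch.nn as nn

# Instead of one scalar per image, a PatchGAN-style discriminator emits a grid
# of scores, each judging the realism of one local receptive field.
patch_d = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(128, 1, 4, stride=1, padding=1),  # one logit per patch
)

x = torch.randn(1, 3, 64, 64)
print(patch_d(x).shape)  # torch.Size([1, 1, 15, 15]): a 15×15 grid of patch scores
```

Because the network is fully convolutional, the same weights apply to images of any size, and the loss averages over all patch scores.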
Zhu et al. solved an even harder problem: translating between image domains without paired examples. The key innovation is cycle consistency loss:
\[L_{\text{cycle}} = \|G_{B \to A}(G_{A \to B}(x_A)) - x_A\|_1 + \|G_{A \to B}(G_{B \to A}(x_B)) - x_B\|_1\]

If you translate a horse image to a zebra and back, you should get the original horse. This constraint, applied in both directions, prevents the generators from producing arbitrary outputs.
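The loss is just a pair of L1 terms. A minimal sketch, with identity maps standing in for the two translation networks:

```python
import torch
import torch.nn as nn

l1 = nn.L1Loss()

def cycle_consistency(G_ab, G_ba, x_a, x_b):
    # A -> B -> A should recover x_a; B -> A -> B should recover x_b.
    return l1(G_ba(G_ab(x_a)), x_a) + l1(G_ab(G_ba(x_b)), x_b)

# Toy "generators": identity maps are trivially cycle-consistent.
G_ab, G_ba = nn.Identity(), nn.Identity()
x_a, x_b = torch.randn(2, 3, 8, 8), torch.randn(2, 3, 8, 8)
print(cycle_consistency(G_ab, G_ba, x_a, x_b))  # tensor(0.)
```

In the full CycleGAN objective this term is added, with a large weight, to the adversarial losses of both generators.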
Famous applications: horses ↔ zebras, photographs ↔ Monet-style paintings, summer ↔ winter scenes.
Arjovsky et al. replaced the traditional GAN objective with the Wasserstein distance (Earth Mover’s distance):
\[\min_G \max_{D \in \mathcal{D}} \; \mathbb{E}_{x \sim p_{\text{data}}}[D(x)] - \mathbb{E}_{z \sim p_z}[D(G(z))]\]

Key advantages:

- The critic's loss correlates with sample quality, giving a meaningful signal for monitoring training.
- Training is markedly more stable and less sensitive to architecture and hyperparameter choices.
- Mode collapse is substantially reduced.
The constraint $D \in \mathcal{D}$ (Lipschitz constraint) is enforced through gradient penalty (WGAN-GP):
\[\lambda \, \mathbb{E}_{\hat{x}}[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2]\]

where $\hat{x}$ is sampled along straight lines between real and generated data points.
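A sketch of the penalty for vector-valued data (for images the interpolation coefficient would need to broadcast over channel and spatial dimensions); the linear critic is a toy stand-in:

```python
import torch
import torch.nn as nn

def gradient_penalty(D, real, fake, lam=10.0):
    """WGAN-GP penalty: push ||grad_x D(x_hat)||_2 toward 1 at points x_hat
    sampled on straight lines between real and generated samples."""
    eps = torch.rand(real.size(0), 1)                       # per-sample mixing coefficient
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    d_out = D(x_hat)
    # create_graph=True so the penalty itself can be backpropagated.
    grads, = torch.autograd.grad(d_out.sum(), x_hat, create_graph=True)
    return lam * ((grads.norm(2, dim=1) - 1) ** 2).mean()

D = nn.Linear(2, 1)  # toy critic: no sigmoid, WGAN critics output raw scores
real, fake = torch.randn(8, 2), torch.randn(8, 2)
gp = gradient_penalty(D, real, fake)
gp.backward()        # this term is added to the critic's loss each step
```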
Karras et al. from NVIDIA introduced progressive growing: start with 4×4 resolution and gradually add layers during training to reach higher resolutions (up to 1024×1024).
This training strategy was crucial for generating high-resolution images — training a GAN directly at high resolution was unstable.
Karras et al. continued with StyleGAN, which became the state of the art for face generation:

- A mapping network first transforms $z$ into an intermediate latent space $\mathcal{W}$, which disentangles factors of variation.
- "Styles" derived from the latent code are injected at every resolution via adaptive instance normalization (AdaIN).
- Per-pixel noise inputs add stochastic detail such as hair placement and freckles.
- Style mixing lets coarse attributes (pose, face shape) and fine attributes (color, texture) be controlled separately.
StyleGAN2 removed artifacts and improved quality further, producing faces indistinguishable from real photographs.
Miyato et al. constrained the discriminator’s Lipschitz constant by normalizing weight matrices by their spectral norm (largest singular value). This simple technique stabilized training across many GAN variants.
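PyTorch ships this as `torch.nn.utils.spectral_norm`, which runs one power-iteration step per forward pass to estimate the largest singular value. A quick check (layer size arbitrary) that the normalized weight's spectral norm is driven toward 1:

```python
import torch
import torch.nn as nn

# Wrap a layer with spectral normalization, as in SN-GAN: the weight is divided
# by an estimate of its largest singular value, bounding the layer's Lipschitz
# constant by roughly 1.
layer = nn.utils.spectral_norm(nn.Linear(64, 64))

# Each forward pass in training mode runs one power-iteration update,
# refining the singular-value estimate.
for _ in range(50):
    _ = layer(torch.randn(5, 64))

sigma = torch.linalg.svdvals(layer.weight.detach())[0]
print(round(sigma.item(), 2))  # close to 1.0
```

In practice spectral normalization is applied to every layer of the discriminator, and unlike WGAN-GP it adds no extra loss term.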
The GAN framework inspired hundreds of variants:
| GAN Variant | Innovation | Application |
|---|---|---|
| DCGAN | CNN architecture guidelines | General image generation |
| cGAN | Conditional generation | Controlled generation |
| Pix2Pix | Paired image translation | Image-to-image |
| CycleGAN | Unpaired translation with cycle loss | Style transfer |
| WGAN-GP | Wasserstein distance + gradient penalty | Stable training |
| ProGAN | Progressive growing | High-resolution images |
| StyleGAN | Style-based generation | Face synthesis |
| BigGAN | Large-scale class-conditional | ImageNet generation |
| StarGAN | Multi-domain image translation | Attribute transfer |
| Model | Training | Sample Quality | Diversity | Likelihood |
|---|---|---|---|---|
| GAN | Adversarial | Excellent (sharp) | Mode collapse risk | Not computed |
| VAE | Reconstruction + KL | Good (blurry) | Good | Approximate |
| Flow | Maximum likelihood | Good | Good | Exact |
| Diffusion | Denoising | Excellent | Excellent | Approximate |
GANs produce the sharpest images but suffer from training instability and mode collapse. Diffusion models (Chapter 13) eventually surpassed GANs in both quality and diversity, though GANs remain faster at inference time.
A minimal DCGAN-style generator, mapping a latent vector to a 32×32 image:

```python
# See code/ch09_gan.py for full training pipeline
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=100, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            # Input: latent_dim → 4×4×512
            nn.ConvTranspose2d(latent_dim, 512, 4, 1, 0),
            nn.BatchNorm2d(512), nn.ReLU(),
            # 4×4 → 8×8
            nn.ConvTranspose2d(512, 256, 4, 2, 1),
            nn.BatchNorm2d(256), nn.ReLU(),
            # 8×8 → 16×16
            nn.ConvTranspose2d(256, 128, 4, 2, 1),
            nn.BatchNorm2d(128), nn.ReLU(),
            # 16×16 → 32×32
            nn.ConvTranspose2d(128, channels, 4, 2, 1),
            nn.Tanh(),
        )

    def forward(self, z):
        # z: (batch, latent_dim) → reshape to (batch, latent_dim, 1, 1)
        return self.net(z.view(z.size(0), -1, 1, 1))
```
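For completeness, a matching DCGAN-style discriminator sketch that mirrors the generator, downsampling a 32×32 image back to a single real/fake probability (channel widths chosen to mirror the generator above):

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """DCGAN-style discriminator: strided convolutions, LeakyReLU,
    batch norm everywhere except the first layer."""
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            # 32×32 → 16×16
            nn.Conv2d(channels, 128, 4, 2, 1), nn.LeakyReLU(0.2),
            # 16×16 → 8×8
            nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2),
            # 8×8 → 4×4
            nn.Conv2d(256, 512, 4, 2, 1), nn.BatchNorm2d(512), nn.LeakyReLU(0.2),
            # 4×4 → 1×1 score
            nn.Conv2d(512, 1, 4, 1, 0), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x).view(-1, 1)

D = Discriminator()
print(D(torch.randn(2, 3, 32, 32)).shape)  # torch.Size([2, 1])
```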
By 2023, diffusion models had surpassed GANs for most generative tasks, but GANs left a lasting impact: adversarial losses survive as components of other systems (for example, the discriminator term used when training VQGAN-style autoencoders and neural vocoders), GANs remain attractive wherever fast single-pass sampling matters, and the adversarial-training idea spread well beyond generation into domain adaptation and robustness research.
Previous Chapter: Attention Mechanisms and the Transformer
Next Chapter: Transfer Learning and Foundation Models