Chapter 14: Optimization Advances — Making Training Practical

The Unsung Heroes of Deep Learning

While architectures and datasets get the headlines, optimization is what makes everything work. The best architecture with the wrong optimizer, learning rate, or training strategy will fail completely. This chapter covers the optimization advances that run like a thread through the entire history of deep learning.

Stochastic Gradient Descent (SGD)

The Foundation

All of deep learning training is built on gradient descent: move parameters in the direction that reduces the loss:

\[\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)\]

Full-batch gradient descent computes the gradient over the entire dataset — prohibitively expensive for large datasets. Stochastic Gradient Descent (SGD) approximates the gradient using a single sample or small mini-batch:

\[\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t; x_i, y_i)\]

The noise in the gradient estimate is actually beneficial: it helps training escape saddle points and sharp local minima, and acts as an implicit regularizer that often improves generalization.

Mini-Batch SGD

In practice, we use mini-batches of 32–4096 samples, balancing gradient accuracy against update frequency: larger batches give less noisy gradients and better hardware utilization, while smaller batches update more often and retain helpful noise.
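As a sketch, here is mini-batch SGD on a toy least-squares problem (a NumPy illustration, not the book's code; the data sizes and hyperparameters are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear regression: minimize mean of (x_i @ w - y_i)^2 with mini-batch SGD.
X = rng.normal(size=(1000, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.01 * rng.normal(size=1000)

w = np.zeros(5)
lr, batch_size = 0.1, 32

for step in range(500):
    idx = rng.integers(0, len(X), size=batch_size)  # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size    # noisy gradient estimate
    w -= lr * grad                                   # SGD update

print(np.linalg.norm(w - true_w))  # small: w is close to true_w
```

Each step uses only 32 of the 1000 samples, yet the iterates converge to (a neighborhood of) the full-batch solution.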

SGD with Momentum (1964/1999)

Plain SGD oscillates in narrow valleys of the loss landscape. Momentum smooths the trajectory by accumulating a running average of gradients:

\[v_t = \mu v_{t-1} + \nabla_\theta L(\theta_t)\]

\[\theta_{t+1} = \theta_t - \eta v_t\]

where $\mu \approx 0.9$ is the momentum coefficient.

The physical analogy: a ball rolling down a hill accumulates momentum, speeding through flat regions and damping oscillations in steep valleys.
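The two update equations can be sketched directly (a NumPy toy, with an assumed narrow-valley objective $f(x,y) = x^2 + 10y^2$ chosen for illustration):

```python
import numpy as np

def sgd_momentum_step(theta, v, grad, lr=0.01, mu=0.9):
    """One SGD-with-momentum update."""
    v = mu * v + grad        # v_t = mu * v_{t-1} + g_t
    theta = theta - lr * v   # theta_{t+1} = theta_t - eta * v_t
    return theta, v

# Minimize f(x, y) = x^2 + 10*y^2 -- a narrow valley where plain SGD oscillates.
theta = np.array([5.0, 5.0])
v = np.zeros(2)
for _ in range(200):
    grad = np.array([2 * theta[0], 20 * theta[1]])  # analytic gradient of f
    theta, v = sgd_momentum_step(theta, v, grad)

print(float(np.linalg.norm(theta)))  # near zero: converged to the minimum
```

The momentum term damps the oscillation along the steep $y$ direction while accelerating progress along the shallow $x$ direction.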

Nesterov Momentum (1983/2013)

A refinement that “looks ahead” before computing the gradient:

\[v_t = \mu v_{t-1} + \nabla_\theta L(\theta_t - \eta \mu v_{t-1}), \qquad \theta_{t+1} = \theta_t - \eta v_t\]

Instead of computing the gradient at the current position, compute it at the anticipated next position. This provides a form of gradient correction that slightly improves convergence.

SGD with momentum (and sometimes Nesterov) remains competitive with adaptive methods and is still the optimizer of choice for training CNNs (ResNets, EfficientNets).

Adaptive Learning Rate Methods

The Problem

Different parameters need different learning rates: weights tied to rare, sparse features receive few gradient updates and benefit from larger steps, while frequently updated weights need smaller ones.

AdaGrad (2011)

Duchi et al. introduced per-parameter learning rates that decrease for frequently updated parameters:

\[\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \cdot g_{t,i}\]

where $G_{t,ii} = \sum_{\tau=1}^{t} g_{\tau,i}^2$ is the sum of all past squared gradients for parameter $i$.

Problem: The accumulated squared gradients grow monotonically, causing the learning rate to shrink to zero eventually.
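The shrinking-step problem is easy to see numerically (an illustrative sketch with a constant gradient, not a realistic training signal):

```python
import numpy as np

def adagrad_step(theta, G, grad, lr=0.5, eps=1e-8):
    """AdaGrad: per-parameter step size eta / sqrt(sum of past squared grads)."""
    G = G + grad**2                               # accumulate squared gradients
    theta = theta - lr / np.sqrt(G + eps) * grad  # per-parameter scaled update
    return theta, G

theta, G = np.array([1.0]), np.zeros(1)
steps = []
for t in range(1000):
    grad = np.array([1.0])                        # constant gradient, for illustration
    new_theta, G = adagrad_step(theta, G, grad)
    steps.append(abs(new_theta - theta)[0])
    theta = new_theta

# Effective step size decays like 1/sqrt(t).
print(steps[0], steps[999])  # first step = 0.5, 1000th step = 0.5/sqrt(1000) ~ 0.016
```

After 1000 identical gradients the effective step has shrunk by a factor of about 30, and it keeps shrinking forever.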

RMSProp (2012)

Hinton proposed fixing AdaGrad by using an exponential moving average instead of a sum:

\[v_t = \rho v_{t-1} + (1 - \rho) g_t^2\]

\[\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t + \epsilon}} g_t\]

The learning rate adapts but doesn’t decay to zero. RMSProp was introduced informally in a lecture, never published in a paper, yet became widely used.

Adam: Adaptive Moment Estimation (2014)

Kingma and Ba combined momentum with adaptive learning rates:

\[m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \quad \text{(first moment / momentum)}\]

\[v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \quad \text{(second moment / adaptive LR)}\]

With bias correction (critical for early training steps):

\[\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}\]

Update rule:

\[\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t\]

Default hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.

Adam became the default optimizer for most of deep learning due to its robustness and minimal hyperparameter tuning.

AdamW: Decoupled Weight Decay (2017)

Loshchilov and Hutter showed that L2 regularization in Adam doesn’t work as intended because the adaptive learning rate scales the regularization term differently for different parameters. AdamW applies weight decay directly:

\[\theta_{t+1} = (1 - \eta \lambda) \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t\]

AdamW is the standard optimizer for training Transformers and LLMs.
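The decoupling is visible in code: the decay term multiplies $\theta$ directly instead of being added to the gradient, so it bypasses the adaptive denominator (a NumPy sketch of the update rule above; the zero-gradient demo is an illustration):

```python
import numpy as np

def adamw_step(theta, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    """AdamW: weight decay acts on theta directly, NOT through the gradient,
    so it is not rescaled per-parameter by sqrt(v_hat)."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    theta = (1 - lr * wd) * theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# With a zero gradient, AdamW still shrinks weights multiplicatively:
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 101):
    theta, m, v = adamw_step(theta, m, v, np.zeros(1), t, lr=0.1, wd=0.1)
print(theta[0])  # 0.99**100 ~ 0.366
```

With L2 regularization folded into the gradient, the same decay would be divided by $\sqrt{\hat{v}}$, giving parameters with large gradient history almost no regularization.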

Comparison of Optimizers

Optimizer        Adaptive LR      Momentum             Best For
---------        -----------      --------             --------
SGD              No               Optional             CNNs, when you can tune LR well
SGD + Momentum   No               Yes                  CNNs (ResNet training)
Adam             Yes              Yes                  General default, Transformers
AdamW            Yes              Yes + decoupled WD   LLMs, Transformers
Adafactor        Yes (factored)   Optional             Memory-efficient LLM training

Learning Rate Scheduling

The learning rate is arguably the single most important hyperparameter. Too high and training diverges; too low and it converges slowly or gets stuck.

Step Decay

Reduce the learning rate by a factor at fixed epochs:

\[\eta_t = \eta_0 \cdot \gamma^{\lfloor t / S \rfloor}\]

Simple and effective, used in many ResNet training recipes.

Cosine Annealing (2016)

Loshchilov and Hutter proposed a smooth cosine schedule:

\[\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{t}{T}\pi\right)\right)\]

Cosine annealing became the dominant schedule for modern training.

Warmup

Start with a very small learning rate and linearly increase it for the first few thousand steps:

\[\eta_t = \eta_{\max} \cdot \frac{t}{T_{\text{warmup}}}\]

Why warmup helps: At initialization, the model’s parameters are random and gradients are large and noisy. A high learning rate could cause immediate divergence. Warmup lets the model find a reasonable region of the loss landscape before turning up the learning rate.

Warmup is essential for training Transformers and is used in virtually all modern training recipes.

Warmup + Cosine Decay (The Modern Standard)

The combination of linear warmup followed by cosine decay is the standard for most modern deep learning:

Learning Rate
    │
    │        ╱‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾╲
    │       ╱                   ╲
    │      ╱                     ╲
    │     ╱                       ╲
    │    ╱                         ╲
    │   ╱                           ╲
    │──╱                             ╲──
    └────────────────────────────────────
         Warmup      Cosine Decay
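The two phases combine into a single schedule function (an illustrative sketch; the step counts and peak learning rate here are arbitrary):

```python
import math

def warmup_cosine_lr(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps  # linear warmup phase
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Peak at the end of warmup, decaying to near zero at the end of training.
lrs = [warmup_cosine_lr(s, 3e-4, 1000, 10000) for s in range(10000)]
print(max(lrs), lrs[-1])  # peak is 3e-4; final LR is ~0
```

Framework schedulers (e.g., a warmup scheduler chained with `CosineAnnealingLR` in PyTorch) implement the same curve; the closed form above is just the transparent version.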

One-Cycle Policy (2017)

Smith and Topin proposed a “super-convergence” schedule:

  1. Warmup to a high learning rate
  2. Decay to a very low learning rate (lower than the initial)

This often achieves better results in fewer epochs than traditional schedules.

Gradient Clipping

For RNNs and Transformers, gradient norms can spike unpredictably. Gradient clipping caps the gradient norm:

\[g \leftarrow \begin{cases} g & \text{if } \|g\| \leq \tau \\ \frac{\tau}{\|g\|} g & \text{if } \|g\| > \tau \end{cases}\]

Typical values: $\tau = 1.0$ for language models. This is a simple but essential safeguard that prevents training instability.
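The piecewise rule above translates directly into code (a NumPy sketch mirroring what `torch.nn.utils.clip_grad_norm_` does across a list of gradient arrays):

```python
import numpy as np

def clip_grad_norm(grads, max_norm=1.0):
    """Rescale all gradients jointly if their global L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g**2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm      # tau / ||g||
        grads = [g * scale for g in grads]
    return grads, total_norm

grads = [np.array([3.0, 4.0])]             # global norm = 5
clipped, norm = clip_grad_norm(grads, max_norm=1.0)
print(norm, np.linalg.norm(clipped[0]))    # 5.0, 1.0
```

Note the norm is computed over all parameters jointly, so clipping preserves the gradient's direction and only caps its magnitude.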

Mixed-Precision Training (2017)

Micikevicius et al. showed that training can use 16-bit floating point (FP16) for most operations while keeping a 32-bit master copy of the weights:

  1. Store a master copy of the weights in FP32
  2. Cast to FP16 for the forward and backward pass
  3. Scale the loss up before backpropagation so small gradients don't underflow in FP16
  4. Compute gradients in FP16, then unscale them
  5. Update the FP32 master weights

Benefits: roughly half the memory for activations and gradients, and substantially higher throughput on hardware with dedicated FP16 units (e.g., NVIDIA Tensor Cores).

BFloat16

Google’s Brain Float16 format uses 8 exponent bits (same as FP32) and 7 mantissa bits. This provides the same dynamic range as FP32 with FP16 storage, reducing the need for loss scaling. BFloat16 is now the standard for LLM training.

Large-Batch Training

The Challenge

Scaling to multiple GPUs means using larger batch sizes. But naively increasing batch size often degrades model quality.

Learning Rate Scaling

The linear scaling rule: when you multiply the batch size by $k$, multiply the learning rate by $k$ too.

\[\eta_{\text{large}} = k \cdot \eta_{\text{base}}\]

This compensates for the reduced noise in larger-batch gradients.

LARS and LAMB (2017–2019)

Layer-wise Adaptive Rate Scaling (LARS) and its Adam variant (LAMB) compute per-layer learning rates based on the ratio of weight norm to gradient norm:

\[\eta_l = \eta \cdot \frac{\|w_l\|}{\|g_l\|}\]

LAMB enabled training BERT in 76 minutes using 1,024 TPUs, achieving the same quality as the original multi-day training.
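The per-layer scaling can be sketched as follows (a simplified NumPy illustration of the ratio above; the full LARS rule also includes a trust coefficient and a weight-decay term, omitted here):

```python
import numpy as np

def lars_layer_lr(w, g, base_lr, eps=1e-9):
    """Simplified LARS: scale the base LR by the layer's weight/gradient
    norm ratio (full LARS adds a trust coefficient and weight decay)."""
    ratio = np.linalg.norm(w) / (np.linalg.norm(g) + eps)
    return base_lr * ratio

# A layer with large weights but tiny gradients gets a larger effective LR,
# and vice versa -- each layer takes steps proportional to its own scale.
w_big, g_small = np.ones(100), 0.001 * np.ones(100)
w_small, g_big = 0.01 * np.ones(100), np.ones(100)
print(lars_layer_lr(w_big, g_small, 0.1))   # ~100   (ratio 1000 x base 0.1)
print(lars_layer_lr(w_small, g_big, 0.1))   # ~0.001 (ratio 0.01 x base 0.1)
```

This keeps each layer's update a roughly constant fraction of its weight norm, which is what stabilizes very large batch sizes.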

Gradient Accumulation

When GPU memory is insufficient for the desired batch size, gradient accumulation simulates large batches:

for i, batch in enumerate(dataloader):
    loss = model(batch) / accumulation_steps  # normalize so summed grads average
    loss.backward()                           # gradients accumulate in .grad
    if (i + 1) % accumulation_steps == 0:     # update once per effective batch
        optimizer.step()
        optimizer.zero_grad()

This allows training with effective batch sizes of thousands on a single GPU.

Modern Training Infrastructure

Data Parallelism

Distribute batches across GPUs, each computing gradients independently, then average:

\[g = \frac{1}{K} \sum_{k=1}^{K} g_k\]
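A quick NumPy sanity check (an illustrative sketch, with made-up data) shows why this works: averaging the per-shard gradients of equal-sized shards recovers exactly the gradient over the full global batch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
w = rng.normal(size=3)
y = X @ rng.normal(size=3)

def grad(Xb, yb, w):
    """Mean-squared-error gradient over a batch."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(Xb)

# Split the global batch across K simulated "GPUs", compute local gradients,
# then average them (the role of the all-reduce in real data parallelism).
K = 4
shards = np.split(np.arange(64), K)
local = [grad(X[s], y[s], w) for s in shards]
averaged = np.mean(local, axis=0)

print(np.allclose(averaged, grad(X, y, w)))  # True
```

In a real system the averaging step is an all-reduce over the network, which is why interconnect bandwidth matters for data-parallel scaling.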

Model Parallelism

For models too large to fit on one GPU, split the model across GPUs: by layer (pipeline parallelism) or within each layer's weight matrices (tensor parallelism).

ZeRO (2019)

Rajbhandari et al. introduced Zero Redundancy Optimizer, which eliminates memory redundancy in data parallelism by partitioning optimizer states, gradients, and parameters across GPUs. ZeRO enabled training models with hundreds of billions of parameters.

Code Example: Training Loop with Modern Best Practices

# See code/ch14_optimization.py for full implementation
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

optimizer = AdamW(model.parameters(), lr=3e-4,
                  weight_decay=0.01, betas=(0.9, 0.95))
scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)

scaler = torch.amp.GradScaler()  # Mixed precision
for batch in dataloader:
    optimizer.zero_grad(set_to_none=True)
    with torch.amp.autocast(device_type='cuda'):
        loss = model(batch)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # unscale so clipping sees true gradient norms
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()

Key Takeaways

SGD with momentum remains strong for CNNs, while AdamW is the default for Transformers and LLMs. The learning rate is the single most important hyperparameter, and linear warmup followed by cosine decay is the modern standard schedule. Gradient clipping, mixed precision (increasingly BFloat16), and gradient accumulation are routine safeguards and efficiency tools. Large-batch techniques (linear LR scaling, LARS/LAMB) and memory partitioning (ZeRO) made training today's largest models practical.

Previous Chapter: Diffusion Models

Next Chapter: Future Directions — What Comes Next