While architectures and datasets get the headlines, optimization is what makes everything work. The best architecture with the wrong optimizer, learning rate, or training strategy will fail completely. This chapter covers the optimization advances that run like a thread through the entire history of deep learning.
All of deep learning training is built on gradient descent: move parameters in the direction that reduces the loss:
\[\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)\]Full-batch gradient descent computes the gradient over the entire dataset — prohibitively expensive for large datasets. Stochastic Gradient Descent (SGD) approximates the gradient using a single sample or small mini-batch:
\[\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t; x_i, y_i)\]The noise in the gradient estimate is actually beneficial: it helps the optimizer escape saddle points and sharp minima, and it acts as an implicit regularizer that tends toward flatter, better-generalizing solutions.
In practice, we use mini-batches of 32–4096 samples, balancing the variance of the gradient estimate against hardware parallelism, memory, and wall-clock throughput.
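As a toy illustration (not from the chapter's companion code), here is a single-parameter SGD loop on the quadratic loss \(L(\theta) = (\theta - 3)^2\), with Gaussian noise standing in for mini-batch sampling; `sgd_step` is a hypothetical helper:

```python
import random

def sgd_step(theta, lr, grad):
    # theta_{t+1} = theta_t - eta * g_t
    return theta - lr * grad

random.seed(0)
theta, lr = 0.0, 0.1
for _ in range(200):
    true_grad = 2 * (theta - 3.0)                   # gradient of (theta - 3)^2
    noisy_grad = true_grad + random.gauss(0, 0.5)   # "mini-batch" noise
    theta = sgd_step(theta, lr, noisy_grad)

print(round(theta, 1))  # hovers near the minimum at theta = 3
```

Despite the noise, the iterates settle into a neighborhood of the minimum; shrinking the learning rate would shrink that neighborhood.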
Plain SGD oscillates in narrow valleys of the loss landscape. Momentum smooths the trajectory by accumulating a running average of gradients:
\(v_t = \mu v_{t-1} + \nabla_\theta L(\theta_t)\) \(\theta_{t+1} = \theta_t - \eta v_t\)
where $\mu \approx 0.9$ is the momentum coefficient.
The physical analogy: a ball rolling down a hill accumulates momentum, speeding through flat regions and damping oscillations in steep valleys.
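The two momentum equations above can be sketched on the same kind of toy quadratic; `momentum_step` is a hypothetical helper, not a library function:

```python
def momentum_step(theta, v, lr, mu, grad):
    v = mu * v + grad        # v_t = mu * v_{t-1} + g_t
    theta = theta - lr * v   # theta_{t+1} = theta_t - eta * v_t
    return theta, v

theta, v = 0.0, 0.0
for _ in range(100):
    grad = 2 * (theta - 3.0)   # gradient of (theta - 3)^2
    theta, v = momentum_step(theta, v, lr=0.05, mu=0.9, grad=grad)
print(round(theta, 2))  # close to the minimum at theta = 3
```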
Nesterov momentum is a refinement that “looks ahead” before computing the gradient:
\[v_t = \mu v_{t-1} + \nabla_\theta L(\theta_t - \eta \mu v_{t-1})\]Instead of computing the gradient at the current position, compute it at the anticipated next position. This provides a form of gradient correction that slightly improves convergence.
SGD with momentum (and sometimes Nesterov) remains competitive with adaptive methods and is still the optimizer of choice for training CNNs (ResNets, EfficientNets).
Different parameters need different learning rates: parameters tied to frequent features receive many large updates and benefit from smaller steps, while parameters tied to rare features (e.g., embeddings of rare words) are updated seldom and need larger steps.
Duchi et al. introduced AdaGrad, which gives each parameter its own learning rate that decreases as that parameter accumulates updates:
\[\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \cdot g_{t,i}\]where $G_{t,ii} = \sum_{\tau=1}^{t} g_{\tau,i}^2$ is the sum of all past squared gradients for parameter $i$.
Problem: The accumulated squared gradients grow monotonically, causing the learning rate to shrink to zero eventually.
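A quick numerical sketch of the decay problem (a toy, assuming a constant unit gradient): the effective step size \(\eta / \sqrt{G_t + \epsilon}\) shrinks like \(1/\sqrt{t}\).

```python
import math

eta, eps = 0.5, 1e-8
G = 0.0
steps = []
for t in range(1, 1001):
    g = 1.0            # constant gradient magnitude
    G += g * g         # G_t = sum of past squared gradients
    steps.append(eta / math.sqrt(G + eps))

print(round(steps[0], 3), round(steps[-1], 4))  # 0.5 0.0158
```

After 1,000 steps the effective learning rate has fallen by a factor of ~30, and it keeps shrinking forever.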
Hinton proposed fixing AdaGrad by using an exponential moving average instead of a sum:
\(v_t = \rho v_{t-1} + (1 - \rho) g_t^2\) \(\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t + \epsilon}} g_t\)
The learning rate adapts but doesn’t decay to zero. RMSProp was introduced informally in a lecture, never published in a paper, yet became widely used.
Kingma and Ba's Adam combined momentum with adaptive learning rates:
\(m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \quad \text{(first moment / momentum)}\) \(v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \quad \text{(second moment / adaptive LR)}\)
With bias correction (critical for early training steps):
\[\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}\]Update rule:
\[\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t\]Default hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
Adam became the default optimizer for most of deep learning due to its robustness and minimal hyperparameter tuning.
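A minimal single-parameter Adam step following the update rule above (a sketch, not the reference implementation; `adam_step` is a hypothetical helper). It shows why bias correction matters: without it, \(m_1 = (1 - \beta_1) g\) would be only 10% of the first gradient.

```python
import math

def adam_step(theta, m, v, t, g, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g            # first moment (momentum)
    v = b2 * v + (1 - b2) * g * g        # second moment (adaptive LR)
    m_hat = m / (1 - b1 ** t)            # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)            # bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = adam_step(0.0, 0.0, 0.0, t=1, g=5.0)
print(round(theta, 4))  # -0.001: the first step has magnitude ~lr regardless of g
```

After bias correction, \(\hat{m}_1 / \sqrt{\hat{v}_1} = g / |g|\), so the very first update has magnitude close to \(\eta\) whatever the gradient scale.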
Loshchilov and Hutter showed that L2 regularization in Adam doesn’t work as intended because the adaptive learning rate scales the regularization term differently for different parameters. AdamW applies weight decay directly:
\[\theta_{t+1} = (1 - \eta \lambda) \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t\]AdamW is the standard optimizer for training Transformers and LLMs.
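A sketch of the decoupled update (the `adamw_step` helper is hypothetical): the decay multiplies \(\theta\) directly and never passes through the adaptive denominator.

```python
import math

def adamw_step(theta, m, v, t, g, lr=1e-3, wd=0.01,
               b1=0.9, b2=0.999, eps=1e-8):
    theta = theta * (1 - lr * wd)        # decoupled weight decay
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

theta, m, v = adamw_step(1.0, 0.0, 0.0, t=1, g=0.0)
print(theta)  # 0.99999 — with a zero gradient, only the decay acts
```

With L2-in-the-gradient, the same decay term would be divided by \(\sqrt{\hat{v}_t}\), making the effective regularization strength depend on each parameter's gradient history.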
| Optimizer | Adaptive LR | Momentum | Best For |
|---|---|---|---|
| SGD | No | Optional | CNNs, when you can tune LR well |
| SGD + Momentum | No | Yes | CNNs (ResNet training) |
| Adam | Yes | Yes | General default, Transformers |
| AdamW | Yes | Yes + decoupled WD | LLMs, Transformers |
| Adafactor | Yes (factored) | Optional | Memory-efficient LLM training |
The learning rate is arguably the single most important hyperparameter. Too high and training diverges; too low and it converges slowly or gets stuck.
Reduce the learning rate by a factor at fixed epochs:
\[\eta_t = \eta_0 \cdot \gamma^{\lfloor t / S \rfloor}\]Simple and effective, used in many ResNet training recipes.
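The formula is a one-liner; this toy sketch (the `step_decay` helper is hypothetical) divides the learning rate by 10 every 30 epochs, as in classic ResNet recipes:

```python
def step_decay(eta0, gamma, S, t):
    # eta_t = eta0 * gamma ** floor(t / S)
    return eta0 * gamma ** (t // S)

print([round(step_decay(0.1, 0.1, 30, e), 4) for e in (0, 29, 30, 60, 90)])
# [0.1, 0.1, 0.01, 0.001, 0.0001]
```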
Loshchilov and Hutter proposed a smooth cosine schedule:
\[\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{t}{T}\pi\right)\right)\]Cosine annealing became the dominant schedule for modern training.
Start with a very small learning rate and linearly increase it for the first few thousand steps:
\[\eta_t = \eta_{\max} \cdot \frac{t}{T_{\text{warmup}}}\]Why warmup helps: At initialization, the model’s parameters are random and gradients are large and noisy. A high learning rate could cause immediate divergence. Warmup lets the model settle into a reasonable region of the loss landscape before the learning rate reaches its full value.
Warmup is essential for training Transformers and is used in virtually all modern training recipes.
The combination of linear warmup followed by cosine decay is the standard for most modern deep learning:
```
Learning Rate
│
│       ╱‾‾‾‾‾‾╲
│      ╱        ╲
│     ╱          ╲
│    ╱            ╲
│   ╱              ╲
│──╱                ╲──
└──────────────────────────
  Warmup    Cosine Decay
```
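The combined schedule can be written as a small function (a toy version with illustrative values; production code would typically use a framework scheduler):

```python
import math

def warmup_cosine(t, total, warmup, eta_max=3e-4, eta_min=3e-5):
    if t < warmup:
        return eta_max * t / warmup                  # linear warmup
    progress = (t - warmup) / (total - warmup)       # in [0, 1]
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * progress))

lrs = [warmup_cosine(t, total=1000, warmup=100) for t in range(1000)]
print(max(lrs) == lrs[100])  # peak exactly at the end of warmup, then decay
```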
Smith and Topin proposed a “super-convergence” schedule (the one-cycle policy): the learning rate ramps up linearly to a large maximum over roughly the first half of training, then ramps back down, while momentum is cycled in the opposite direction. This often achieves better results in fewer epochs than traditional schedules.
For RNNs and Transformers, gradient norms can spike unpredictably. Gradient clipping caps the gradient norm:
\[g \leftarrow \begin{cases} g & \text{if } \|g\| \leq \tau \\ \frac{\tau}{\|g\|} g & \text{if } \|g\| > \tau \end{cases}\]Typical values: $\tau = 1.0$ for language models. This is a simple but essential safeguard that prevents training instability.
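The clipping rule above translates directly to code; this sketch operates on a plain list of gradient components (frameworks provide equivalents such as PyTorch's `clip_grad_norm_`):

```python
import math

def clip_by_norm(grads, tau):
    norm = math.sqrt(sum(g * g for g in grads))  # global L2 norm
    if norm <= tau:
        return grads                              # unchanged
    scale = tau / norm
    return [g * scale for g in grads]             # rescaled to norm tau

clipped = clip_by_norm([3.0, 4.0], tau=1.0)       # norm 5 -> rescaled to norm 1
print([round(g, 2) for g in clipped])             # [0.6, 0.8]
```

Note that the direction of the gradient is preserved; only its magnitude is capped.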
Micikevicius et al. showed that training can use 16-bit floating point (FP16) for most operations while keeping a 32-bit master copy of the weights:
Benefits: FP16 roughly halves memory for weights and activations, and runs substantially faster on tensor-core hardware. Because FP16 has a narrow dynamic range, the loss is multiplied by a scaling factor before the backward pass so that small gradients do not underflow to zero.
Google’s Brain Float16 format uses 8 exponent bits (same as FP32) and 7 mantissa bits. This provides the same dynamic range as FP32 with FP16 storage, reducing the need for loss scaling. BFloat16 is now the standard for LLM training.
Scaling to multiple GPUs means using larger batch sizes. But naively increasing batch size often degrades model quality.
The linear scaling rule: when you multiply the batch size by $k$, multiply the learning rate by $k$ too.
\[\eta_{\text{large}} = k \cdot \eta_{\text{base}}\]This compensates for the reduced noise in larger-batch gradients.
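A worked example of the rule, using the commonly cited base recipe of LR 0.1 at batch size 256 (the `scaled_lr` helper is hypothetical):

```python
def scaled_lr(base_lr, base_batch, new_batch):
    # eta_large = k * eta_base, where k = new_batch / base_batch
    return base_lr * (new_batch / base_batch)

print(scaled_lr(0.1, 256, 1024))  # 0.4
```

In practice the linear rule is usually combined with warmup, since the scaled learning rate can be too aggressive during the first few epochs.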
Layer-wise Adaptive Rate Scaling (LARS) and its Adam variant (LAMB) compute per-layer learning rates based on the ratio of weight norm to gradient norm:
\[\eta_l = \eta \cdot \frac{\|w_l\|}{\|g_l\|}\]LAMB enabled training BERT in 76 minutes using 1,024 TPUs, achieving the same quality as the original multi-day training.
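A sketch of the per-layer trust ratio from the formula above (full LARS also folds in a weight-decay term and a trust coefficient; this is only the bare ratio):

```python
import math

def lars_lr(eta, weights, grads):
    w_norm = math.sqrt(sum(w * w for w in weights))  # ||w_l||
    g_norm = math.sqrt(sum(g * g for g in grads))    # ||g_l||
    return eta * w_norm / g_norm                     # eta_l

# A layer whose gradient is small relative to its weights gets a larger step:
print(round(lars_lr(0.1, [3.0, 4.0], [0.05, 0.0]), 1))  # 10.0
```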
When GPU memory is insufficient for the desired batch size, gradient accumulation simulates large batches:
```python
for i, batch in enumerate(dataloader):
    loss = model(batch) / accumulation_steps  # scale so accumulated grads average
    loss.backward()                            # grads accumulate until zero_grad()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```
This allows training with effective batch sizes of thousands on a single GPU.
Distribute batches across GPUs, each computing gradients independently, then average:
\[g = \frac{1}{K} \sum_{k=1}^{K} g_k\]For models too large to fit on one GPU, split the model across GPUs: tensor parallelism slices individual weight matrices across devices, while pipeline parallelism places different layers on different devices.
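Returning to the data-parallel averaging equation above, here is a tiny numerical check (pure Python, two simulated "GPUs"): for a loss defined as a mean over samples, averaging per-shard gradients reproduces the full-batch gradient.

```python
xs = [1.0, 2.0, 3.0, 4.0]
theta = 0.5

def grad_on(batch, theta):
    # gradient of the mean squared loss L = mean((theta * x)^2) wrt theta
    return sum(2 * theta * x * x for x in batch) / len(batch)

full = grad_on(xs, theta)                       # full-batch gradient
shards = [xs[:2], xs[2:]]                       # two equal-size "GPUs"
averaged = sum(grad_on(s, theta) for s in shards) / len(shards)
print(full == averaged)  # True
```

This equivalence is why synchronous data parallelism (e.g., all-reduce averaging) trains the same model as single-device training with the combined batch.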
Rajbhandari et al. introduced the Zero Redundancy Optimizer (ZeRO), which eliminates memory redundancy in data parallelism by partitioning optimizer states, gradients, and parameters across GPUs. ZeRO enabled training models with hundreds of billions of parameters.
```python
# See code/ch14_optimization.py for full implementation
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

optimizer = AdamW(model.parameters(), lr=3e-4,
                  weight_decay=0.01, betas=(0.9, 0.95))
scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)
scaler = torch.amp.GradScaler()  # mixed-precision loss scaling

for batch in dataloader:
    optimizer.zero_grad(set_to_none=True)
    with torch.amp.autocast(device_type='cuda'):
        loss = model(batch)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)   # unscale before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
```