Chapter 12: Scaling Laws and Large Language Models

The Surprising Power of Scale

For decades, the deep learning community focused on architectural innovations — better layers, better training tricks, better losses. But starting around 2020, a startling empirical finding changed the conversation: simply making models bigger, training them longer, and feeding them more data produces predictably better performance.

This wasn’t just “more is better” in a vague sense. The improvements followed precise mathematical laws.

Scaling Laws (2020)

Kaplan et al. from OpenAI published “Scaling Laws for Neural Language Models,” showing that the cross-entropy loss of a language model follows a power law with respect to three factors:

\[L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}\]

where $N$ is the number of model parameters (excluding embeddings), $D$ is the dataset size in tokens, $C$ is the training compute, $N_c$, $D_c$, $C_c$ are fitted constants, and the exponents are small ($\alpha_N \approx 0.076$, $\alpha_D \approx 0.095$, $\alpha_C \approx 0.05$ in the original paper).

Key Findings

  1. Performance scales smoothly with model size, data, and compute — no diminishing returns within the tested range
  2. Model size matters most: For a fixed compute budget, it’s better to train a larger model for fewer steps than a smaller model for more steps
  3. Architecture details matter less: Width, depth, and head count have secondary effects compared to total parameter count
  4. The relationship is predictable: You can forecast the performance of a model you haven’t trained yet

Chinchilla Scaling Laws (2022)

Hoffmann et al. from DeepMind refined these scaling laws and found that models trained under Kaplan et al.'s recipe were undertrained: they used too many parameters relative to the amount of data.

Chinchilla’s rule: For compute-optimal training, the number of tokens should scale proportionally to the number of parameters:

\[D_{\text{optimal}} \approx 20 \times N\]

A 70B parameter model should see approximately 1.4 trillion tokens. Many existing models (including GPT-3 and PaLM) were trained with too few tokens relative to their size.

This insight led to Chinchilla (70B parameters, 1.4T tokens), which outperformed Gopher (280B parameters, 300B tokens) while being 4× smaller. It also influenced LLaMA’s training strategy.
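The arithmetic behind these claims can be sketched in a few lines. The 20-tokens-per-parameter ratio is an approximation of Hoffmann et al.'s fitted result, and the $C \approx 6ND$ FLOPs estimate is a common rule of thumb, not something derived in this chapter; function names are illustrative.

```python
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Compute-optimal token budget under the approximate rule D ≈ 20 * N."""
    return tokens_per_param * n_params

def training_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb training cost: C ≈ 6 * N * D FLOPs."""
    return 6.0 * n_params * n_tokens

n = 70e9                          # Chinchilla: 70B parameters
d = chinchilla_optimal_tokens(n)  # 1.4e12 tokens, matching the 1.4T figure above
c = training_flops(n, d)
```

Plugging in Gopher's 280B parameters instead shows it "should" have seen 5.6T tokens rather than the 300B it was trained on.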

The GPT Series: A Case Study in Scaling

GPT-1 (2018)

The original GPT (117M parameters) established the recipe: generative pre-training on unlabeled text, followed by supervised fine-tuning for each downstream task.

GPT-2 (2019)

GPT-2 scaled this to 1.5B parameters and showed that a large enough language model can perform many tasks zero-shot, with no fine-tuning at all.

GPT-3 (2020)

GPT-3 scaled to 175B parameters and demonstrated in-context learning: given a task description and a few examples directly in the prompt, it performs the task. For example:

Translate English to French:
sea otter => loutre de mer
plush toy => jouet en peluche
cheese => 

GPT-3 produces: “fromage”

Few-Shot, One-Shot, Zero-Shot

| Setting | Description | GPT-3 Performance |
|---|---|---|
| Zero-shot | Task description only | Good |
| One-shot | One example | Better |
| Few-shot | 5–100 examples | Often competitive with fine-tuned models |

This was profound: GPT-3 could perform tasks it was never explicitly trained for, simply from examples in the prompt. No gradient updates, no task-specific data.

InstructGPT and ChatGPT (2022)

GPT-3 was powerful but uncontrolled — it would follow harmful instructions, produce toxic content, or confidently make up facts. OpenAI addressed this through Reinforcement Learning from Human Feedback (RLHF):

  1. Supervised fine-tuning: Train on human-written demonstrations of helpful responses
  2. Reward model training: Train a model to predict which response humans prefer
  3. PPO optimization: Use the reward model to optimize the language model with Proximal Policy Optimization

The result was InstructGPT (and later ChatGPT), which was more helpful, less harmful, and better at following instructions — despite having fewer parameters than GPT-3.
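Step 2 of the pipeline above trains a reward model on pairwise human preferences. A minimal sketch of the standard Bradley–Terry-style preference loss follows; the function name and toy data are illustrative, not from the chapter.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).

    r_chosen / r_rejected are scalar reward-model scores for the
    human-preferred and dispreferred response in each comparison pair.
    The loss pushes the model to score the preferred response higher.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy batch of 3 preference pairs
chosen = torch.tensor([1.2, 0.5, 2.0])
rejected = torch.tensor([0.3, 0.4, -1.0])
loss = reward_model_loss(chosen, rejected)
```

The loss shrinks as the margin between preferred and dispreferred scores grows, which is exactly the signal PPO then optimizes against in step 3.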

GPT-4 (2023)

GPT-4 was a massive leap: it accepts image as well as text input, scores dramatically higher on professional and academic benchmarks (including passing a simulated bar exam), and shows far stronger multi-step reasoning. OpenAI disclosed neither its parameter count nor its architecture.

Emergent Abilities

Wei et al. (2022) documented emergent abilities: capabilities such as multi-digit arithmetic, word unscrambling, and multi-step reasoning that appear suddenly once models reach a certain scale.

These abilities don’t improve gradually — they appear to jump from near-zero to competent at critical scale thresholds. (Though later work debated whether this is an artifact of evaluation metrics.)

Mixture of Experts (MoE)

The Efficiency Problem

A 175B parameter model requires enormous compute for every token. But does every token really need all 175B parameters? A question about physics shouldn’t activate the same parameters as one about cooking.

The Solution: Sparse Activation

A Mixture of Experts model has many “expert” sub-networks but only activates a subset for each input:

\[y = \sum_{i=1}^{E} g_i(x) \cdot \text{Expert}_i(x)\]

where $g(x)$ is a gating function (a router) that selects which experts to activate:

\[g(x) = \text{TopK}(\text{softmax}(W_g x))\]
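The two equations above can be combined into a minimal dense-loop sketch of top-k routing. Real implementations use batched expert dispatch and capacity limits; all names here are illustrative.

```python
import torch
import torch.nn.functional as F

def moe_forward(x, experts, w_gate, k=2):
    """Route each token to its top-k experts and mix their outputs.

    x:       (batch, d_model) token representations
    experts: list of callables, each mapping (n, d_model) -> (n, d_model)
    w_gate:  (d_model, n_experts) router weight matrix
    """
    probs = F.softmax(x @ w_gate, dim=-1)          # (batch, n_experts)
    topk_probs, topk_idx = probs.topk(k, dim=-1)   # keep the k best experts per token
    topk_probs = topk_probs / topk_probs.sum(-1, keepdim=True)  # renormalize gates

    out = torch.zeros_like(x)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = topk_idx[:, slot] == e          # tokens whose slot-th choice is expert e
            if mask.any():
                out[mask] += topk_probs[mask, slot, None] * expert(x[mask])
    return out

torch.manual_seed(0)
experts = [torch.nn.Linear(16, 16) for _ in range(4)]
x = torch.randn(8, 16)
w_gate = torch.randn(16, 4)
y = moe_forward(x, experts, w_gate, k=2)
```

With k=2 and 4 experts, each token's forward pass touches only half the expert parameters, which is the source of MoE's compute savings.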

Switch Transformer (2021)

Fedus et al. simplified MoE by routing each token to exactly one expert (top-1 routing). With up to 1.6 trillion total parameters but only a small fraction active for any given token, the Switch Transformer achieved better performance than comparable dense models at lower computational cost per token.

Mixtral (2023)

Mistral AI’s Mixtral 8x7B used 8 expert networks of ~7B parameters each, with top-2 routing. Despite having 47B total parameters, each token only uses ~13B, making it faster than a dense 47B model while performing comparably to GPT-3.5.

Load Balancing in MoE

A major challenge: the router may send most tokens to the same few experts, leaving the rest untrained. Common solutions:

  1. Auxiliary load-balancing loss: Penalize the router when tokens are distributed unevenly across experts
  2. Expert capacity limits: Cap how many tokens each expert processes per batch, dropping or bypassing the overflow
  3. Router noise: Add noise to the gating logits during training to encourage exploration
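One widely used fix is the Switch Transformer's auxiliary load-balancing loss, sketched below in simplified form (variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor) -> torch.Tensor:
    """Switch-style auxiliary loss: n_experts * sum_i f_i * P_i.

    f_i = fraction of tokens whose top-1 choice is expert i
    P_i = mean router probability assigned to expert i
    The product is minimized (value ~1) when both are uniform,
    and grows toward n_experts as routing collapses onto one expert.
    """
    n_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)             # (tokens, n_experts)
    top1 = probs.argmax(dim=-1)                          # hard assignment per token
    f = F.one_hot(top1, n_experts).float().mean(dim=0)   # fraction routed to each expert
    p = probs.mean(dim=0)                                # mean gate probability per expert
    return n_experts * torch.sum(f * p)
```

Adding a small multiple of this term to the training loss nudges the router toward using all experts roughly equally.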

Open-Source LLMs

The LLM landscape evolved from closed-source dominance to a vibrant open-source ecosystem:

| Model | Organization | Parameters | Key Innovation |
|---|---|---|---|
| LLaMA (2023) | Meta | 7B–65B | Chinchilla-optimal training |
| Mistral 7B (2023) | Mistral AI | 7B | Grouped-query attention, sliding window |
| LLaMA 2 (2023) | Meta | 7B–70B | Extended context, RLHF |
| Mixtral (2023) | Mistral AI | 8×7B MoE | Sparse MoE, top-2 routing |
| LLaMA 3 (2024) | Meta | 8B–405B | 15T tokens, massive scale |
| DeepSeek (2024–25) | DeepSeek | 7B–671B MoE | MoE + efficient training |

LLaMA was particularly influential: by training smaller models for longer on more data (following Chinchilla’s insight), Meta showed that a 13B model could match GPT-3’s performance. This democratized LLM research.

Key Architectural Improvements in Modern LLMs

Grouped-Query Attention (GQA)

Standard multi-head attention (MHA) stores separate K and V projections for each head, which makes the KV cache memory-intensive during inference. Multi-query attention (MQA) goes to the other extreme and shares a single K and V across all heads. GQA interpolates between the two: it shares K and V across groups of query heads.

GQA achieves near-MHA quality with near-MQA speed. It is used in LLaMA 2, Mistral, and many other modern LLMs.
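The core of GQA is just a broadcast of KV heads across query-head groups. A minimal sketch (shapes and names are illustrative):

```python
import torch

def expand_kv(kv: torch.Tensor, n_query_heads: int) -> torch.Tensor:
    """Share each KV head across a group of query heads.

    kv: (batch, n_kv_heads, seq, head_dim)
    Returns (batch, n_query_heads, seq, head_dim): each KV head is
    repeated n_query_heads // n_kv_heads times so standard attention
    code can consume it, while the cache stores only n_kv_heads heads.
    """
    n_kv_heads = kv.shape[1]
    group = n_query_heads // n_kv_heads
    return kv.repeat_interleave(group, dim=1)

kv = torch.randn(2, 4, 16, 64)       # cache holds only 4 KV heads
k = expand_kv(kv, n_query_heads=32)  # expanded for 32 query heads
```

Here the KV cache is 8× smaller than full MHA would require, which is where the inference speedup comes from.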

Rotary Position Embeddings (RoPE)

Su et al. (2021) introduced RoPE, which encodes position information by rotating query and key vectors:

\[\text{RoPE}(x_m, m) = x_m e^{im\theta}\]

RoPE naturally captures relative positions (the attention between positions $m$ and $n$ depends only on $m - n$) and generalizes better to unseen sequence lengths than absolute position embeddings.

SwiGLU Activation

As discussed in Chapter 2, SwiGLU combines the SiLU activation with a gated linear unit ($\otimes$ denotes element-wise multiplication):

\[\text{SwiGLU}(x) = \text{SiLU}(xW_1) \otimes (xW_2)\]

Used in LLaMA, PaLM, and most modern LLMs, SwiGLU consistently outperforms standard FFN layers.
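In practice the gated product above is followed by a down-projection, as in LLaMA's FFN. A minimal sketch (the 2/3 hidden-size convention is a common choice to match a standard FFN's parameter count, and names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """SwiGLU feed-forward block: W3 @ (SiLU(x W1) * (x W2))."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden, bias=False)  # gate projection
        self.w2 = nn.Linear(d_model, d_hidden, bias=False)  # value projection
        self.w3 = nn.Linear(d_hidden, d_model, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w3(F.silu(self.w1(x)) * self.w2(x))

ffn = SwiGLUFFN(d_model=64, d_hidden=172)   # d_hidden ≈ 2/3 of 4 * d_model
y = ffn(torch.randn(2, 10, 64))
```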

RMSNorm

A simplified version of Layer Normalization that removes the mean-centering step:

\[\text{RMSNorm}(x) = \frac{x}{\text{RMS}(x)} \cdot \gamma, \quad \text{RMS}(x) = \sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}\]

Slightly faster than LayerNorm with no performance loss. Standard in LLaMA.
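The formula above translates almost directly into code; this is a minimal sketch rather than a production implementation:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm: scale by the root-mean-square, with no mean subtraction."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(dim))  # learned per-dimension gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (x / rms) * self.gamma

torch.manual_seed(0)
norm = RMSNorm(8)
out = norm(torch.randn(4, 8))
```

With the gain at its initial value of 1, each output row has a root-mean-square of (almost exactly) 1.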

The Modern LLM Recipe

Combining all advances, a modern LLM typically uses:

| Component | Choice |
|---|---|
| Architecture | Decoder-only Transformer |
| Attention | Grouped-query attention |
| Position | RoPE |
| FFN | SwiGLU |
| Normalization | RMSNorm (pre-norm) |
| Activation | SiLU (in SwiGLU) |
| Training objective | Autoregressive next-token prediction |
| Data | Trillions of tokens from diverse sources |
| Alignment | RLHF or DPO |
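The recipe can be assembled into a minimal pre-norm decoder block. This is a simplified sketch: PyTorch's standard `nn.MultiheadAttention` stands in for GQA, RoPE is omitted, and all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    """Pre-norm decoder block: RMSNorm -> causal attention -> RMSNorm -> SwiGLU FFN,
    with residual connections around both sub-layers."""
    def __init__(self, d_model: int, n_heads: int, d_hidden: int):
        super().__init__()
        self.g1 = nn.Parameter(torch.ones(d_model))  # RMSNorm gain (attention)
        self.g2 = nn.Parameter(torch.ones(d_model))  # RMSNorm gain (FFN)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.w1 = nn.Linear(d_model, d_hidden, bias=False)  # SwiGLU gate
        self.w2 = nn.Linear(d_model, d_hidden, bias=False)  # SwiGLU value
        self.w3 = nn.Linear(d_hidden, d_model, bias=False)  # SwiGLU output

    @staticmethod
    def rmsnorm(x, g, eps=1e-6):
        return x / torch.sqrt(x.pow(2).mean(-1, keepdim=True) + eps) * g

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-norm causal self-attention (True in the mask = position not attended)
        seq = x.shape[1]
        mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
        h = self.rmsnorm(x, self.g1)
        a, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + a
        # Pre-norm SwiGLU feed-forward
        h = self.rmsnorm(x, self.g2)
        return x + self.w3(F.silu(self.w1(h)) * self.w2(h))

torch.manual_seed(0)
block = DecoderBlock(d_model=32, n_heads=4, d_hidden=86)
y = block(torch.randn(2, 5, 32))
```

A full model stacks dozens of such blocks between a token embedding and an output projection.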

Code Example: Scaled Dot-Product with RoPE

# See code/ch12_llm.py for full implementation
import torch

def apply_rotary_pos_emb(x, cos, sin):
    """Apply rotary position embeddings to queries or keys.

    x:        (..., head_dim) with an even head_dim
    cos, sin: precomputed tables of cos(m * theta_i) and sin(m * theta_i),
              broadcastable to the halves below, e.g. (seq_len, head_dim // 2)
    """
    x1, x2 = x[..., ::2], x[..., 1::2]  # split dimensions into rotation pairs
    # Rotate each (x1, x2) pair by its position- and frequency-dependent angle
    return torch.cat([
        x1 * cos - x2 * sin,
        x1 * sin + x2 * cos,
    ], dim=-1)

Key Takeaways


Previous Chapter: Self-Supervised Learning

Next Chapter: Diffusion Models — A New Paradigm for Generation
