Chapter 12: Scaling Laws and Large Language Models

The Surprising Power of Scale

For decades, the deep learning community focused on architectural innovations — better layers, better training tricks, better losses. But starting around 2020, a startling empirical finding changed the conversation: simply making models bigger, training them longer, and feeding them more data produces predictably better performance.

This wasn’t just “more is better” in a vague sense. The improvements followed precise mathematical laws.

Scaling Laws (2020)

Kaplan et al. from OpenAI published “Scaling Laws for Neural Language Models,” showing that the cross-entropy loss of a language model follows a power law with respect to three factors:

\[L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}\]

where $N$ is the number of model parameters (excluding embeddings), $D$ is the dataset size in tokens, $C$ is the training compute, $N_c$, $D_c$, $C_c$ are fitted constants, and the exponents are small ($\alpha_N \approx 0.076$, $\alpha_D \approx 0.095$, $\alpha_C \approx 0.05$ in the original paper).

Key Findings

  1. Performance scales smoothly with model size, data, and compute — no diminishing returns within the tested range
  2. Model size matters most: For a fixed compute budget, it’s better to train a larger model for fewer steps than a smaller model for more steps
  3. Architecture details matter less: Width, depth, and head count have secondary effects compared to total parameter count
  4. The relationship is predictable: You can forecast the performance of a model you haven’t trained yet

Chinchilla Scaling Laws (2022)

Hoffmann et al. from DeepMind refined these scaling laws and found that models trained under Kaplan et al.'s recipe were undertrained: they used too many parameters relative to the amount of data.

Chinchilla’s rule: For compute-optimal training, the number of tokens should scale proportionally to the number of parameters:

\[D_{\text{optimal}} \approx 20 \times N\]

A 70B parameter model should see approximately 1.4 trillion tokens. Many existing models (including GPT-3 and PaLM) were trained with too few tokens relative to their size.

This insight led to Chinchilla (70B parameters, 1.4T tokens), which outperformed Gopher (280B parameters, 300B tokens) while being 4× smaller. It also influenced LLaMA’s training strategy.
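The arithmetic behind these claims can be sketched in a few lines. The 20-tokens-per-parameter ratio is an approximation of Hoffmann et al.'s fitted result, and the $C \approx 6ND$ FLOPs estimate is a common rule of thumb, not something derived in this chapter; function names are illustrative.

```python
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Compute-optimal token budget under the approximate rule D ≈ 20 * N."""
    return tokens_per_param * n_params

def training_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb training cost: C ≈ 6 * N * D FLOPs."""
    return 6.0 * n_params * n_tokens

n = 70e9                          # Chinchilla: 70B parameters
d = chinchilla_optimal_tokens(n)  # 1.4e12 tokens, matching the 1.4T figure above
c = training_flops(n, d)
```

Plugging in Gopher's 280B parameters instead shows it "should" have seen 5.6T tokens rather than the 300B it was trained on.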

The GPT Series: A Case Study in Scaling

GPT-1 (2018)

The original GPT (117M parameters) established the recipe: generative pre-training on unlabeled text, followed by supervised fine-tuning for each downstream task.

GPT-2 (2019)

GPT-2 scaled this to 1.5B parameters and showed that a large enough language model can perform many tasks zero-shot, with no fine-tuning at all.

GPT-3 (2020)

GPT-3 scaled to 175B parameters and demonstrated in-context learning: given a task description and a few examples directly in the prompt, it performs the task. For example:

Translate English to French:
sea otter => loutre de mer
plush toy => jouet en peluche
cheese => 

GPT-3 produces: “fromage”

Few-Shot, One-Shot, Zero-Shot

| Setting | Description | GPT-3 Performance |
|---|---|---|
| Zero-shot | Task description only | Good |
| One-shot | One example | Better |
| Few-shot | 5–100 examples | Often competitive with fine-tuned models |

This was profound: GPT-3 could perform tasks it was never explicitly trained for, simply from examples in the prompt. No gradient updates, no task-specific data.

InstructGPT and ChatGPT (2022)

GPT-3 was powerful but uncontrolled — it would follow harmful instructions, produce toxic content, or confidently make up facts. OpenAI addressed this through Reinforcement Learning from Human Feedback (RLHF):

  1. Supervised fine-tuning: Train on human-written demonstrations of helpful responses
  2. Reward model training: Train a model to predict which response humans prefer
  3. PPO optimization: Use the reward model to optimize the language model with Proximal Policy Optimization

The result was InstructGPT (and later ChatGPT), which was more helpful, less harmful, and better at following instructions — despite having fewer parameters than GPT-3.
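Step 2 of the pipeline above trains a reward model on pairwise human preferences. A minimal sketch of the standard Bradley–Terry-style preference loss follows; the function name and toy data are illustrative, not from the chapter.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).

    r_chosen / r_rejected are scalar reward-model scores for the
    human-preferred and dispreferred response in each comparison pair.
    The loss pushes the model to score the preferred response higher.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy batch of 3 preference pairs
chosen = torch.tensor([1.2, 0.5, 2.0])
rejected = torch.tensor([0.3, 0.4, -1.0])
loss = reward_model_loss(chosen, rejected)
```

The loss shrinks as the margin between preferred and dispreferred scores grows, which is exactly the signal PPO then optimizes against in step 3.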

GPT-4 (2023)

GPT-4 was a massive leap: it accepts image as well as text input, scores dramatically higher on professional and academic benchmarks (including passing a simulated bar exam), and shows far stronger multi-step reasoning. OpenAI disclosed neither its parameter count nor its architecture.

Emergent Abilities

Wei et al. (2022) documented emergent abilities: capabilities such as multi-digit arithmetic, word unscrambling, and multi-step reasoning that appear suddenly once models reach a certain scale.

These abilities don’t improve gradually — they appear to jump from near-zero to competent at critical scale thresholds. (Though later work debated whether this is an artifact of evaluation metrics.)

Mixture of Experts (MoE)

The Efficiency Problem

A 175B parameter model requires enormous compute for every token. But does every token really need all 175B parameters? A question about physics shouldn’t activate the same parameters as one about cooking.

The Solution: Sparse Activation

A Mixture of Experts model has many “expert” sub-networks but only activates a subset for each input:

\[y = \sum_{i=1}^{E} g_i(x) \cdot \text{Expert}_i(x)\]

where $g(x)$ is a gating function (a router) that selects which experts to activate:

\[g(x) = \text{TopK}(\text{softmax}(W_g x))\]
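The two equations above can be combined into a minimal dense-loop sketch of top-k routing. Real implementations use batched expert dispatch and capacity limits; all names here are illustrative.

```python
import torch
import torch.nn.functional as F

def moe_forward(x, experts, w_gate, k=2):
    """Route each token to its top-k experts and mix their outputs.

    x:       (batch, d_model) token representations
    experts: list of callables, each mapping (n, d_model) -> (n, d_model)
    w_gate:  (d_model, n_experts) router weight matrix
    """
    probs = F.softmax(x @ w_gate, dim=-1)          # (batch, n_experts)
    topk_probs, topk_idx = probs.topk(k, dim=-1)   # keep the k best experts per token
    topk_probs = topk_probs / topk_probs.sum(-1, keepdim=True)  # renormalize gates

    out = torch.zeros_like(x)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = topk_idx[:, slot] == e          # tokens whose slot-th choice is expert e
            if mask.any():
                out[mask] += topk_probs[mask, slot, None] * expert(x[mask])
    return out

torch.manual_seed(0)
experts = [torch.nn.Linear(16, 16) for _ in range(4)]
x = torch.randn(8, 16)
w_gate = torch.randn(16, 4)
y = moe_forward(x, experts, w_gate, k=2)
```

With k=2 and 4 experts, each token's forward pass touches only half the expert parameters, which is the source of MoE's compute savings.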

Switch Transformer (2021)

Fedus et al. simplified MoE by routing each token to exactly one expert (top-1 routing). With up to 1.6 trillion total parameters but only a small fraction active for any given token, the Switch Transformer achieved better performance than comparable dense models at lower computational cost per token.

Mixtral (2023)

Mistral AI’s Mixtral 8x7B used 8 expert networks of ~7B parameters each, with top-2 routing. Despite having 47B total parameters, each token only uses ~13B, making it faster than a dense 47B model while performing comparably to GPT-3.5.

Load Balancing in MoE

A major challenge: the router may send most tokens to the same few experts, leaving the rest untrained. Common solutions:

  1. Auxiliary load-balancing loss: Penalize the router when tokens are distributed unevenly across experts
  2. Expert capacity limits: Cap how many tokens each expert processes per batch, dropping or bypassing the overflow
  3. Router noise: Add noise to the gating logits during training to encourage exploration
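One widely used fix is the Switch Transformer's auxiliary load-balancing loss, sketched below in simplified form (variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor) -> torch.Tensor:
    """Switch-style auxiliary loss: n_experts * sum_i f_i * P_i.

    f_i = fraction of tokens whose top-1 choice is expert i
    P_i = mean router probability assigned to expert i
    The product is minimized (value ~1) when both are uniform,
    and grows toward n_experts as routing collapses onto one expert.
    """
    n_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)             # (tokens, n_experts)
    top1 = probs.argmax(dim=-1)                          # hard assignment per token
    f = F.one_hot(top1, n_experts).float().mean(dim=0)   # fraction routed to each expert
    p = probs.mean(dim=0)                                # mean gate probability per expert
    return n_experts * torch.sum(f * p)
```

Adding a small multiple of this term to the training loss nudges the router toward using all experts roughly equally.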

Open-Source LLMs

The LLM landscape evolved from closed-source dominance to a vibrant open-source ecosystem:

| Model | Organization | Parameters | Key Innovation |
|---|---|---|---|
| LLaMA (2023) | Meta | 7B–65B | Chinchilla-optimal training |
| Mistral 7B (2023) | Mistral AI | 7B | Grouped-query attention, sliding window |
| LLaMA 2 (2023) | Meta | 7B–70B | Extended context, RLHF |
| Mixtral (2023) | Mistral AI | 8×7B MoE | Sparse MoE, top-2 routing |
| LLaMA 3 (2024) | Meta | 8B–405B | 15T tokens, massive scale |
| DeepSeek (2024–25) | DeepSeek | 7B–671B MoE | MoE + efficient training |

LLaMA was particularly influential: by training smaller models for longer on more data (following Chinchilla’s insight), Meta showed that a 13B model could match GPT-3’s performance. This democratized LLM research.

Key Architectural Improvements in Modern LLMs

Grouped-Query Attention (GQA)

Standard multi-head attention (MHA) stores separate K and V projections for each head, which makes the KV cache memory-intensive during inference. Multi-query attention (MQA) goes to the other extreme and shares a single K and V across all heads. GQA interpolates between the two: it shares K and V across groups of query heads.

GQA achieves near-MHA quality with near-MQA speed. It is used in LLaMA 2, Mistral, and many other modern LLMs.
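The core of GQA is just a broadcast of KV heads across query-head groups. A minimal sketch (shapes and names are illustrative):

```python
import torch

def expand_kv(kv: torch.Tensor, n_query_heads: int) -> torch.Tensor:
    """Share each KV head across a group of query heads.

    kv: (batch, n_kv_heads, seq, head_dim)
    Returns (batch, n_query_heads, seq, head_dim): each KV head is
    repeated n_query_heads // n_kv_heads times so standard attention
    code can consume it, while the cache stores only n_kv_heads heads.
    """
    n_kv_heads = kv.shape[1]
    group = n_query_heads // n_kv_heads
    return kv.repeat_interleave(group, dim=1)

kv = torch.randn(2, 4, 16, 64)       # cache holds only 4 KV heads
k = expand_kv(kv, n_query_heads=32)  # expanded for 32 query heads
```

Here the KV cache is 8× smaller than full MHA would require, which is where the inference speedup comes from.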

Rotary Position Embeddings (RoPE)

Su et al. (2021) introduced RoPE, which encodes position information by rotating query and key vectors:

\[\text{RoPE}(x_m, m) = x_m e^{im\theta}\]

RoPE naturally captures relative positions (the attention between positions $m$ and $n$ depends only on $m - n$) and generalizes better to unseen sequence lengths than absolute position embeddings.

SwiGLU Activation

As discussed in Chapter 2, SwiGLU combines the SiLU activation with a gated linear unit ($\otimes$ denotes element-wise multiplication):

\[\text{SwiGLU}(x) = \text{SiLU}(xW_1) \otimes (xW_2)\]

Used in LLaMA, PaLM, and most modern LLMs, SwiGLU consistently outperforms standard FFN layers.
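In practice the gated product above is followed by a down-projection, as in LLaMA's FFN. A minimal sketch (the 2/3 hidden-size convention is a common choice to match a standard FFN's parameter count, and names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """SwiGLU feed-forward block: W3 @ (SiLU(x W1) * (x W2))."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden, bias=False)  # gate projection
        self.w2 = nn.Linear(d_model, d_hidden, bias=False)  # value projection
        self.w3 = nn.Linear(d_hidden, d_model, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w3(F.silu(self.w1(x)) * self.w2(x))

ffn = SwiGLUFFN(d_model=64, d_hidden=172)   # d_hidden ≈ 2/3 of 4 * d_model
y = ffn(torch.randn(2, 10, 64))
```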

RMSNorm

A simplified version of Layer Normalization that removes the mean-centering step:

\[\text{RMSNorm}(x) = \frac{x}{\text{RMS}(x)} \cdot \gamma, \quad \text{RMS}(x) = \sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}\]

Slightly faster than LayerNorm with no performance loss. Standard in LLaMA.
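The formula above translates almost directly into code; this is a minimal sketch rather than a production implementation:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm: scale by the root-mean-square, with no mean subtraction."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(dim))  # learned per-dimension gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (x / rms) * self.gamma

torch.manual_seed(0)
norm = RMSNorm(8)
out = norm(torch.randn(4, 8))
```

With the gain at its initial value of 1, each output row has a root-mean-square of (almost exactly) 1.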

The Modern LLM Recipe

Combining all advances, a modern LLM typically uses:

| Component | Choice |
|---|---|
| Architecture | Decoder-only Transformer |
| Attention | Grouped-query attention |
| Position | RoPE |
| FFN | SwiGLU |
| Normalization | RMSNorm (pre-norm) |
| Activation | SiLU (in SwiGLU) |
| Training objective | Autoregressive next-token prediction |
| Data | Trillions of tokens from diverse sources |
| Alignment | RLHF or DPO |
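The recipe can be assembled into a minimal pre-norm decoder block. This is a simplified sketch: PyTorch's standard `nn.MultiheadAttention` stands in for GQA, RoPE is omitted, and all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    """Pre-norm decoder block: RMSNorm -> causal attention -> RMSNorm -> SwiGLU FFN,
    with residual connections around both sub-layers."""
    def __init__(self, d_model: int, n_heads: int, d_hidden: int):
        super().__init__()
        self.g1 = nn.Parameter(torch.ones(d_model))  # RMSNorm gain (attention)
        self.g2 = nn.Parameter(torch.ones(d_model))  # RMSNorm gain (FFN)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.w1 = nn.Linear(d_model, d_hidden, bias=False)  # SwiGLU gate
        self.w2 = nn.Linear(d_model, d_hidden, bias=False)  # SwiGLU value
        self.w3 = nn.Linear(d_hidden, d_model, bias=False)  # SwiGLU output

    @staticmethod
    def rmsnorm(x, g, eps=1e-6):
        return x / torch.sqrt(x.pow(2).mean(-1, keepdim=True) + eps) * g

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-norm causal self-attention (True in the mask = position not attended)
        seq = x.shape[1]
        mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
        h = self.rmsnorm(x, self.g1)
        a, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + a
        # Pre-norm SwiGLU feed-forward
        h = self.rmsnorm(x, self.g2)
        return x + self.w3(F.silu(self.w1(h)) * self.w2(h))

torch.manual_seed(0)
block = DecoderBlock(d_model=32, n_heads=4, d_hidden=86)
y = block(torch.randn(2, 5, 32))
```

A full model stacks dozens of such blocks between a token embedding and an output projection.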

Code Example: Scaled Dot-Product with RoPE

# See code/ch12_llm.py for full implementation
import torch

def apply_rotary_pos_emb(x, cos, sin):
    """Apply rotary position embeddings to queries or keys.

    x:        (..., head_dim) with an even head_dim
    cos, sin: precomputed tables of cos(m * theta_i) and sin(m * theta_i),
              broadcastable to the halves below, e.g. (seq_len, head_dim // 2)
    """
    x1, x2 = x[..., ::2], x[..., 1::2]  # split dimensions into rotation pairs
    # Rotate each (x1, x2) pair by its position- and frequency-dependent angle
    return torch.cat([
        x1 * cos - x2 * sin,
        x1 * sin + x2 * cos,
    ], dim=-1)

Key Takeaways


Previous Chapter: Self-Supervised Learning

Next Chapter: Diffusion Models — A New Paradigm for Generation
