For decades, the deep learning community focused on architectural innovations — better layers, better training tricks, better losses. But starting around 2020, a startling empirical finding changed the conversation: simply making models bigger, training them longer, and feeding them more data produces predictably better performance.
This wasn’t just “more is better” in a vague sense. The improvements followed precise mathematical laws.
Kaplan et al. from OpenAI published “Scaling Laws for Neural Language Models,” showing that the cross-entropy loss of a language model follows a power law with respect to three factors:
\[L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}\]where $N$ is the number of model parameters, $D$ is the dataset size in tokens, $C$ is the training compute, $N_c$, $D_c$, $C_c$ are fitted constants, and $\alpha_N$, $\alpha_D$, $\alpha_C$ are empirically measured exponents (roughly 0.076, 0.095, and 0.050, respectively).
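To build intuition for what a power law implies, here is a small sketch using roughly the $L(N)$ constants Kaplan et al. report ($N_c \approx 8.8 \times 10^{13}$, $\alpha_N \approx 0.076$): every 10× increase in parameters multiplies the loss by the same factor, $10^{-0.076} \approx 0.84$.

```python
def loss_from_params(n_params, n_c=8.8e13, alpha_n=0.076):
    """Power-law loss in parameter count (constants roughly as in Kaplan et al.)."""
    return (n_c / n_params) ** alpha_n

# Each 10x in parameters cuts the loss by the same ~16% factor:
for n in (1e8, 1e9, 1e10):
    print(f"N={n:.0e}  L(N)={loss_from_params(n):.3f}")
```

The smooth, predictable shape of these curves is what made confident extrapolation to ever-larger training runs possible.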
Hoffmann et al. from DeepMind refined these scaling laws and found that models trained under Kaplan et al.'s prescriptions were undertrained: for a fixed compute budget, they used too many parameters relative to the amount of training data.
Chinchilla’s rule: For compute-optimal training, the number of tokens should scale proportionally to the number of parameters:
\[D_{\text{optimal}} \approx 20 \times N\]A 70B parameter model should see approximately 1.4 trillion tokens. Many existing models (including GPT-3 and PaLM) were trained with too few tokens relative to their size.
This insight led to Chinchilla (70B parameters, 1.4T tokens), which outperformed Gopher (280B parameters, 300B tokens) while being 4× smaller. It also influenced LLaMA’s training strategy.
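The 20-tokens-per-parameter rule turns compute budgeting into simple arithmetic. A sketch, using the standard approximation that training takes about $6ND$ FLOPs:

```python
def chinchilla_tokens(n_params):
    """Compute-optimal training tokens under the ~20 tokens/parameter rule."""
    return 20 * n_params

def train_flops(n_params, n_tokens):
    """Standard approximation: ~6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens

n = 70e9                  # a 70B-parameter model, as in the text
d = chinchilla_tokens(n)  # ~1.4 trillion tokens
print(f"tokens: {d:.2e}, FLOPs: {train_flops(n, d):.2e}")
# tokens: 1.40e+12, FLOPs: 5.88e+23
```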
A separate discovery reshaped how models are used: in-context learning. Given a prompt containing a few examples, GPT-3 completes the pattern without any parameter updates:

```
Translate English to French:
sea otter => loutre de mer
plush toy => jouet en peluche
cheese =>
```

GPT-3 produces: “fromage”
| Setting | Description | GPT-3 Performance |
|---|---|---|
| Zero-shot | Task description only | Good |
| One-shot | One example | Better |
| Few-shot | 5–100 examples | Often competitive with fine-tuned models |
This was profound: GPT-3 could perform tasks it was never explicitly trained for, simply from examples in the prompt. No gradient updates, no task-specific data.
GPT-3 was powerful but uncontrolled: it would follow harmful instructions, produce toxic content, or confidently make up facts. OpenAI addressed this through Reinforcement Learning from Human Feedback (RLHF), a three-stage process:

1. **Supervised fine-tuning:** fine-tune the base model on human-written demonstrations of good behavior.
2. **Reward modeling:** train a reward model to predict which of several candidate outputs humans prefer.
3. **Reinforcement learning:** optimize the model against the reward model with PPO, with a KL penalty keeping it close to the supervised model.
The result was InstructGPT (and later ChatGPT), which was more helpful, less harmful, and better at following instructions — despite having fewer parameters than GPT-3.
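The RLHF policy-optimization stage is often written as maximizing the learned reward minus a KL penalty that keeps the tuned model $\pi_\theta$ close to the supervised starting point $\pi_{\text{SFT}}$ (the form used in the InstructGPT paper; $\beta$ controls the penalty strength):

\[\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \left[ r_\phi(x, y) \right] - \beta \, \mathrm{KL}\!\left( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{SFT}}(\cdot \mid x) \right)\]

Without the KL term, the policy would drift toward degenerate outputs that exploit weaknesses in the reward model.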
GPT-4 was a massive leap: it accepted image as well as text input, scored in the top 10% on a simulated bar exam, and substantially outperformed GPT-3.5 across benchmarks, though OpenAI disclosed little about its architecture or training.
Wei et al. (2022) documented emergent abilities: capabilities such as multi-digit arithmetic, word unscrambling, and multi-step reasoning that appear suddenly at certain model scales.
These abilities don’t improve gradually — they appear to jump from near-zero to competent at critical scale thresholds. (Though later work debated whether this is an artifact of evaluation metrics.)
A 175B parameter model requires enormous compute for every token. But does every token really need all 175B parameters? A question about physics shouldn’t activate the same parameters as one about cooking.
A Mixture of Experts model has many “expert” sub-networks but only activates a subset for each input:
\[y = \sum_{i=1}^{E} g_i(x) \cdot \text{Expert}_i(x)\]where $g(x)$ is a gating function (a router) that selects which experts to activate:
\[g(x) = \text{TopK}(\text{softmax}(W_g x))\]Fedus et al. simplified MoE by routing each token to exactly one expert (top-1 routing). With 1.6 trillion parameters but only activating ~100B per token, the Switch Transformer achieved better performance than dense models at lower computational cost.
Mistral AI’s Mixtral 8x7B used 8 expert networks of ~7B parameters each, with top-2 routing. Despite having 47B total parameters, each token only uses ~13B, making it faster than a dense 47B model while performing comparably to GPT-3.5.
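A minimal sketch of the routing described above, with top-2 gating as in Mixtral (the layer sizes here are illustrative, not Mixtral's): each token's gate probabilities are computed, only the two highest-scoring experts run on that token, and their outputs are mixed with renormalized gate weights.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Sketch of a sparse MoE layer with top-k routing (illustrative sizes)."""
    def __init__(self, d_model=64, d_ff=128, n_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)
        self.k = k

    def forward(self, x):                       # x: (tokens, d_model)
        probs = self.router(x).softmax(dim=-1)  # (tokens, n_experts)
        topk_p, topk_i = probs.topk(self.k, dim=-1)
        topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True)  # renormalize gates
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_i[:, slot] == e     # tokens routed to expert e
                if mask.any():
                    out[mask] += topk_p[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = TopKMoE()
y = moe(torch.randn(10, 64))
print(y.shape)  # torch.Size([10, 64])
```

Only 2 of the 8 expert FFNs run per token, so the compute per token is roughly a quarter of what the total parameter count would suggest.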
A major challenge: the router might send most tokens to the same few experts, leaving others unused. Common solutions include:

- **Auxiliary load-balancing loss:** a term added to the training loss that encourages tokens to be spread uniformly across experts (used in the Switch Transformer).
- **Capacity factor:** a cap on how many tokens each expert may process per batch; overflow tokens are dropped or passed through the residual connection.
- **Router noise:** adding noise to the routing logits during training so that underused experts still receive some traffic.
The LLM landscape evolved from closed-source dominance to a vibrant open-source ecosystem:
| Model | Organization | Parameters | Key Innovation |
|---|---|---|---|
| LLaMA (2023) | Meta | 7B–65B | Chinchilla-optimal training |
| Mistral 7B (2023) | Mistral AI | 7B | Grouped-query attention, sliding window |
| LLaMA 2 (2023) | Meta | 7B–70B | Extended context, RLHF |
| Mixtral (2023) | Mistral AI | 8×7B MoE | Sparse MoE, top-2 routing |
| LLaMA 3 (2024) | Meta | 8B–405B | 15T tokens, massive scale |
| DeepSeek (2024–25) | DeepSeek | 7B–671B MoE | MoE + efficient training |
LLaMA was particularly influential: by training smaller models for longer on more data (following Chinchilla’s insight), Meta showed that a 13B model could match GPT-3’s performance. This democratized LLM research.
Standard multi-head attention (MHA) stores separate K and V projections for each head, which makes the KV cache memory-intensive during inference. Multi-query attention (MQA) goes to the other extreme, sharing a single K and V across all query heads, which is fast but can hurt quality. GQA interpolates between the two, sharing K and V across groups of query heads.

GQA achieves near-MHA quality with near-MQA speed. It is used in LLaMA 2, Mistral, and many modern LLMs.
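A sketch of the core trick (shapes are illustrative): the KV cache stores only `n_kv_heads` heads, and each is broadcast to its group of query heads before attention scores are computed.

```python
import torch

def gqa_scores(q, k, n_kv_heads):
    """Attention logits with grouped-query attention: each group of query
    heads shares one KV head (sketch; no masking or softmax)."""
    n_heads, seq, d = q.shape
    group = n_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=0)  # expand KV heads to match Q heads
    return (q @ k.transpose(-2, -1)) / d ** 0.5

q = torch.randn(8, 16, 32)   # 8 query heads
k = torch.randn(2, 16, 32)   # only 2 KV heads stored in the cache
print(gqa_scores(q, k, n_kv_heads=2).shape)  # torch.Size([8, 16, 16])
```

Here the KV cache is 4× smaller than with MHA, while each query head still attends normally.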
Su et al. (2021) introduced RoPE, which encodes position information by rotating query and key vectors:
\[\text{RoPE}(x_m, m) = x_m e^{im\theta}\]RoPE naturally captures relative positions (the attention between positions $m$ and $n$ depends only on $m - n$) and generalizes better to unseen sequence lengths than absolute position embeddings.
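A quick numerical check of the relative-position claim, using a self-contained sketch of the rotation (the real-valued pairwise form of the complex formula above): the score between a query at position $m$ and a key at position $n$ is unchanged when both positions shift by the same amount.

```python
import torch

def rope(x, pos, base=10000.0):
    """Rotate even/odd feature pairs of x by position-dependent angles (sketch)."""
    half = x.shape[-1] // 2
    theta = base ** (-torch.arange(half, dtype=torch.float32) / half)
    cos, sin = (pos * theta).cos(), (pos * theta).sin()
    x1, x2 = x[..., ::2], x[..., 1::2]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q, k = torch.randn(8), torch.randn(8)
s1 = rope(q, 3) @ rope(k, 1)    # positions (3, 1): offset 2
s2 = rope(q, 10) @ rope(k, 8)   # positions (10, 8): same offset
print(torch.allclose(s1, s2, atol=1e-5))  # True
```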
As discussed in Chapter 2, SwiGLU combines the SiLU activation with a gated linear unit:
\[\text{SwiGLU}(x) = \text{SiLU}(xW_1) \odot (xW_2)\]Used in LLaMA, PaLM, and most modern LLMs, SwiGLU consistently outperforms standard FFN layers.
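A sketch of a LLaMA-style SwiGLU feed-forward block (sizes are illustrative; real models choose the hidden width to keep the parameter count comparable to a standard FFN):

```python
import torch
import torch.nn as nn

class SwiGLUFFN(nn.Module):
    """Sketch of a SwiGLU feed-forward block with no biases (LLaMA-style)."""
    def __init__(self, d_model=64, d_ff=128):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.w2 = nn.Linear(d_model, d_ff, bias=False)  # value projection
        self.w3 = nn.Linear(d_ff, d_model, bias=False)  # output projection

    def forward(self, x):
        return self.w3(torch.nn.functional.silu(self.w1(x)) * self.w2(x))

ffn = SwiGLUFFN()
print(ffn(torch.randn(4, 64)).shape)  # torch.Size([4, 64])
```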
A simplified version of Layer Normalization that removes the mean-centering step:
\[\text{RMSNorm}(x) = \frac{x}{\text{RMS}(x)} \cdot \gamma\]where $\text{RMS}(x) = \sqrt{\tfrac{1}{d} \sum_{i=1}^{d} x_i^2}$ and $\gamma$ is a learned scale. Slightly faster than LayerNorm with no performance loss. Standard in LLaMA.
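The whole operation is a few lines; a sketch with a small epsilon added for numerical stability:

```python
import torch

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm sketch: scale by the root-mean-square over the last dim,
    with no mean subtraction (unlike LayerNorm)."""
    rms = x.pow(2).mean(dim=-1, keepdim=True).add(eps).sqrt()
    return x / rms * gamma

x = torch.randn(4, 64)
out = rms_norm(x, torch.ones(64))
print(out.shape)  # torch.Size([4, 64])
```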
Combining all advances, a modern LLM typically uses:
| Component | Choice |
|---|---|
| Architecture | Decoder-only Transformer |
| Attention | Grouped-query attention |
| Position | RoPE |
| FFN | SwiGLU |
| Normalization | RMSNorm (pre-norm) |
| Activation | SiLU (in SwiGLU) |
| Training objective | Autoregressive next-token prediction |
| Data | Trillions of tokens from diverse sources |
| Alignment | RLHF or DPO |
```python
# See code/ch12_llm.py for full implementation
import torch

def apply_rotary_pos_emb(x, cos, sin):
    """Apply rotary position embeddings to queries or keys.

    Even/odd feature pairs (x[..., 2i], x[..., 2i+1]) are rotated by the
    position-dependent angles in cos/sin. The rotated halves are concatenated
    rather than re-interleaved, which is consistent as long as queries and
    keys both use this convention.
    """
    x1, x2 = x[..., ::2], x[..., 1::2]  # even and odd feature channels
    return torch.cat([
        x1 * cos - x2 * sin,
        x1 * sin + x2 * cos,
    ], dim=-1)
```
Previous Chapter: Self-Supervised Learning
Next Chapter: Diffusion Models — A New Paradigm for Generation