Before 2018, the standard approach for any new task was to collect a large labeled dataset, design a task-specific architecture, and train it from scratch.
This approach required enormous labeled datasets, significant computational resources, and deep expertise in neural network design. It was wasteful: every model started from zero, learning basic visual features or language patterns from scratch.
Lower layers of neural networks learn general features — edges, textures, and shapes for vision; syntax, word relationships, and grammar for language. These features are useful across many tasks. Only the higher layers need to be task-specific.
Why not train once on a large dataset, learn these general features, and then transfer them to new tasks?
The practice of transfer learning took off in computer vision: pretrain a convolutional network on ImageNet, replace the final classification layer with a new head for the target task, and fine-tune on the (much smaller) target dataset.
This approach worked astonishingly well:
| Approach | Data Needed | Performance |
|---|---|---|
| Train from scratch | 100,000+ images | Moderate |
| Fine-tune ImageNet model | 1,000–10,000 images | Often better |
| Fine-tune, freeze lower layers | 100–1,000 images | Surprisingly good |
- **Feature extraction:** Freeze all pretrained layers and train only the new head. The pretrained network is used as a fixed feature extractor.
- **Fine-tuning:** Unfreeze some or all pretrained layers and train with a small learning rate. This allows the pretrained features to adapt to the new domain.
- **Gradual unfreezing:** Start by training only the head, then progressively unfreeze deeper layers. This prevents catastrophic forgetting of pretrained features.
Before Transformers, the first form of language transfer was through word embeddings:
Mikolov et al. trained shallow neural networks on massive text corpora to learn dense vector representations of words, using two training strategies: continuous bag-of-words (CBOW), which predicts a word from its surrounding context, and skip-gram, which predicts the surrounding context words from the center word.
The resulting vectors captured semantic relationships:
\[\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}\]

Pennington et al.’s GloVe learned word vectors from global co-occurrence statistics. The objective ensures that the dot product of two word vectors is proportional to the log probability of their co-occurrence.
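The skip-gram objective can be sketched as a tiny PyTorch model (illustrative only; the vocabulary size, dimensions, and `pair_loss` helper are assumptions, and real word2vec training uses far larger corpora and carefully tuned negative sampling):

```python
import torch
import torch.nn as nn

# Skip-gram with negative sampling, minimal sketch: score real (center,
# context) pairs high and randomly sampled "negative" pairs low.
vocab_size, dim = 1000, 50
center_emb = nn.Embedding(vocab_size, dim)
context_emb = nn.Embedding(vocab_size, dim)

def pair_loss(center, context, negatives):
    """Logistic loss over one positive pair and k sampled negatives."""
    c = center_emb(center)                                    # (batch, dim)
    pos = (c * context_emb(context)).sum(-1)                  # (batch,)
    neg = (c.unsqueeze(1) * context_emb(negatives)).sum(-1)   # (batch, k)
    return -(torch.nn.functional.logsigmoid(pos).mean()
             + torch.nn.functional.logsigmoid(-neg).mean())

center = torch.tensor([3, 7])                  # toy word indices
context = torch.tensor([4, 8])
negatives = torch.randint(0, vocab_size, (2, 5))
loss = pair_loss(center, context, negatives)
loss.backward()  # gradients flow into both embedding tables
```

After training on a large corpus, `center_emb.weight` holds the word vectors used for analogies like the one above.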
Word embeddings provide a single representation per word, regardless of context. “Bank” (financial institution) and “bank” (river bank) get the same vector. This motivated the development of contextual embeddings.
Peters et al. introduced ELMo (Embeddings from Language Models): train a deep bidirectional LSTM language model, and use the hidden states from all layers as the word representation.
\[\text{ELMo}_k = \gamma \sum_{l=0}^{L} s_l \cdot h_{k,l}\]

where $h_{k,l}$ is the hidden state for word $k$ at layer $l$, $s_l$ are softmax-normalized learnable mixture weights, and $\gamma$ is a task-specific scaling factor.
ELMo provided different representations for the same word in different contexts, capturing polysemy. Adding ELMo features improved performance on virtually every NLP benchmark.
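The mixing formula can be computed directly from a stack of layer states (a sketch with assumed toy shapes; following the original paper, the weights $s_l$ are softmax-normalized):

```python
import torch

# ELMo layer mixing: combine per-layer hidden states h_{k,l} with
# softmax-normalized learnable weights s_l and a scalar gamma.
L, seq_len, dim = 3, 5, 8                     # layers 0..L, toy sizes
hidden = torch.randn(L + 1, seq_len, dim)     # h_{k,l} for every position k
w = torch.zeros(L + 1, requires_grad=True)    # raw mixture logits
s = torch.softmax(w, dim=0)                   # s_l, sums to 1
gamma = torch.tensor(1.0, requires_grad=True)

elmo = gamma * (s.view(-1, 1, 1) * hidden).sum(0)   # (seq_len, dim)
```

With uniform initial weights this is just the average of the layer states; during fine-tuning the task learns which layers to emphasize.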
Howard and Ruder’s Universal Language Model Fine-tuning (ULMFiT) showed how to effectively transfer a pretrained language model to classification tasks. Key innovations: discriminative fine-tuning (a separate learning rate per layer), slanted triangular learning rates (a short warm-up followed by a long decay), and gradual unfreezing.
These techniques prevented catastrophic forgetting and made fine-tuning language models practical with very little labeled data.
Devlin et al. from Google introduced BERT (Bidirectional Encoder Representations from Transformers), which changed the landscape of NLP. BERT is covered in detail in Chapter 11, but its role as a foundation model is key here.
BERT introduced the modern transfer learning pipeline for NLP: pretrain a Transformer encoder with masked language modeling (plus next-sentence prediction) on huge unlabeled corpora, then fine-tune the entire model with a small task-specific head.
This pattern achieved state-of-the-art results on 11 NLP benchmarks simultaneously, often with just a few thousand labeled examples.
OpenAI’s GPT took a different approach: pretrain a Transformer decoder with standard left-to-right (autoregressive) language modeling, then fine-tune on each downstream task using a simple task-specific input transformation.
GPT showed that generative pretraining produces excellent features for discriminative tasks too. The GPT line (GPT-2, GPT-3, GPT-4) would eventually shift from fine-tuning to in-context learning (Chapter 12).
Bommasani et al. (2021) coined the term foundation model: a model trained on broad data at scale that can be adapted to a wide range of downstream tasks.
Properties of foundation models: they are trained once at great expense and adapted cheaply many times; scale gives rise to emergent capabilities absent in smaller models; and homogenization means many downstream systems inherit the strengths and flaws of a few shared base models.
- **Full fine-tuning:** Update all model parameters on the downstream dataset. Most effective when you have enough data, but expensive for very large models.
- **Linear probing:** Freeze the entire model and train only a linear classifier on top. Surprisingly effective for well-trained foundation models.
For models with billions of parameters, full fine-tuning is prohibitively expensive. PEFT methods modify only a small fraction of parameters:
**LoRA:** Low-Rank Adaptation freezes pretrained weights and adds trainable low-rank decomposition matrices:
\[W' = W + BA\]

where $W \in \mathbb{R}^{d \times k}$, $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$.
For a model with billions of parameters, LoRA might train only 0.1–1% of the total parameters with minimal performance loss.
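A minimal LoRA wrapper around a frozen linear layer might look like this (a sketch assuming the common zero-initialization of $B$ and an `alpha / r` scaling; names are illustrative):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of W' = W + BA around a frozen nn.Linear (W stays fixed)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # frozen pretrained W
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # trainable, r << min(d, k)
        self.B = nn.Parameter(torch.zeros(d, r))         # zero-init: update starts at 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(512, 512), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
# trainable / total is about 3% at this width; the savings grow with layer size
```

Because $B$ starts at zero, the wrapped layer initially computes exactly the pretrained function, so fine-tuning begins from the pretrained behavior.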
**Prompt tuning:** Add learnable “virtual tokens” to the input, keeping the entire model frozen. The model’s behavior is controlled entirely through these learned prefixes.
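A sketch of the idea, assuming we can inject embeddings directly into the frozen model (the shapes and the `prepend_prompt` helper are illustrative):

```python
import torch
import torch.nn as nn

# Prompt tuning sketch: learn n_virtual "soft token" embeddings and prepend
# them to the frozen model's input embeddings; only soft_prompt is trained.
n_virtual, dim = 10, 64
soft_prompt = nn.Parameter(torch.randn(n_virtual, dim) * 0.02)

def prepend_prompt(input_embeds):                     # (batch, seq, dim)
    batch = input_embeds.shape[0]
    prompt = soft_prompt.unsqueeze(0).expand(batch, -1, -1)
    return torch.cat([prompt, input_embeds], dim=1)

x = torch.randn(2, 7, dim)   # stand-in for embeddings from a frozen model
out = prepend_prompt(x)      # (2, 17, dim); gradients reach only soft_prompt
```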
**Adapters:** Insert small trainable bottleneck modules between frozen pretrained layers:
\[\text{Adapter}(x) = x + f(xW_{\text{down}})W_{\text{up}}\]

where $W_{\text{down}} \in \mathbb{R}^{d \times r}$ and $W_{\text{up}} \in \mathbb{R}^{r \times d}$ with $r \ll d$.
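The bottleneck structure translates directly to code (a sketch; the GELU nonlinearity and zero-initialization of $W_{\text{up}}$ are common choices, not requirements):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: x + f(x W_down) W_up with r << d."""
    def __init__(self, d: int, r: int):
        super().__init__()
        self.down = nn.Linear(d, r, bias=False)   # W_down
        self.up = nn.Linear(r, d, bias=False)     # W_up
        nn.init.zeros_(self.up.weight)            # start as the identity map
        self.act = nn.GELU()                      # f

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

adapter = Adapter(d=768, r=64)
x = torch.randn(2, 768)
# with W_up zero-initialized, the adapter initially passes x through unchanged
```

The residual connection plus zero-initialized up-projection means inserting adapters does not disturb the pretrained model until training begins.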
| Method | Parameters Trained | Performance | Memory | Use Case |
|---|---|---|---|---|
| Full fine-tuning | 100% | Best | High | Sufficient data + compute |
| Linear probing | <0.01% | Good | Low | Quick evaluation |
| LoRA | 0.1–1% | Near full FT | Low | Production fine-tuning |
| Adapters | 1–5% | Good | Medium | Multi-task serving |
| Prompt tuning | <0.1% | Moderate | Very low | Few-shot adaptation |
Dosovitskiy et al. showed that Transformers pretrained on large image datasets transfer exceptionally well. ViT splits images into patches and processes them as tokens:
\[\text{ViT}(x) = \text{Transformer}([\text{CLS}; x_1^p E; x_2^p E; \dots] + E_{\text{pos}})\]

Pretrained on ImageNet-21k (14M images), ViT fine-tunes to achieve state-of-the-art results on many vision benchmarks with fewer resources than training from scratch.
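The patch-and-project step can be sketched with `Tensor.unfold` (toy sizes; the real ViT uses larger images, patches, and embedding dimensions):

```python
import torch
import torch.nn as nn

# ViT tokenization sketch: split the image into patches, flatten each,
# project with E, prepend a CLS token, and add positional embeddings.
img_size, patch, dim = 32, 8, 64
n_patches = (img_size // patch) ** 2                 # 16 patches per image
E = nn.Linear(3 * patch * patch, dim)                # patch projection E
cls = nn.Parameter(torch.zeros(1, 1, dim))           # learnable CLS token
E_pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))

x = torch.randn(2, 3, img_size, img_size)
patches = x.unfold(2, patch, patch).unfold(3, patch, patch)   # (2,3,4,4,8,8)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(2, n_patches, -1)
tokens = torch.cat([cls.expand(2, -1, -1), E(patches)], dim=1) + E_pos
# tokens: (batch, n_patches + 1, dim), ready for a standard Transformer encoder
```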
Radford et al. trained an image encoder and a text encoder jointly to match image-text pairs from the internet (400M pairs). CLIP enables zero-shot image classification from natural-language prompts, image-text retrieval, and strong transfer to downstream vision tasks.
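CLIP’s symmetric contrastive objective can be sketched as follows (illustrative; the encoders are replaced by random stand-in embeddings, and the temperature value is an assumption):

```python
import torch
import torch.nn.functional as F

# Contrastive matching sketch: embeddings of matched image-text pairs are
# pulled together, all other pairings in the batch pushed apart.
def clip_loss(img_emb, txt_emb, temperature=0.07):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature        # (batch, batch) similarities
    labels = torch.arange(len(logits))        # i-th image matches i-th text
    return (F.cross_entropy(logits, labels)   # image -> text direction
            + F.cross_entropy(logits.T, labels)) / 2  # text -> image

img_emb = torch.randn(4, 128)   # stand-ins for image encoder outputs
txt_emb = torch.randn(4, 128)   # stand-ins for text encoder outputs
loss = clip_loss(img_emb, txt_emb)
```

Zero-shot classification reuses the same similarity matrix at inference: embed class names as text prompts and pick the class whose text embedding is closest to the image embedding.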
```python
# See code/ch10_transfer.py for full training pipeline
import torch
import torch.nn as nn
from torchvision import models

num_classes = 10  # placeholder: set to your task's number of classes

# Load pretrained ResNet, replace final layer
# (newer torchvision prefers weights=models.ResNet50_Weights.DEFAULT)
model = models.resnet50(pretrained=True)
for param in model.parameters():
    param.requires_grad = False  # Freeze all layers

# Replace classifier head (new layers default to requires_grad=True)
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the new head will be trained
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```
Transfer learning fundamentally changed how deep learning is practiced:
| Old Paradigm | New Paradigm |
|---|---|
| Train from scratch | Pretrain + fine-tune |
| Need large labeled dataset | Need small labeled dataset |
| Design task-specific architecture | Use general foundation model |
| Each task starts from zero | Knowledge transfers between tasks |
| Compute at training time | Compute amortized across users |
The realization that a single large pretrained model can serve as the starting point for virtually any task is one of the most important insights in modern AI.