Chapter 10: Transfer Learning and Foundation Models

The Old Way: Train From Scratch

Before 2018, the standard approach for any new task was:

  1. Collect a labeled dataset for your specific problem
  2. Design a neural network architecture
  3. Initialize weights randomly
  4. Train from scratch on your data
  5. Hope you have enough data (usually you didn’t)

This approach required enormous labeled datasets, significant computational resources, and deep expertise in neural network design. It was wasteful: every model started from zero, learning basic visual features or language patterns from scratch.

The Transfer Learning Revolution

The Key Insight

Lower layers of neural networks learn general features — edges, textures, and shapes for vision; syntax, word relationships, and grammar for language. These features are useful across many tasks. Only the higher layers need to be task-specific.

Why not train once on a large dataset, learn these general features, and then transfer them to new tasks?

ImageNet Pretraining for Computer Vision (2014–)

The practice of transfer learning took off in computer vision:

  1. Pretrain a CNN (ResNet, VGG, etc.) on ImageNet (1.2M images, 1000 classes)
  2. Remove the final classification layer
  3. Add a new classification head for your task
  4. Fine-tune the entire network (or just the head) on your smaller dataset

This approach worked astonishingly well:

| Approach | Data Needed | Performance |
|---|---|---|
| Train from scratch | 100,000+ images | Moderate |
| Fine-tune ImageNet model | 1,000–10,000 images | Often better |
| Fine-tune, freeze lower layers | 100–1,000 images | Surprisingly good |

Feature Extraction vs. Fine-Tuning

Feature extraction: Freeze all pretrained layers and only train the new head. The pretrained network is used as a fixed feature extractor.

Fine-tuning: Unfreeze some or all pretrained layers and train with a small learning rate. This allows the pretrained features to adapt to the new domain.

Gradual unfreezing: Start by training only the head, then progressively unfreeze deeper layers. This prevents catastrophic forgetting of pretrained features.
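The three strategies can be sketched with a toy model. This is a minimal illustration, not a full training loop; the four-layer `nn.Sequential` stands in for a pretrained backbone plus a new head, and the staging schedule is illustrative.

```python
# Sketch of gradual unfreezing: train the head first, then unfreeze one
# pretrained block per stage, working from the top of the network down.
import torch.nn as nn

model = nn.Sequential(                  # stand-in for a pretrained backbone
    nn.Linear(16, 16), nn.Linear(16, 16), nn.Linear(16, 16),
    nn.Linear(16, 4),                   # new task head (last module)
)
blocks = list(model.children())

for p in model.parameters():
    p.requires_grad = False             # stage 0: everything frozen...
for p in blocks[-1].parameters():
    p.requires_grad = True              # ...except the new head

def unfreeze_next(stage):
    """At stage s, unfreeze the s-th block counting down from the head."""
    for p in blocks[-(stage + 1)].parameters():
        p.requires_grad = True

unfreeze_next(1)  # after a few epochs, unfreeze the block below the head
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(trainable)  # head (68) + one unfrozen block (272)
```

In a real pipeline each stage would run for one or more epochs before the next block is unfrozen.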

Word Embeddings: The First Language Transfer (2013–2016)

Before Transformers, the first form of language transfer was through word embeddings:

Word2Vec (2013)

Mikolov et al. trained shallow neural networks on massive text corpora to learn dense vector representations of words. Two training strategies were used: CBOW (predict a word from its surrounding context) and skip-gram (predict the context words from the center word).

The resulting vectors captured semantic relationships:

\[\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}\]
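The analogy can be demonstrated with vector arithmetic and cosine similarity. The 4-dimensional vectors below are hypothetical toy values chosen to make the analogy hold, not real Word2Vec embeddings.

```python
# Toy illustration of word-vector arithmetic with hypothetical embeddings.
import numpy as np

embeddings = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "man":   np.array([0.5, 0.1, 0.1, 0.3]),
    "woman": np.array([0.5, 0.1, 0.9, 0.3]),
    "queen": np.array([0.9, 0.8, 0.9, 0.2]),
    "apple": np.array([0.1, 0.9, 0.2, 0.8]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land nearest to queen
query = embeddings["king"] - embeddings["man"] + embeddings["woman"]
best = max((w for w in embeddings if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(query, embeddings[w]))
print(best)  # queen
```

With real Word2Vec vectors the same nearest-neighbor query (excluding the three input words) famously returns "queen".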

GloVe (2014)

Pennington et al. learned word vectors from global co-occurrence statistics. The objective ensures that the dot product of two word vectors is proportional to the log probability of their co-occurrence.

The Limitation

Word embeddings provide a single representation per word, regardless of context. “Bank” (financial institution) and “bank” (river bank) get the same vector. This motivated the development of contextual embeddings.

ELMo: Contextualized Word Representations (2018)

Peters et al. introduced ELMo (Embeddings from Language Models): train a deep bidirectional LSTM language model, and use the hidden states from all layers as the word representation.

\[\text{ELMo}_k = \gamma \sum_{l=0}^{L} s_l \cdot h_{k,l}\]

where $h_{k,l}$ is the hidden state for word $k$ at layer $l$, $s_l$ are learnable mixture weights, and $\gamma$ is a task-specific scaling factor.

ELMo provided different representations for the same word in different contexts, capturing polysemy. Adding ELMo features improved performance on virtually every NLP benchmark.
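The mixing formula is a small weighted sum, sketched below in NumPy for a single token. The hidden states are random stand-ins; in ELMo the mixture weights $s_l$ are softmax-normalized, so the uniform initialization here reduces to a plain layer average.

```python
# Sketch of the ELMo mixing formula for one token k, with hidden states given
# as an (L+1, d) array (layer 0 = the context-independent embedding).
import numpy as np

rng = np.random.default_rng(0)
L, d = 2, 8                          # 2 biLSTM layers + embedding layer
h_k = rng.normal(size=(L + 1, d))    # h_{k,l}: one row per layer

s_raw = np.zeros(L + 1)              # learnable logits, softmax-normalized
s = np.exp(s_raw) / np.exp(s_raw).sum()
gamma = 1.0                          # task-specific scaling factor

elmo_k = gamma * (s[:, None] * h_k).sum(axis=0)
print(elmo_k.shape)  # (8,)
```

During fine-tuning for a downstream task, only `s_raw` and `gamma` are trained; the language model itself stays frozen.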

ULMFiT: Transfer Learning for Text (2018)

Howard and Ruder’s Universal Language Model Fine-tuning showed how to effectively transfer a pretrained language model to classification tasks. Key innovations:

  1. Discriminative fine-tuning: Use different learning rates for different layers (lower layers = smaller rate)
  2. Slanted triangular learning rates: Warmup followed by linear decay
  3. Gradual unfreezing: Unfreeze one layer at a time, from top to bottom

These techniques prevented catastrophic forgetting and made fine-tuning language models practical with very little labeled data.
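The first two techniques can be written down directly from the paper's formulas. The hyperparameters below (`T`, `cut_frac`, `ratio`, the layer count, and the 2.6 decay factor) follow Howard and Ruder's reported choices, but the specific values are illustrative.

```python
# Sketch of ULMFiT's learning-rate tricks (Howard & Ruder, 2018).
def slanted_triangular(t, T=100, cut_frac=0.1, lr_max=0.01, ratio=32):
    """Linear warmup for cut_frac*T steps, then linear decay to lr_max/ratio."""
    cut = int(T * cut_frac)
    p = t / cut if t < cut else 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio

def layer_lrs(lr_top=0.01, n_layers=4, decay=2.6):
    """Discriminative fine-tuning: each lower layer gets the rate above it / 2.6."""
    return [lr_top / decay ** (n_layers - 1 - i) for i in range(n_layers)]

print(round(slanted_triangular(10), 4))      # peaks at lr_max when t == cut
print([round(lr, 5) for lr in layer_lrs()])  # smallest rate at the bottom
```

In PyTorch, the per-layer rates would map naturally onto optimizer parameter groups, and the schedule onto a `LambdaLR`-style scheduler.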

The Foundation Model Paradigm

BERT: Bidirectional Transformers (2018)

Devlin et al. from Google introduced BERT (Bidirectional Encoder Representations from Transformers), which changed the landscape of NLP. BERT is covered in detail in Chapter 11, but its role as a foundation model is key here.

BERT introduced the modern transfer learning pipeline for NLP:

  1. Pre-train a large Transformer on massive unlabeled text with self-supervised objectives
  2. Fine-tune with a simple task-specific head on a small labeled dataset

This pattern achieved state-of-the-art results on 11 NLP benchmarks simultaneously, often with just a few thousand labeled examples.

GPT: Generative Pre-Training (2018)

OpenAI’s GPT took a different approach:

  1. Pre-train a Transformer decoder with autoregressive language modeling (predict the next word)
  2. Fine-tune on downstream tasks

GPT showed that generative pretraining produces excellent features for discriminative tasks too. The GPT line (GPT-2, GPT-3, GPT-4) would eventually shift from fine-tuning to in-context learning (Chapter 12).

The Foundation Model Concept

Bommasani et al. (2021) coined the term foundation model: a model trained on broad data at scale that can be adapted to a wide range of downstream tasks.

Properties of foundation models:

  1. Trained on broad, largely unlabeled data at enormous scale
  2. Adaptable to a wide range of downstream tasks
  3. Capabilities often emerge with scale rather than being explicitly designed in
  4. Homogenization: many applications inherit the strengths and flaws of a few shared models

Adaptation Strategies for Foundation Models

Full Fine-Tuning

Update all model parameters on the downstream dataset. Most effective when you have enough data, but expensive for very large models.

Linear Probing

Freeze the entire model and train only a linear classifier on top. Surprisingly effective for well-trained foundation models.

Parameter-Efficient Fine-Tuning (PEFT)

For models with billions of parameters, full fine-tuning is prohibitively expensive. PEFT methods modify only a small fraction of parameters:

LoRA (2021)

Low-Rank Adaptation freezes pretrained weights and adds trainable low-rank decomposition matrices:

\[W' = W + BA\]

where $W \in \mathbb{R}^{d \times k}$, $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$.

For a model with billions of parameters, LoRA might train only 0.1–1% of the total parameters with minimal performance loss.
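The update rule is simple enough to sketch in a few lines of NumPy. The dimensions and rank below are illustrative; as in the LoRA paper, `B` starts at zero so the adapted weight equals the pretrained weight at initialization.

```python
# Minimal LoRA sketch: the frozen weight W is perturbed by a trainable
# low-rank product BA, following W' = W + BA from the equation above.
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 32, 4                  # r << min(d, k)

W = rng.normal(size=(d, k))          # frozen pretrained weight
B = np.zeros((d, r))                 # init to zero => W' = W at start
A = rng.normal(size=(r, k)) * 0.01   # trainable

W_prime = W + B @ A
trainable = B.size + A.size
print(trainable, W.size)  # 384 trainable vs 2048 frozen parameters
```

For a real transformer with `d` and `k` in the thousands, the ratio `r * (d + k) / (d * k)` shrinks to a fraction of a percent, which is where the 0.1–1% figure comes from.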

Prefix Tuning / Prompt Tuning

Add learnable “virtual tokens” to the input, keeping the entire model frozen. The model’s behavior is controlled entirely through these learned prefixes.
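Mechanically, prompt tuning just concatenates a small learned matrix in front of the token embeddings. The sketch below shows only the shapes involved; the prompt length and embedding dimension are illustrative.

```python
# Prompt-tuning sketch: prepend n learned "virtual token" embeddings to the
# frozen model's input embeddings.
import numpy as np

rng = np.random.default_rng(0)
n_prompt, seq_len, d = 8, 20, 64

prompt = rng.normal(size=(n_prompt, d))   # the ONLY trainable parameters
tokens = rng.normal(size=(seq_len, d))    # output of the frozen embedding lookup

model_input = np.concatenate([prompt, tokens], axis=0)
print(model_input.shape)  # (28, 64): 8 virtual tokens + 20 real tokens
```

Only `prompt` receives gradients; the transformer, its embeddings, and its output head all stay frozen.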

Adapters

Insert small trainable bottleneck modules between frozen pretrained layers:

\[\text{Adapter}(x) = x + f(xW_{\text{down}})W_{\text{up}}\]

where $W_{\text{down}} \in \mathbb{R}^{d \times r}$ and $W_{\text{up}} \in \mathbb{R}^{r \times d}$ with $r \ll d$.
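The adapter equation translates directly into code. This sketch uses ReLU for the nonlinearity $f$ and zero-initializes the up-projection so the module starts as an identity function, a common choice for stable fine-tuning; both are assumptions, not fixed by the formula.

```python
# Adapter sketch matching the equation above: bottleneck down-projection,
# nonlinearity, up-projection, and a residual connection.
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8                          # r << d

W_down = rng.normal(size=(d, r)) * 0.01
W_up = np.zeros((r, d))               # zero init => adapter starts as identity

def adapter(x):
    return x + np.maximum(x @ W_down, 0) @ W_up

x = rng.normal(size=(5, d))
print(np.allclose(adapter(x), x))  # True at initialization
```

One such module is inserted after each (frozen) attention and feed-forward sublayer, so the trainable parameter count scales with `2 * d * r` per insertion point.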

Comparison of Adaptation Methods

| Method | Parameters Trained | Performance | Memory | Use Case |
|---|---|---|---|---|
| Full fine-tuning | 100% | Best | High | Sufficient data + compute |
| Linear probing | <0.01% | Good | Low | Quick evaluation |
| LoRA | 0.1–1% | Near full FT | Low | Production fine-tuning |
| Adapters | 1–5% | Good | Medium | Multi-task serving |
| Prompt tuning | <0.1% | Moderate | Very low | Few-shot adaptation |

Transfer Learning Across Modalities

Vision Transformers (ViT, 2020)

Dosovitskiy et al. showed that Transformers pretrained on large image datasets transfer exceptionally well. ViT splits images into patches and processes them as tokens:

\[\text{ViT}(x) = \text{Transformer}([\text{CLS}; x_1^p E; x_2^p E; ...] + E_{\text{pos}})\]

Pretrained on ImageNet-21k (14M images), ViT fine-tunes to achieve state-of-the-art results on many vision benchmarks with fewer resources than training from scratch.
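The patchification step can be sketched with plain array reshaping. The image and patch sizes below are illustrative (ViT-Base uses 16×16 patches on 224×224 inputs); the subsequent linear projection $E$ and position embeddings are omitted.

```python
# Sketch of ViT's patchification: split an image into non-overlapping p x p
# patches and flatten each one into a token.
import numpy as np

H = W = 32; C = 3; p = 8
img = np.arange(H * W * C, dtype=float).reshape(H, W, C)

patches = (img.reshape(H // p, p, W // p, p, C)
              .transpose(0, 2, 1, 3, 4)     # group the two patch-grid axes
              .reshape(-1, p * p * C))      # one flattened token per patch
print(patches.shape)  # (16, 192): 16 tokens, each a flattened 8x8x3 patch
```

Each row is then multiplied by the patch-embedding matrix $E$ and prepended with the `[CLS]` token before entering the Transformer.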

CLIP: Connecting Vision and Language (2021)

Radford et al. trained an image encoder and a text encoder jointly to match image-text pairs from the internet (400M pairs). CLIP enables:

  1. Zero-shot image classification: compare an image embedding against embeddings of text prompts such as “a photo of a dog”, with no task-specific training
  2. Image-text retrieval in both directions
  3. Transferable visual representations for downstream vision tasks
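The core of CLIP's training objective is a similarity matrix between normalized image and text embeddings, with matched pairs on the diagonal. The embeddings below are random stand-ins; the temperature of 0.07 follows the paper's initialization, but everything else here is a simplified sketch of the symmetric contrastive loss.

```python
# Sketch of CLIP's contrastive objective for a batch of n image-text pairs.
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 16

img = rng.normal(size=(n, d)); txt = rng.normal(size=(n, d))
img /= np.linalg.norm(img, axis=1, keepdims=True)   # unit-normalize
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

logits = img @ txt.T / 0.07          # cosine similarities / temperature

def xent_diag(l):
    """Cross-entropy with the diagonal (matched pair) as the target class."""
    p = np.exp(l - l.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return -np.log(np.diag(p)).mean()

loss = (xent_diag(logits) + xent_diag(logits.T)) / 2  # image->text + text->image
print(logits.shape)  # (4, 4)
```

Zero-shot classification reuses the same machinery at inference: embed the class names as text prompts and pick the class whose text embedding is most similar to the image embedding.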

Code Example: Fine-Tuning a Pretrained Model

```python
# See code/ch10_transfer.py for full training pipeline
import torch
import torch.nn as nn
from torchvision import models

num_classes = 10  # number of target categories for your task

# Load pretrained ResNet and freeze its layers
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False  # Freeze all layers

# Replace the classifier head (new layers default to requires_grad=True)
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the new head will be trained
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```

The Paradigm Shift

Transfer learning fundamentally changed how deep learning is practiced:

| Old Paradigm | New Paradigm |
|---|---|
| Train from scratch | Pretrain + fine-tune |
| Need large labeled dataset | Need small labeled dataset |
| Design task-specific architecture | Use general foundation model |
| Each task starts from zero | Knowledge transfers between tasks |
| Compute at training time | Compute amortized across users |

The realization that a single large pretrained model can serve as the starting point for virtually any task is one of the most important insights in modern AI.

Key Takeaways

  1. Transfer learning replaced training from scratch: pretrain once on broad data, adapt many times
  2. In vision, ImageNet pretraining made strong models possible with orders of magnitude less labeled data
  3. In NLP, the path ran from static word embeddings (Word2Vec, GloVe) through contextual representations (ELMo, ULMFiT) to foundation models (BERT, GPT)
  4. Foundation models are adapted via full fine-tuning, linear probing, or parameter-efficient methods such as LoRA, adapters, and prompt tuning
  5. PEFT methods train a tiny fraction of parameters while approaching full fine-tuning performance


Previous Chapter: Generative Adversarial Networks

Next Chapter: Self-Supervised Learning — Learning Without Labels

Back to Table of Contents