Before 2018, the standard approach for any new task was to collect a large labeled dataset, design a task-specific architecture, and train it from scratch.
This approach required enormous labeled datasets, significant computational resources, and deep expertise in neural network design. It was wasteful: every model started from zero, learning basic visual features or language patterns from scratch.
Lower layers of neural networks learn general features — edges, textures, and shapes for vision; syntax, word relationships, and grammar for language. These features are useful across many tasks. Only the higher layers need to be task-specific.
Why not train once on a large dataset, learn these general features, and then transfer them to new tasks?
The practice of transfer learning took off in computer vision: pretrain a convolutional network on ImageNet, replace the final classification layer with a new head for the target task, and fine-tune on the (much smaller) target dataset.
This approach worked astonishingly well:
| Approach | Data Needed | Performance |
|---|---|---|
| Train from scratch | 100,000+ images | Moderate |
| Fine-tune ImageNet model | 1,000–10,000 images | Often better |
| Fine-tune, freeze lower layers | 100–1,000 images | Surprisingly good |
- **Feature extraction:** Freeze all pretrained layers and train only the new head. The pretrained network is used as a fixed feature extractor.
- **Fine-tuning:** Unfreeze some or all pretrained layers and train with a small learning rate. This allows the pretrained features to adapt to the new domain.
- **Gradual unfreezing:** Start by training only the head, then progressively unfreeze deeper layers. This prevents catastrophic forgetting of pretrained features.
Before Transformers, the first form of language transfer was through word embeddings:
Mikolov et al. trained shallow neural networks on massive text corpora to learn dense vector representations of words, using two training strategies: continuous bag-of-words (CBOW), which predicts a word from its surrounding context, and skip-gram, which predicts the surrounding context words from the center word.
The resulting vectors captured semantic relationships:
\[\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}\]

Pennington et al.’s GloVe learned word vectors from global co-occurrence statistics. The objective ensures that the dot product of two word vectors is proportional to the log probability of their co-occurrence.
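The skip-gram objective can be sketched as a tiny PyTorch model (illustrative only; the vocabulary size, dimensions, and `pair_loss` helper are assumptions, and real word2vec training uses far larger corpora and carefully tuned negative sampling):

```python
import torch
import torch.nn as nn

# Skip-gram with negative sampling, minimal sketch: score real (center,
# context) pairs high and randomly sampled "negative" pairs low.
vocab_size, dim = 1000, 50
center_emb = nn.Embedding(vocab_size, dim)
context_emb = nn.Embedding(vocab_size, dim)

def pair_loss(center, context, negatives):
    """Logistic loss over one positive pair and k sampled negatives."""
    c = center_emb(center)                                    # (batch, dim)
    pos = (c * context_emb(context)).sum(-1)                  # (batch,)
    neg = (c.unsqueeze(1) * context_emb(negatives)).sum(-1)   # (batch, k)
    return -(torch.nn.functional.logsigmoid(pos).mean()
             + torch.nn.functional.logsigmoid(-neg).mean())

center = torch.tensor([3, 7])                  # toy word indices
context = torch.tensor([4, 8])
negatives = torch.randint(0, vocab_size, (2, 5))
loss = pair_loss(center, context, negatives)
loss.backward()  # gradients flow into both embedding tables
```

After training on a large corpus, `center_emb.weight` holds the word vectors used for analogies like the one above.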
Word embeddings provide a single representation per word, regardless of context. “Bank” (financial institution) and “bank” (river bank) get the same vector. This motivated the development of contextual embeddings.
Peters et al. introduced ELMo (Embeddings from Language Models): train a deep bidirectional LSTM language model, and use the hidden states from all layers as the word representation.
\[\text{ELMo}_k = \gamma \sum_{l=0}^{L} s_l \cdot h_{k,l}\]

where $h_{k,l}$ is the hidden state for word $k$ at layer $l$, $s_l$ are softmax-normalized learnable mixture weights, and $\gamma$ is a task-specific scaling factor.
ELMo provided different representations for the same word in different contexts, capturing polysemy. Adding ELMo features improved performance on virtually every NLP benchmark.
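The mixing formula can be computed directly from a stack of layer states (a sketch with assumed toy shapes; following the original paper, the weights $s_l$ are softmax-normalized):

```python
import torch

# ELMo layer mixing: combine per-layer hidden states h_{k,l} with
# softmax-normalized learnable weights s_l and a scalar gamma.
L, seq_len, dim = 3, 5, 8                     # layers 0..L, toy sizes
hidden = torch.randn(L + 1, seq_len, dim)     # h_{k,l} for every position k
w = torch.zeros(L + 1, requires_grad=True)    # raw mixture logits
s = torch.softmax(w, dim=0)                   # s_l, sums to 1
gamma = torch.tensor(1.0, requires_grad=True)

elmo = gamma * (s.view(-1, 1, 1) * hidden).sum(0)   # (seq_len, dim)
```

With uniform initial weights this is just the average of the layer states; during fine-tuning the task learns which layers to emphasize.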
Howard and Ruder’s Universal Language Model Fine-tuning (ULMFiT) showed how to effectively transfer a pretrained language model to classification tasks. Key innovations: discriminative fine-tuning (a separate learning rate per layer), slanted triangular learning rates (a short warm-up followed by a long decay), and gradual unfreezing.
These techniques prevented catastrophic forgetting and made fine-tuning language models practical with very little labeled data.
Devlin et al. from Google introduced BERT (Bidirectional Encoder Representations from Transformers), which changed the landscape of NLP. BERT is covered in detail in Chapter 11, but its role as a foundation model is key here.
BERT introduced the modern transfer learning pipeline for NLP: pretrain a Transformer encoder with masked language modeling (plus next-sentence prediction) on huge unlabeled corpora, then fine-tune the entire model with a small task-specific head.
This pattern achieved state-of-the-art results on 11 NLP benchmarks simultaneously, often with just a few thousand labeled examples.
OpenAI’s GPT took a different approach: pretrain a Transformer decoder with standard left-to-right (autoregressive) language modeling, then fine-tune on each downstream task using a simple task-specific input transformation.
GPT showed that generative pretraining produces excellent features for discriminative tasks too. The GPT line (GPT-2, GPT-3, GPT-4) would eventually shift from fine-tuning to in-context learning (Chapter 12).
Bommasani et al. (2021) coined the term foundation model: a model trained on broad data at scale that can be adapted to a wide range of downstream tasks.
Properties of foundation models: they are trained once at great expense and adapted cheaply many times; scale gives rise to emergent capabilities absent in smaller models; and homogenization means many downstream systems inherit the strengths and flaws of a few shared base models.
- **Full fine-tuning:** Update all model parameters on the downstream dataset. Most effective when you have enough data, but expensive for very large models.
- **Linear probing:** Freeze the entire model and train only a linear classifier on top. Surprisingly effective for well-trained foundation models.
For models with billions of parameters, full fine-tuning is prohibitively expensive. PEFT methods modify only a small fraction of parameters:
**LoRA:** Low-Rank Adaptation freezes pretrained weights and adds trainable low-rank decomposition matrices:
\[W' = W + BA\]

where $W \in \mathbb{R}^{d \times k}$, $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$.
For a model with billions of parameters, LoRA might train only 0.1–1% of the total parameters with minimal performance loss.
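A minimal LoRA wrapper around a frozen linear layer might look like this (a sketch assuming the common zero-initialization of $B$ and an `alpha / r` scaling; names are illustrative):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of W' = W + BA around a frozen nn.Linear (W stays fixed)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # frozen pretrained W
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # trainable, r << min(d, k)
        self.B = nn.Parameter(torch.zeros(d, r))         # zero-init: update starts at 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(512, 512), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
# trainable / total is about 3% at this width; the savings grow with layer size
```

Because $B$ starts at zero, the wrapped layer initially computes exactly the pretrained function, so fine-tuning begins from the pretrained behavior.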
**Prompt tuning:** Add learnable “virtual tokens” to the input, keeping the entire model frozen. The model’s behavior is controlled entirely through these learned prefixes.
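A sketch of the idea, assuming we can inject embeddings directly into the frozen model (the shapes and the `prepend_prompt` helper are illustrative):

```python
import torch
import torch.nn as nn

# Prompt tuning sketch: learn n_virtual "soft token" embeddings and prepend
# them to the frozen model's input embeddings; only soft_prompt is trained.
n_virtual, dim = 10, 64
soft_prompt = nn.Parameter(torch.randn(n_virtual, dim) * 0.02)

def prepend_prompt(input_embeds):                     # (batch, seq, dim)
    batch = input_embeds.shape[0]
    prompt = soft_prompt.unsqueeze(0).expand(batch, -1, -1)
    return torch.cat([prompt, input_embeds], dim=1)

x = torch.randn(2, 7, dim)   # stand-in for embeddings from a frozen model
out = prepend_prompt(x)      # (2, 17, dim); gradients reach only soft_prompt
```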
**Adapters:** Insert small trainable bottleneck modules between frozen pretrained layers:
\[\text{Adapter}(x) = x + f(xW_{\text{down}})W_{\text{up}}\]

where $W_{\text{down}} \in \mathbb{R}^{d \times r}$ and $W_{\text{up}} \in \mathbb{R}^{r \times d}$ with $r \ll d$.
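The bottleneck structure translates directly to code (a sketch; the GELU nonlinearity and zero-initialization of $W_{\text{up}}$ are common choices, not requirements):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: x + f(x W_down) W_up with r << d."""
    def __init__(self, d: int, r: int):
        super().__init__()
        self.down = nn.Linear(d, r, bias=False)   # W_down
        self.up = nn.Linear(r, d, bias=False)     # W_up
        nn.init.zeros_(self.up.weight)            # start as the identity map
        self.act = nn.GELU()                      # f

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

adapter = Adapter(d=768, r=64)
x = torch.randn(2, 768)
# with W_up zero-initialized, the adapter initially passes x through unchanged
```

The residual connection plus zero-initialized up-projection means inserting adapters does not disturb the pretrained model until training begins.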
| Method | Parameters Trained | Performance | Memory | Use Case |
|---|---|---|---|---|
| Full fine-tuning | 100% | Best | High | Sufficient data + compute |
| Linear probing | <0.01% | Good | Low | Quick evaluation |
| LoRA | 0.1–1% | Near full FT | Low | Production fine-tuning |
| Adapters | 1–5% | Good | Medium | Multi-task serving |
| Prompt tuning | <0.1% | Moderate | Very low | Few-shot adaptation |
Dosovitskiy et al. showed that Transformers pretrained on large image datasets transfer exceptionally well. ViT splits images into patches and processes them as tokens:
\[\text{ViT}(x) = \text{Transformer}([\text{CLS}; x_1^p E; x_2^p E; \dots] + E_{\text{pos}})\]

Pretrained on ImageNet-21k (14M images), ViT fine-tunes to achieve state-of-the-art results on many vision benchmarks with fewer resources than training from scratch.
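The patch-and-project step can be sketched with `Tensor.unfold` (toy sizes; the real ViT uses larger images, patches, and embedding dimensions):

```python
import torch
import torch.nn as nn

# ViT tokenization sketch: split the image into patches, flatten each,
# project with E, prepend a CLS token, and add positional embeddings.
img_size, patch, dim = 32, 8, 64
n_patches = (img_size // patch) ** 2                 # 16 patches per image
E = nn.Linear(3 * patch * patch, dim)                # patch projection E
cls = nn.Parameter(torch.zeros(1, 1, dim))           # learnable CLS token
E_pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))

x = torch.randn(2, 3, img_size, img_size)
patches = x.unfold(2, patch, patch).unfold(3, patch, patch)   # (2,3,4,4,8,8)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(2, n_patches, -1)
tokens = torch.cat([cls.expand(2, -1, -1), E(patches)], dim=1) + E_pos
# tokens: (batch, n_patches + 1, dim), ready for a standard Transformer encoder
```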
Radford et al. trained an image encoder and a text encoder jointly to match image-text pairs from the internet (400M pairs). CLIP enables zero-shot image classification from natural-language prompts, image-text retrieval, and strong transfer to downstream vision tasks.
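CLIP’s symmetric contrastive objective can be sketched as follows (illustrative; the encoders are replaced by random stand-in embeddings, and the temperature value is an assumption):

```python
import torch
import torch.nn.functional as F

# Contrastive matching sketch: embeddings of matched image-text pairs are
# pulled together, all other pairings in the batch pushed apart.
def clip_loss(img_emb, txt_emb, temperature=0.07):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature        # (batch, batch) similarities
    labels = torch.arange(len(logits))        # i-th image matches i-th text
    return (F.cross_entropy(logits, labels)   # image -> text direction
            + F.cross_entropy(logits.T, labels)) / 2  # text -> image

img_emb = torch.randn(4, 128)   # stand-ins for image encoder outputs
txt_emb = torch.randn(4, 128)   # stand-ins for text encoder outputs
loss = clip_loss(img_emb, txt_emb)
```

Zero-shot classification reuses the same similarity matrix at inference: embed class names as text prompts and pick the class whose text embedding is closest to the image embedding.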
```python
# See code/ch10_transfer.py for full training pipeline
import torch
import torch.nn as nn
from torchvision import models

num_classes = 10  # placeholder: set to your task's number of classes

# Load pretrained ResNet, replace final layer
# (newer torchvision prefers weights=models.ResNet50_Weights.DEFAULT)
model = models.resnet50(pretrained=True)
for param in model.parameters():
    param.requires_grad = False  # Freeze all layers

# Replace classifier head (new layers default to requires_grad=True)
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the new head will be trained
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```
Transfer learning fundamentally changed how deep learning is practiced:
| Old Paradigm | New Paradigm |
|---|---|
| Train from scratch | Pretrain + fine-tune |
| Need large labeled dataset | Need small labeled dataset |
| Design task-specific architecture | Use general foundation model |
| Each task starts from zero | Knowledge transfers between tasks |
| Compute at training time | Compute amortized across users |
The realization that a single large pretrained model can serve as the starting point for virtually any task is one of the most important insights in modern AI.