Supervised learning requires labeled data, and labels are expensive. ImageNet took years of human annotation effort. Medical image datasets require expert doctors. Sentiment analysis needs human readers. At internet scale, labeling even a fraction of available data is impossible.
Yet the internet contains billions of images, trillions of words, and millions of hours of video. Self-supervised learning asks: can we learn powerful representations from this unlabeled data?
The answer, it turned out, was a resounding yes — and self-supervised pretraining became the engine behind the most powerful AI models ever built.
Self-supervised learning generates supervisory signals from the data itself by hiding part of the input and training the model to predict the hidden part.
No human labels required. The structure of the data provides the supervision.
Devlin et al. introduced BERT (Bidirectional Encoder Representations from Transformers) with two pretraining objectives: masked language modeling (MLM) and next-sentence prediction (NSP).
The first objective randomly masks 15% of tokens and trains the model to predict them:
```
Input:  The [MASK] sat on the [MASK]
Target: The cat sat on the mat
```
The model must use bidirectional context (both left and right) to fill in the blanks, learning deep contextual representations.
Details of the masking strategy: of the selected tokens, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged. This prevents the model from learning representations only for [MASK] tokens, which never appear at fine-tuning time.
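The 80/10/10 scheme can be sketched in a few lines of PyTorch. This is a minimal illustration; `mask_token_id` and `vocab_size` are placeholders for whatever tokenizer is in use:

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """BERT-style masking: select ~15% of positions; of those,
    80% -> [MASK], 10% -> random token, 10% -> unchanged."""
    labels = input_ids.clone()
    # Choose which positions participate in the MLM loss
    selected = torch.rand(input_ids.shape) < mlm_prob
    labels[~selected] = -100  # ignored by F.cross_entropy by default
    masked_ids = input_ids.clone()
    # 80% of selected positions become [MASK]
    mask_it = selected & (torch.rand(input_ids.shape) < 0.8)
    masked_ids[mask_it] = mask_token_id
    # Half of the remaining 20% become a random token
    rand_it = selected & ~mask_it & (torch.rand(input_ids.shape) < 0.5)
    masked_ids[rand_it] = torch.randint(vocab_size, input_ids.shape)[rand_it]
    # The rest (10% of selected) keep their original token
    return masked_ids, labels
```

Setting the label to -100 at unselected positions means the loss is computed only on the 15% of tokens the model must reconstruct.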
The second objective, next-sentence prediction: given two sentences, predict whether the second follows the first in the original text. Later work (RoBERTa) showed this objective is unnecessary and sometimes harmful.
BERT set new state-of-the-art results on 11 NLP tasks simultaneously, often by large margins.
Liu et al. showed that BERT was significantly undertrained. With better training recipes (more data, longer training, larger batches, dynamic masking, and dropping next-sentence prediction), RoBERTa improved over BERT on every benchmark without any architectural changes, highlighting the importance of training methodology.
While BERT masks tokens bidirectionally, GPT uses a simpler objective: predict the next token given all previous tokens (autoregressive language modeling):
\[P(x_1, x_2, ..., x_n) = \prod_{i=1}^{n} P(x_i | x_1, ..., x_{i-1})\]

OpenAI scaled GPT to 1.5B parameters (GPT-2) and trained it on WebText (40GB of internet text). GPT-2 demonstrated surprising zero-shot capabilities.
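Training on this factorization reduces to a shifted cross-entropy: each position's logits are scored against the token that follows it. A minimal sketch (tensor names are illustrative, not from any actual GPT codebase):

```python
import torch
import torch.nn.functional as F

def autoregressive_loss(logits, input_ids):
    """logits: (batch, seq_len, vocab); position i predicts token i+1."""
    # Drop the last prediction (nothing follows the final token)
    # and the first target (nothing predicts the first token).
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
    target = input_ids[:, 1:].reshape(-1)
    return F.cross_entropy(pred, target)
```

Note that every position contributes a training signal, which is one reason autoregressive pretraining uses data efficiently.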
| Feature | Autoregressive (GPT) | Masked (BERT) |
|---|---|---|
| Training objective | Predict next token | Predict masked tokens |
| Context | Left-to-right only | Bidirectional |
| Generation | Natural (token by token) | Awkward (iterative refinement) |
| Understanding | Good, but uni-directional | Excellent |
| Best for | Generation, few-shot | Classification, NLU |
Raffel et al. unified all NLP tasks into a single text-to-text format: every task, from translation to classification, is cast as feeding the model text and training it to generate text (e.g., input "translate English to German: That is good.", target "Das ist gut.").
T5 used a span corruption pretraining objective — masking contiguous spans of tokens rather than individual tokens. This was a generalization of BERT’s MLM that proved more effective.
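Span corruption can be sketched on plain token lists. The sentinel names follow T5's `<extra_id_n>` convention, but as a simplification the corrupted spans here are fixed by hand rather than sampled:

```python
def span_corrupt(tokens, spans):
    """T5-style span corruption: replace each (start, length) span with a
    sentinel token; the target lists each sentinel followed by the tokens
    it removed. T5 samples spans so that ~15% of tokens are corrupted."""
    source, target, pos = [], [], 0
    for i, (start, length) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        source.extend(tokens[pos:start])   # keep uncorrupted prefix
        source.append(sentinel)            # stand-in for the removed span
        target.append(sentinel)
        target.extend(tokens[start:start + length])
        pos = start + length
    source.extend(tokens[pos:])
    return source, target
```

For "Thank you for inviting me to your party last week" with spans covering "you for" and "your", the source becomes "Thank <extra_id_0> inviting me to <extra_id_1> party last week" and the target "<extra_id_0> you for <extra_id_1> your".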
Chen et al. introduced SimCLR, a simple contrastive framework for visual self-supervised learning: create two augmented views of each image, encode both, and train the network to pull representations of the same image together while pushing apart representations of different images.
The contrastive loss (NT-Xent):
\[\ell_{i,j} = -\log \frac{\exp(\text{sim}(z_i, z_j) / \tau)}{\sum_{k \neq i} \exp(\text{sim}(z_i, z_k) / \tau)}\]

where $\text{sim}(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|}$ is cosine similarity and $\tau$ is a temperature parameter.
Key insight: The quality of augmentations matters enormously. Strong augmentations (random crop, color distortion, Gaussian blur) force the model to learn semantic features rather than trivial shortcuts.
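Generating two views can be sketched in pure PyTorch. This uses a random crop, a horizontal flip, and a per-channel scaling standing in for color jitter; the crop size and distortion strengths are arbitrary choices, not SimCLR's exact settings:

```python
import torch

def two_views(img, crop=24):
    """Return two independently augmented views of one image (C, H, W)."""
    def augment(x):
        c, h, w = x.shape
        top = torch.randint(0, h - crop + 1, (1,)).item()
        left = torch.randint(0, w - crop + 1, (1,)).item()
        x = x[:, top:top + crop, left:left + crop]   # random crop
        if torch.rand(1) < 0.5:
            x = torch.flip(x, dims=[2])              # horizontal flip
        scale = 0.6 + 0.8 * torch.rand(c, 1, 1)      # crude color distortion
        return (x * scale).clamp(0, 1)
    return augment(img), augment(img)
```

In practice the torchvision transforms (RandomResizedCrop, ColorJitter, GaussianBlur) are used instead; the point is that each call draws fresh random parameters, so the two views differ.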
He et al. addressed SimCLR’s requirement for very large batch sizes by maintaining a queue of negative examples and a momentum encoder:
\[\theta_k \leftarrow m \cdot \theta_k + (1 - m) \cdot \theta_q\]

The momentum encoder (with $m = 0.999$) provides slowly evolving, consistent negative examples. This made contrastive learning practical on a single GPU.
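The momentum update and the negative queue can be sketched as follows. This is a simplified illustration; as in the original implementation, the queue assumes its length is a multiple of the batch size:

```python
import torch

@torch.no_grad()
def momentum_update(query_encoder, key_encoder, m=0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q; no gradients flow."""
    for p_q, p_k in zip(query_encoder.parameters(), key_encoder.parameters()):
        p_k.mul_(m).add_(p_q, alpha=1 - m)

def update_queue(queue, new_keys, ptr):
    """Enqueue the newest keys, overwriting the oldest (circular buffer).
    Assumes queue length is a multiple of the batch size."""
    n = new_keys.size(0)
    queue[ptr:ptr + n] = new_keys
    return (ptr + n) % queue.size(0)
```

Because the key encoder changes slowly, keys stored many steps ago remain comparable to keys produced now, which is what makes the large queue of negatives usable.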
Bootstrap Your Own Latent (Grill et al.) showed that self-supervised representation learning works even without negative examples: an online network is trained to predict a target network's representation of a different augmented view of the same image.
This was surprising — without negatives, what prevents the model from collapsing to a constant representation? The asymmetry between the online and target networks, combined with the predictor head, prevents collapse.
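The BYOL objective itself is a simple regression between normalized vectors, with the crucial stop-gradient on the target branch; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def byol_loss(online_pred, target_proj):
    """Negative cosine similarity between the online network's prediction
    and the target network's projection (scaled to [0, 4]). The target
    branch is detached: no gradient flows into the target network."""
    p = F.normalize(online_pred, dim=-1)
    z = F.normalize(target_proj.detach(), dim=-1)  # stop-gradient
    return 2 - 2 * (p * z).sum(dim=-1).mean()
```

The detach plus the extra predictor head on the online branch is exactly the asymmetry described above; remove either and the two branches can collapse to a constant.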
Caron et al. combined self-supervised learning with knowledge distillation. A student network and a momentum-updated teacher network both process different augmented views of the same image. The student is trained to match the teacher’s output distribution.
DINO produced remarkable results: the self-attention maps of the Vision Transformer segment objects without any supervision, and the frozen features achieve strong ImageNet accuracy with a simple k-NN classifier.
He et al. brought BERT’s masking idea to vision: mask 75% of image patches and train a Vision Transformer to reconstruct them.
Key design decisions: an asymmetric encoder-decoder (the large encoder processes only the visible patches, while a lightweight decoder reconstructs the full image), a very high 75% masking ratio, and a simple pixel-reconstruction loss computed only on the masked patches.
MAE is remarkably efficient — processing only 25% of patches means 3–4× speedup — and produces excellent representations for downstream tasks.
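The per-sample random masking can be sketched with an argsort trick (a simplified version of the idea; tensor shapes are illustrative):

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """Keep a random 25% of patches per sample; return the visible patches
    and the indices needed to reassemble the sequence.
    patches: (batch, num_patches, dim)."""
    b, n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))
    noise = torch.rand(b, n)
    ids_shuffle = noise.argsort(dim=1)   # a random permutation per sample
    ids_keep = ids_shuffle[:, :n_keep]   # first n_keep slots are "visible"
    visible = torch.gather(
        patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    return visible, ids_keep
```

Only `visible` enters the encoder, which is where the 3-4x speedup comes from; the decoder later inserts learned mask tokens at the dropped positions.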
Bao et al. combined masked image modeling with discrete visual tokens from a pretrained VQVAE. Instead of predicting raw pixels, the model predicts visual token IDs, which proved more effective.
Baevski et al. applied self-supervised learning to speech with wav2vec 2.0: mask spans of the model's latent speech representations and train it, with a contrastive objective, to identify the correct quantized representation of the masked content. This produced speech representations that could be fine-tuned for speech recognition with very little labeled data: 10 minutes of labeled speech achieved near state-of-the-art results.
Radford et al. learned aligned vision-language representations from 400M image-text pairs:
\[\text{Loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[\log \frac{\exp(\text{sim}(v_i, t_i)/\tau)}{\sum_j \exp(\text{sim}(v_i, t_j)/\tau)} + \log \frac{\exp(\text{sim}(t_i, v_i)/\tau)}{\sum_j \exp(\text{sim}(t_i, v_j)/\tau)}\right]\]

CLIP matches images to their corresponding text descriptions and vice versa. The resulting representations enable zero-shot image classification (an image is classified by comparing it against text prompts such as "a photo of a dog"), cross-modal retrieval, and strong transfer to downstream vision tasks.
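The symmetric loss can be sketched directly from the formula (a minimal version; CLIP additionally learns the temperature as a trainable parameter):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings.
    Row i of each modality is the positive pair; other rows are negatives."""
    v = F.normalize(image_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature        # (N, N) cosine similarities
    targets = torch.arange(v.size(0), device=v.device)
    loss_i = F.cross_entropy(logits, targets)     # image -> text direction
    loss_t = F.cross_entropy(logits.T, targets)   # text -> image direction
    return (loss_i + loss_t) / 2
```

Each row of `logits` is a classification problem over the batch: which of the N texts belongs to this image, and symmetrically for texts.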
| Domain | Method | Pretext Task | Key Innovation |
|---|---|---|---|
| NLP | BERT | Masked LM | Bidirectional context |
| NLP | GPT | Next token prediction | Autoregressive scale |
| NLP | T5 | Span corruption | Unified text-to-text |
| Vision | SimCLR | Contrastive (augmentations) | Strong augmentations |
| Vision | MoCo | Contrastive (momentum) | Memory queue |
| Vision | MAE | Masked patch prediction | 75% masking, asymmetric |
| Audio | wav2vec 2.0 | Masked latents (contrastive) | Speech from 10 min of labels |
| Multimodal | CLIP | Image-text contrastive | Zero-shot transfer |
```python
# See code/ch11_contrastive.py for full pipeline
import torch
import torch.nn.functional as F

def contrastive_loss(z_i, z_j, temperature=0.5):
    """NT-Xent loss for a batch of positive pairs."""
    batch_size = z_i.size(0)
    z = torch.cat([z_i, z_j], dim=0)  # (2N, D)
    # Pairwise cosine similarity between all 2N embeddings
    sim = F.cosine_similarity(z.unsqueeze(1), z.unsqueeze(0), dim=2)
    sim = sim / temperature
    # Mask out self-similarity (the diagonal)
    mask = ~torch.eye(2 * batch_size, dtype=torch.bool, device=z.device)
    sim = sim.masked_select(mask).view(2 * batch_size, -1)
    # After dropping the diagonal, the positive for row i sits at
    # column i + N - 1 (first half) or column i (second half)
    labels = torch.arange(batch_size, device=z.device)
    labels = torch.cat([labels + batch_size - 1, labels])
    return F.cross_entropy(sim, labels)
```