Supervised learning requires labeled data, and labels are expensive. ImageNet took years of human annotation effort. Medical image datasets require expert doctors. Sentiment analysis needs human readers. At internet scale, labeling even a fraction of available data is impossible.
Yet the internet contains billions of images, trillions of words, and millions of hours of video. Self-supervised learning asks: can we learn powerful representations from this unlabeled data?
The answer, it turned out, was a resounding yes — and self-supervised pretraining became the engine behind the most powerful AI models ever built.
Self-supervised learning generates supervisory signals from the data itself by hiding part of the input and training the model to predict the hidden part.
No human labels required. The structure of the data provides the supervision.
Devlin et al. introduced BERT (Bidirectional Encoder Representations from Transformers) with two pretraining objectives: masked language modeling (MLM) and next-sentence prediction (NSP).
The first objective randomly masks 15% of tokens and trains the model to predict them:
```
Input:  The [MASK] sat on the [MASK]
Target: The cat sat on the mat
```
The model must use bidirectional context (both left and right) to fill in the blanks, learning deep contextual representations.
Details of the masking strategy: of the selected tokens, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged. This prevents the model from learning representations only for [MASK] tokens, which never appear at fine-tuning time.
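The 80/10/10 scheme can be sketched in a few lines of PyTorch. This is a minimal illustration; `mask_token_id` and `vocab_size` are placeholders for whatever tokenizer is in use:

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """BERT-style masking: select ~15% of positions; of those,
    80% -> [MASK], 10% -> random token, 10% -> unchanged."""
    labels = input_ids.clone()
    # Choose which positions participate in the MLM loss
    selected = torch.rand(input_ids.shape) < mlm_prob
    labels[~selected] = -100  # ignored by F.cross_entropy by default
    masked_ids = input_ids.clone()
    # 80% of selected positions become [MASK]
    mask_it = selected & (torch.rand(input_ids.shape) < 0.8)
    masked_ids[mask_it] = mask_token_id
    # Half of the remaining 20% become a random token
    rand_it = selected & ~mask_it & (torch.rand(input_ids.shape) < 0.5)
    masked_ids[rand_it] = torch.randint(vocab_size, input_ids.shape)[rand_it]
    # The rest (10% of selected) keep their original token
    return masked_ids, labels
```

Setting the label to -100 at unselected positions means the loss is computed only on the 15% of tokens the model must reconstruct.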
The second objective, next-sentence prediction: given two sentences, predict whether the second follows the first in the original text. Later work (RoBERTa) showed this objective is unnecessary and sometimes harmful.
BERT set new state-of-the-art results on 11 NLP tasks simultaneously, often by large margins.
Liu et al. showed that BERT was significantly undertrained. With better training recipes (more data, longer training, larger batches, dynamic masking, and dropping next-sentence prediction), RoBERTa improved over BERT on every benchmark without any architectural changes, highlighting the importance of training methodology.
While BERT masks tokens bidirectionally, GPT uses a simpler objective: predict the next token given all previous tokens (autoregressive language modeling):
\[P(x_1, x_2, ..., x_n) = \prod_{i=1}^{n} P(x_i | x_1, ..., x_{i-1})\]

OpenAI scaled GPT to 1.5B parameters (GPT-2) and trained it on WebText (40GB of internet text). GPT-2 demonstrated surprising zero-shot capabilities.
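Training on this factorization reduces to a shifted cross-entropy: each position's logits are scored against the token that follows it. A minimal sketch (tensor names are illustrative, not from any actual GPT codebase):

```python
import torch
import torch.nn.functional as F

def autoregressive_loss(logits, input_ids):
    """logits: (batch, seq_len, vocab); position i predicts token i+1."""
    # Drop the last prediction (nothing follows the final token)
    # and the first target (nothing predicts the first token).
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
    target = input_ids[:, 1:].reshape(-1)
    return F.cross_entropy(pred, target)
```

Note that every position contributes a training signal, which is one reason autoregressive pretraining uses data efficiently.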
| Feature | Autoregressive (GPT) | Masked (BERT) |
|---|---|---|
| Training objective | Predict next token | Predict masked tokens |
| Context | Left-to-right only | Bidirectional |
| Generation | Natural (token by token) | Awkward (iterative refinement) |
| Understanding | Good, but uni-directional | Excellent |
| Best for | Generation, few-shot | Classification, NLU |
Raffel et al. unified all NLP tasks into a single text-to-text format: every task, from translation to classification, is cast as feeding the model text and training it to generate text (e.g., input "translate English to German: That is good.", target "Das ist gut.").
T5 used a span corruption pretraining objective — masking contiguous spans of tokens rather than individual tokens. This was a generalization of BERT’s MLM that proved more effective.
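Span corruption can be sketched on plain token lists. The sentinel names follow T5's `<extra_id_n>` convention, but as a simplification the corrupted spans here are fixed by hand rather than sampled:

```python
def span_corrupt(tokens, spans):
    """T5-style span corruption: replace each (start, length) span with a
    sentinel token; the target lists each sentinel followed by the tokens
    it removed. T5 samples spans so that ~15% of tokens are corrupted."""
    source, target, pos = [], [], 0
    for i, (start, length) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        source.extend(tokens[pos:start])   # keep uncorrupted prefix
        source.append(sentinel)            # stand-in for the removed span
        target.append(sentinel)
        target.extend(tokens[start:start + length])
        pos = start + length
    source.extend(tokens[pos:])
    return source, target
```

For "Thank you for inviting me to your party last week" with spans covering "you for" and "your", the source becomes "Thank <extra_id_0> inviting me to <extra_id_1> party last week" and the target "<extra_id_0> you for <extra_id_1> your".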
Chen et al. introduced SimCLR, a simple contrastive framework for visual self-supervised learning: create two augmented views of each image, encode both, and train the network to pull representations of the same image together while pushing apart representations of different images.
The contrastive loss (NT-Xent):
\[\ell_{i,j} = -\log \frac{\exp(\text{sim}(z_i, z_j) / \tau)}{\sum_{k \neq i} \exp(\text{sim}(z_i, z_k) / \tau)}\]

where $\text{sim}(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|}$ is cosine similarity and $\tau$ is a temperature parameter.
Key insight: The quality of augmentations matters enormously. Strong augmentations (random crop, color distortion, Gaussian blur) force the model to learn semantic features rather than trivial shortcuts.
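Generating two views can be sketched in pure PyTorch. This uses a random crop, a horizontal flip, and a per-channel scaling standing in for color jitter; the crop size and distortion strengths are arbitrary choices, not SimCLR's exact settings:

```python
import torch

def two_views(img, crop=24):
    """Return two independently augmented views of one image (C, H, W)."""
    def augment(x):
        c, h, w = x.shape
        top = torch.randint(0, h - crop + 1, (1,)).item()
        left = torch.randint(0, w - crop + 1, (1,)).item()
        x = x[:, top:top + crop, left:left + crop]   # random crop
        if torch.rand(1) < 0.5:
            x = torch.flip(x, dims=[2])              # horizontal flip
        scale = 0.6 + 0.8 * torch.rand(c, 1, 1)      # crude color distortion
        return (x * scale).clamp(0, 1)
    return augment(img), augment(img)
```

In practice the torchvision transforms (RandomResizedCrop, ColorJitter, GaussianBlur) are used instead; the point is that each call draws fresh random parameters, so the two views differ.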
He et al. addressed SimCLR’s requirement for very large batch sizes by maintaining a queue of negative examples and a momentum encoder:
\[\theta_k \leftarrow m \cdot \theta_k + (1 - m) \cdot \theta_q\]

The momentum encoder (with $m = 0.999$) provides slowly evolving, consistent negative examples. This made contrastive learning practical on a single GPU.
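The momentum update and the negative queue can be sketched as follows. This is a simplified illustration; as in the original implementation, the queue assumes its length is a multiple of the batch size:

```python
import torch

@torch.no_grad()
def momentum_update(query_encoder, key_encoder, m=0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q; no gradients flow."""
    for p_q, p_k in zip(query_encoder.parameters(), key_encoder.parameters()):
        p_k.mul_(m).add_(p_q, alpha=1 - m)

def update_queue(queue, new_keys, ptr):
    """Enqueue the newest keys, overwriting the oldest (circular buffer).
    Assumes queue length is a multiple of the batch size."""
    n = new_keys.size(0)
    queue[ptr:ptr + n] = new_keys
    return (ptr + n) % queue.size(0)
```

Because the key encoder changes slowly, keys stored many steps ago remain comparable to keys produced now, which is what makes the large queue of negatives usable.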
Bootstrap Your Own Latent (Grill et al.) showed that self-supervised representation learning works even without negative examples: an online network is trained to predict a target network's representation of a different augmented view of the same image.
This was surprising — without negatives, what prevents the model from collapsing to a constant representation? The asymmetry between the online and target networks, combined with the predictor head, prevents collapse.
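The BYOL objective itself is a simple regression between normalized vectors, with the crucial stop-gradient on the target branch; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def byol_loss(online_pred, target_proj):
    """Negative cosine similarity between the online network's prediction
    and the target network's projection (scaled to [0, 4]). The target
    branch is detached: no gradient flows into the target network."""
    p = F.normalize(online_pred, dim=-1)
    z = F.normalize(target_proj.detach(), dim=-1)  # stop-gradient
    return 2 - 2 * (p * z).sum(dim=-1).mean()
```

The detach plus the extra predictor head on the online branch is exactly the asymmetry described above; remove either and the two branches can collapse to a constant.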
Caron et al. combined self-supervised learning with knowledge distillation. A student network and a momentum-updated teacher network both process different augmented views of the same image. The student is trained to match the teacher’s output distribution.
DINO produced remarkable results: the self-attention maps of the Vision Transformer segment objects without any supervision, and the frozen features achieve strong ImageNet accuracy with a simple k-NN classifier.
He et al. brought BERT’s masking idea to vision: mask 75% of image patches and train a Vision Transformer to reconstruct them.
Key design decisions: an asymmetric encoder-decoder (the large encoder processes only the visible patches, while a lightweight decoder reconstructs the full image), a very high 75% masking ratio, and a simple pixel-reconstruction loss computed only on the masked patches.
MAE is remarkably efficient — processing only 25% of patches means 3–4× speedup — and produces excellent representations for downstream tasks.
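The per-sample random masking can be sketched with an argsort trick (a simplified version of the idea; tensor shapes are illustrative):

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """Keep a random 25% of patches per sample; return the visible patches
    and the indices needed to reassemble the sequence.
    patches: (batch, num_patches, dim)."""
    b, n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))
    noise = torch.rand(b, n)
    ids_shuffle = noise.argsort(dim=1)   # a random permutation per sample
    ids_keep = ids_shuffle[:, :n_keep]   # first n_keep slots are "visible"
    visible = torch.gather(
        patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    return visible, ids_keep
```

Only `visible` enters the encoder, which is where the 3-4x speedup comes from; the decoder later inserts learned mask tokens at the dropped positions.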
Bao et al. combined masked image modeling with discrete visual tokens from a pretrained VQVAE. Instead of predicting raw pixels, the model predicts visual token IDs, which proved more effective.
Baevski et al. applied self-supervised learning to speech with wav2vec 2.0: mask spans of the model's latent speech representations and train it, with a contrastive objective, to identify the correct quantized representation of the masked content. This produced speech representations that could be fine-tuned for speech recognition with very little labeled data: 10 minutes of labeled speech achieved near state-of-the-art results.
Radford et al. learned aligned vision-language representations from 400M image-text pairs:
\[\text{Loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[\log \frac{\exp(\text{sim}(v_i, t_i)/\tau)}{\sum_j \exp(\text{sim}(v_i, t_j)/\tau)} + \log \frac{\exp(\text{sim}(t_i, v_i)/\tau)}{\sum_j \exp(\text{sim}(t_i, v_j)/\tau)}\right]\]

CLIP matches images to their corresponding text descriptions and vice versa. The resulting representations enable zero-shot image classification (an image is classified by comparing it against text prompts such as "a photo of a dog"), cross-modal retrieval, and strong transfer to downstream vision tasks.
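The symmetric loss can be sketched directly from the formula (a minimal version; CLIP additionally learns the temperature as a trainable parameter):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings.
    Row i of each modality is the positive pair; other rows are negatives."""
    v = F.normalize(image_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature        # (N, N) cosine similarities
    targets = torch.arange(v.size(0), device=v.device)
    loss_i = F.cross_entropy(logits, targets)     # image -> text direction
    loss_t = F.cross_entropy(logits.T, targets)   # text -> image direction
    return (loss_i + loss_t) / 2
```

Each row of `logits` is a classification problem over the batch: which of the N texts belongs to this image, and symmetrically for texts.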
| Domain | Method | Pretext Task | Key Innovation |
|---|---|---|---|
| NLP | BERT | Masked LM | Bidirectional context |
| NLP | GPT | Next token prediction | Autoregressive scale |
| NLP | T5 | Span corruption | Unified text-to-text |
| Vision | SimCLR | Contrastive (augmentations) | Strong augmentations |
| Vision | MoCo | Contrastive (momentum) | Memory queue |
| Vision | MAE | Masked patch prediction | 75% masking, asymmetric |
| Audio | wav2vec 2.0 | Masked latents (contrastive) | Speech from 10 min of labels |
| Multimodal | CLIP | Image-text contrastive | Zero-shot transfer |
```python
# See code/ch11_contrastive.py for full pipeline
import torch
import torch.nn.functional as F

def contrastive_loss(z_i, z_j, temperature=0.5):
    """NT-Xent loss for a batch of positive pairs."""
    batch_size = z_i.size(0)
    z = torch.cat([z_i, z_j], dim=0)  # (2N, D)
    # Pairwise cosine similarity between all 2N embeddings
    sim = F.cosine_similarity(z.unsqueeze(1), z.unsqueeze(0), dim=2)
    sim = sim / temperature
    # Mask out self-similarity (the diagonal)
    mask = ~torch.eye(2 * batch_size, dtype=torch.bool, device=z.device)
    sim = sim.masked_select(mask).view(2 * batch_size, -1)
    # After dropping the diagonal, the positive for row i sits at
    # column i + N - 1 (first half) or column i (second half)
    labels = torch.arange(batch_size, device=z.device)
    labels = torch.cat([labels + batch_size - 1, labels])
    return F.cross_entropy(sim, labels)
```