Chapter 11: Self-Supervised Learning — Learning Without Labels

The Labeling Bottleneck

Supervised learning requires labeled data, and labels are expensive. ImageNet took years of human annotation effort. Medical imaging datasets require annotation by expert physicians. Sentiment analysis needs human readers to judge each example. At internet scale, labeling even a fraction of the available data is impossible.

Yet the internet contains billions of images, trillions of words, and millions of hours of video. Self-supervised learning asks: can we learn powerful representations from this unlabeled data?

The answer, it turned out, was a resounding yes — and self-supervised pretraining became the engine behind the most powerful AI models ever built.

The Core Idea: Create Your Own Labels

Self-supervised learning generates supervisory signals from the data itself: hide part of the input and train the model to predict the hidden part. Mask a word and predict it from its context; remove image patches and reconstruct them.

No human labels required. The structure of the data provides the supervision.

Language: Masked Language Modeling

BERT (2018)

Devlin et al. introduced BERT (Bidirectional Encoder Representations from Transformers) with two pretraining objectives:

Masked Language Modeling (MLM)

Randomly mask 15% of tokens and train the model to predict them:

Input:  The [MASK] sat on the [MASK]
Target: The  cat   sat on the  mat

The model must use bidirectional context (both left and right) to fill in the blanks, learning deep contextual representations.

Details of the masking strategy: of the tokens selected for masking, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged.

This prevents the model from relying on the literal [MASK] token, which never appears during fine-tuning.
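
A minimal sketch of this procedure. The 15% selection rate and 80/10/10 split follow the BERT paper; `MASK_ID` and `VOCAB_SIZE` are illustrative placeholder values, not real tokenizer constants.

```python
# Illustrative sketch of BERT-style token masking. MASK_ID and VOCAB_SIZE
# are example values; a real tokenizer defines its own special-token ids.
import random

MASK_ID = 103        # placeholder id for the [MASK] token
VOCAB_SIZE = 30522   # size of BERT's WordPiece vocabulary

def mask_tokens(token_ids, mask_prob=0.15, seed=0):
    """Return (corrupted_ids, targets); targets is None where no loss applies."""
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tid in token_ids:
        if rng.random() < mask_prob:
            targets.append(tid)                # predict the original token here
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK_ID)                    # 80%: [MASK]
            elif r < 0.9:
                corrupted.append(rng.randrange(VOCAB_SIZE))  # 10%: random token
            else:
                corrupted.append(tid)                        # 10%: unchanged
        else:
            corrupted.append(tid)
            targets.append(None)               # no loss on unmasked positions
    return corrupted, targets
```

The loss is computed only at the ~15% of positions with a non-None target, so the model must infer those tokens from the surrounding context.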

Next Sentence Prediction (NSP)

Given two sentences, predict whether the second follows the first in the original text. Later work (RoBERTa) showed this objective is unnecessary and sometimes harmful.

BERT Architecture

BERT is a stack of Transformer encoder layers. The paper released two sizes: BERT-base (12 layers, hidden size 768, 110M parameters) and BERT-large (24 layers, hidden size 1024, 340M parameters).

BERT set new state-of-the-art results on 11 NLP tasks simultaneously, often by large margins.

RoBERTa (2019)

Liu et al. showed that BERT was significantly undertrained. Better training recipes included:

  1. Training longer on roughly ten times more data (160GB of text versus BERT's 16GB)
  2. Much larger batch sizes
  3. Dynamic masking, generating a new mask pattern on each pass rather than reusing a fixed one
  4. Dropping the NSP objective

RoBERTa improved over BERT on every benchmark without any architectural changes, highlighting the importance of training methodology.

Language: Autoregressive Language Modeling

GPT (2018)

While BERT masks tokens bidirectionally, GPT uses a simpler objective: predict the next token given all previous tokens (autoregressive language modeling):

\[P(x_1, x_2, ..., x_n) = \prod_{i=1}^{n} P(x_i | x_1, ..., x_{i-1})\]
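
In code, this factorization is just a per-position cross-entropy with targets shifted by one. A sketch, using random logits in place of a real model's outputs:

```python
# Sketch: the chain-rule factorization as a training loss. Random logits
# stand in for a real model; shapes are the only assumption made here.
import torch
import torch.nn.functional as F

def autoregressive_nll(logits, token_ids):
    """Average negative log-likelihood under next-token prediction.

    logits:    (seq_len, vocab_size), where logits[i] predicts token i+1
    token_ids: (seq_len,), the observed sequence
    """
    pred = logits[:-1]       # positions that predict a next token
    target = token_ids[1:]   # the next tokens actually observed
    return F.cross_entropy(pred, target)

torch.manual_seed(0)
logits = torch.randn(8, 50)              # toy sequence of 8 tokens, vocab of 50
tokens = torch.randint(0, 50, (8,))
loss = autoregressive_nll(logits, tokens)
```

Minimizing this loss maximizes the product of conditional probabilities in the equation above.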

GPT-2 (2019)

OpenAI scaled GPT to 1.5B parameters and trained it on WebText (40GB of internet text). GPT-2 demonstrated surprising zero-shot capabilities: prompted with nothing but natural-language text, it could summarize articles, translate between languages, and answer questions without any task-specific fine-tuning.

The Autoregressive vs. Masked Debate

| Feature | Autoregressive (GPT) | Masked (BERT) |
| --- | --- | --- |
| Training objective | Predict next token | Predict masked tokens |
| Context | Left-to-right only | Bidirectional |
| Generation | Natural (token by token) | Awkward (iterative refinement) |
| Understanding | Good, but unidirectional | Excellent |
| Best for | Generation, few-shot | Classification, NLU |

T5: Text-to-Text Transfer Transformer (2019)

Raffel et al. unified all NLP tasks into a single text-to-text format: every task, from translation to classification, is cast as feeding text in and generating text out, with a task prefix such as "translate English to German:" or "summarize:".

T5 used a span corruption pretraining objective — masking contiguous spans of tokens rather than individual tokens. This was a generalization of BERT’s MLM that proved more effective.
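
The span-corruption objective can be sketched as follows. The helper `corrupt_spans` and its (start, length) span format are illustrative, but the sentinel naming (`<extra_id_0>`, `<extra_id_1>`, …) follows T5's tokenizer:

```python
# Hedged sketch of T5-style span corruption: contiguous spans are replaced
# with sentinel markers, and the target reproduces each span after its sentinel.
def corrupt_spans(tokens, spans):
    """tokens: list of strings; spans: list of (start, length) to mask out."""
    corrupted, target = [], []
    pos = 0
    for i, (start, length) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"
        corrupted.extend(tokens[pos:start])   # keep text before the span
        corrupted.append(sentinel)            # replace the span with a sentinel
        target.append(sentinel)               # target lists each span in order
        target.extend(tokens[start:start + length])
        pos = start + length
    corrupted.extend(tokens[pos:])
    return corrupted, target

inp, tgt = corrupt_spans("Thank you for inviting me to your party".split(),
                         [(1, 2), (6, 1)])
# inp: ['Thank', '<extra_id_0>', 'inviting', 'me', 'to', '<extra_id_1>', 'party']
# tgt: ['<extra_id_0>', 'you', 'for', '<extra_id_1>', 'your']
```

Masking spans rather than single tokens forces the decoder to generate multi-token completions, which is closer to downstream generation tasks.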

Vision: Contrastive Learning

SimCLR (2020)

Chen et al. introduced a simple contrastive framework for visual self-supervised learning:

  1. Take an image $x$
  2. Apply two random augmentations to get $x_i$ and $x_j$
  3. Encode both through a CNN: $h_i = f(x_i)$, $h_j = f(x_j)$
  4. Project to a lower-dimensional space: $z_i = g(h_i)$, $z_j = g(h_j)$
  5. Train the model to recognize that $z_i$ and $z_j$ are from the same image

The contrastive loss (NT-Xent):

\[\ell_{i,j} = -\log \frac{\exp(\text{sim}(z_i, z_j) / \tau)}{\sum_{k \neq i} \exp(\text{sim}(z_i, z_k) / \tau)}\]

where $\text{sim}(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|}$ is cosine similarity and $\tau$ is a temperature parameter.

Key insight: The quality of augmentations matters enormously. Strong augmentations (random crop, color distortion, Gaussian blur) force the model to learn semantic features rather than trivial shortcuts.

MoCo: Momentum Contrast (2019)

He et al. addressed contrastive learning's appetite for many negative examples (which SimCLR satisfies with very large batches) by maintaining a queue of negatives encoded by a momentum-updated key encoder:

\[\theta_k \leftarrow m \cdot \theta_k + (1 - m) \cdot \theta_q\]

The momentum encoder (with $m = 0.999$) provides slowly evolving, consistent negative examples. This made contrastive learning practical on a single GPU.
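
The momentum update above is a one-line exponential moving average over parameters. A sketch with toy encoders standing in for the real networks:

```python
# Sketch of MoCo's momentum update: the key encoder's parameters track an
# exponential moving average of the query encoder's. Linear "encoders" are
# stand-ins for real backbones.
import torch
import torch.nn as nn

def momentum_update(encoder_q, encoder_k, m=0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q, with no gradient."""
    with torch.no_grad():
        for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
            p_k.mul_(m).add_(p_q, alpha=1 - m)

torch.manual_seed(0)
encoder_q = nn.Linear(4, 4)
encoder_k = nn.Linear(4, 4)
momentum_update(encoder_q, encoder_k)
```

Only the query encoder receives gradients; the key encoder drifts slowly behind it, which keeps the queued negatives mutually consistent.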

BYOL and Non-Contrastive Methods (2020)

Bootstrap Your Own Latent (Grill et al.) showed that this family of methods works even without negative examples: an online network is trained to predict the representation that a momentum-updated target network produces for a different augmented view of the same image.

This was surprising — without negatives, what prevents the model from collapsing to a constant representation? The asymmetry between the online and target networks, combined with the predictor head, prevents collapse.
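
A sketch of BYOL's regression loss, under the assumption that the inputs are the online predictor's output and the target network's projection for the other view; both are L2-normalized, and the target side receives no gradient:

```python
# Sketch of the BYOL loss: 2 - 2 * cosine similarity between the online
# prediction and the (stop-gradient) target projection. Random tensors
# stand in for real network outputs.
import torch
import torch.nn.functional as F

def byol_loss(online_prediction, target_projection):
    """Normalized regression loss; detach() implements the stop-gradient."""
    p = F.normalize(online_prediction, dim=-1)
    z = F.normalize(target_projection.detach(), dim=-1)
    return (2 - 2 * (p * z).sum(dim=-1)).mean()

torch.manual_seed(0)
loss = byol_loss(torch.randn(8, 32), torch.randn(8, 32))
```

The loss is zero only when prediction and target align exactly, and the stop-gradient plus predictor asymmetry is what keeps the trivial constant solution unstable.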

DINO: Self-Distillation (2021)

Caron et al. combined self-supervised learning with knowledge distillation. A student network and a momentum-updated teacher network both process different augmented views of the same image. The student is trained to match the teacher’s output distribution.
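
A sketch of the DINO objective: the teacher's output is centered and sharpened with a low temperature, then used as a soft target for the student. The temperatures and centering follow the paper; the random logits here are stand-ins for the two networks' outputs.

```python
# Sketch of the DINO self-distillation loss: cross-entropy from the centered,
# sharpened teacher distribution to the student distribution, with no
# gradient flowing to the teacher.
import torch
import torch.nn.functional as F

def dino_loss(student_logits, teacher_logits, center, t_student=0.1, t_teacher=0.04):
    """Soft cross-entropy between teacher and student output distributions."""
    teacher = F.softmax((teacher_logits - center) / t_teacher, dim=-1).detach()
    log_student = F.log_softmax(student_logits / t_student, dim=-1)
    return -(teacher * log_student).sum(dim=-1).mean()

torch.manual_seed(0)
student = torch.randn(8, 100)   # 8 views, 100 output dimensions
teacher = torch.randn(8, 100)
center = torch.zeros(100)       # in DINO, an EMA of past teacher outputs
loss = dino_loss(student, teacher, center)
```

Centering and sharpening pull in opposite directions, and their balance is what prevents the outputs from collapsing to a single mode or a uniform distribution.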

DINO produced remarkable results: the Vision Transformer's self-attention maps segment objects without any supervision, and the frozen features perform strongly under k-NN and linear-probe classification on ImageNet.

Vision: Masked Image Modeling

MAE: Masked Autoencoders (2021)

He et al. brought BERT’s masking idea to vision: mask 75% of image patches and train a Vision Transformer to reconstruct them.

Key design decisions:

  1. The encoder sees only the visible patches; mask tokens are introduced only at the decoder, keeping the encoder cheap
  2. A lightweight decoder reconstructs raw pixel values for the masked patches and is discarded after pretraining
  3. The very high mask ratio (75%) makes reconstruction impossible from local interpolation alone, forcing semantic features

MAE is remarkably efficient — processing only 25% of patches means 3–4× speedup — and produces excellent representations for downstream tasks.
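
The random masking itself is simple. A sketch that keeps a random quarter of a (num_patches, dim) patch sequence; in the real model the kept indices also drive un-shuffling before the decoder:

```python
# Sketch of MAE's 75% random patch masking. The patch tensor shape is the
# only assumption; a real ViT would first embed image patches to this form.
import torch

def random_masking(patches, mask_ratio=0.75):
    """Keep a random (1 - mask_ratio) fraction of patches and their indices."""
    num_patches = patches.size(0)
    num_keep = int(num_patches * (1 - mask_ratio))
    keep = torch.randperm(num_patches)[:num_keep]
    return patches[keep], keep

torch.manual_seed(0)
patches = torch.randn(196, 768)          # 14 x 14 grid of ViT patch embeddings
visible, keep_idx = random_masking(patches)
```

Only `visible` enters the encoder, which is where the 3-4x speedup comes from.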

BEiT (2021)

Bao et al. combined masked image modeling with discrete visual tokens from the tokenizer of a pretrained discrete VAE (the dVAE from DALL-E). Instead of predicting raw pixels, the model predicts visual token IDs, which proved more effective in their experiments.

Audio and Multimodal Self-Supervised Learning

wav2vec 2.0 (2020)

Baevski et al. applied masked contrastive learning to raw speech: a convolutional encoder maps the waveform to latent representations, spans of these latents are masked, and a Transformer is trained to identify the correct quantized latent for each masked step among distractors. This produced speech representations that could be fine-tuned for speech recognition with very little labeled data: with just 10 minutes of labeled speech, wav2vec 2.0 achieved near state-of-the-art results.

CLIP: Contrastive Language-Image Pretraining (2021)

Radford et al. learned aligned vision-language representations from 400M image-text pairs:

\[\text{Loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[\log \frac{\exp(\text{sim}(v_i, t_i)/\tau)}{\sum_j \exp(\text{sim}(v_i, t_j)/\tau)} + \log \frac{\exp(\text{sim}(t_i, v_i)/\tau)}{\sum_j \exp(\text{sim}(t_i, v_j)/\tau)}\right]\]
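
The symmetric loss above can be sketched directly: each image embedding is scored against every text embedding in the batch, and the matching index serves as the label. Random embeddings stand in for the encoders' outputs.

```python
# Sketch of the symmetric CLIP loss: average of image-to-text and
# text-to-image InfoNCE over an N x N similarity matrix.
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Pair i in the batch is the positive for row/column i."""
    v = F.normalize(image_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                 # N x N cosine similarities
    labels = torch.arange(v.size(0))               # image i matches caption i
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

torch.manual_seed(0)
loss = clip_loss(torch.randn(16, 64), torch.randn(16, 64))
```

Every other caption in the batch acts as a negative for each image, and vice versa, so large batches directly improve the contrastive signal.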

CLIP matches images to their corresponding text descriptions and vice versa. The resulting representations enable:

  1. Zero-shot image classification, by scoring an image against text prompts such as "a photo of a dog"
  2. Image-text retrieval in both directions
  3. Text encoders that guide downstream generative models

The Self-Supervised Learning Landscape

| Domain | Method | Pretext Task | Key Innovation |
| --- | --- | --- | --- |
| NLP | BERT | Masked LM | Bidirectional context |
| NLP | GPT | Next token prediction | Autoregressive scale |
| NLP | T5 | Span corruption | Unified text-to-text |
| Vision | SimCLR | Contrastive (augmentations) | Strong augmentations |
| Vision | MoCo | Contrastive (momentum) | Memory queue |
| Vision | MAE | Masked patch prediction | 75% masking, asymmetric |
| Audio | wav2vec 2.0 | Masked latent prediction | Speech from 10 min of labels |
| Multimodal | CLIP | Image-text contrastive | Zero-shot transfer |

Code Example: Contrastive Loss (SimCLR Style)

# See code/ch11_contrastive.py for full pipeline
import torch
import torch.nn.functional as F

def contrastive_loss(z_i, z_j, temperature=0.5):
    """NT-Xent loss for a batch of positive pairs."""
    batch_size = z_i.size(0)
    z = torch.cat([z_i, z_j], dim=0)  # 2N × D
    sim = F.cosine_similarity(z.unsqueeze(1), z.unsqueeze(0), dim=2)
    sim = sim / temperature
    # Mask out self-similarity, leaving 2N - 1 candidates per row
    mask = ~torch.eye(2 * batch_size, dtype=torch.bool, device=z.device)
    sim = sim.masked_select(mask).view(2 * batch_size, -1)
    # Each row's positive is its other view. For the first half of the batch
    # the positive sits past the removed diagonal entry, so its column index
    # shifts left by one, hence the -1.
    labels = torch.arange(batch_size, device=z.device)
    labels = torch.cat([labels + batch_size - 1, labels])
    return F.cross_entropy(sim, labels)

Key Takeaways

  1. Self-supervised learning creates its own labels by hiding part of the input and training the model to predict the hidden part
  2. In language, masked objectives (BERT) excel at understanding while autoregressive objectives (GPT) make generation natural
  3. In vision, contrastive methods (SimCLR, MoCo), non-contrastive methods (BYOL, DINO), and masked image modeling (MAE, BEiT) all learn strong representations without labels
  4. The same recipe transfers to audio (wav2vec 2.0) and to multimodal data (CLIP), where it enables zero-shot transfer

Previous Chapter: Transfer Learning and Foundation Models

Next Chapter: Scaling Laws and Large Language Models

Back to Table of Contents