Recall from Chapter 5: in sequence-to-sequence models, an LSTM encoder compresses an entire input sequence into a single fixed-size vector, and the decoder generates output from that vector. For short sentences, this works. For long sentences, critical information is inevitably lost.
The analogy: imagine summarizing a 500-page book into a single paragraph, and then asking someone to reconstruct the book from that paragraph. Some information will be lost, no matter how good the summarizer.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio proposed a simple but profound idea: instead of compressing the entire input into one vector, let the decoder look back at all encoder hidden states at each decoding step, paying more attention to the relevant parts.
At each decoding step $t$, the decoder state $s_t$ is scored against every encoder hidden state $h_i$:

\[e_{ti} = a(s_t, h_i), \qquad \alpha_{ti} = \frac{\exp(e_{ti})}{\sum_j \exp(e_{tj})}, \qquad c_t = \sum_i \alpha_{ti} h_i\]

where $a$ is a learned alignment function (a small neural network), $\alpha_{ti}$ are the attention weights, and $c_t$ is the context vector fed to the decoder.
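The steps above can be sketched with toy tensors. All dimensions and weight matrices here are illustrative stand-ins for learned parameters, not a trained model:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T_enc, d = 5, 8                        # encoder length, hidden size (toy values)
h = torch.randn(T_enc, d)              # encoder hidden states h_1 .. h_T
s_t = torch.randn(d)                   # current decoder state

# Additive alignment a(s_t, h_i) = v^T tanh(W [s_t; h_i]), with random W, v
W = torch.randn(d, 2 * d)
v = torch.randn(d)
e = torch.stack([v @ torch.tanh(W @ torch.cat([s_t, h_i])) for h_i in h])

alpha = F.softmax(e, dim=0)            # attention weights over encoder positions
c_t = (alpha.unsqueeze(1) * h).sum(0)  # context vector: weighted sum of the h_i
```

The decoder then consumes $c_t$ alongside its own state when predicting the next token.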
Luong et al. explored different ways to compute alignment scores:
| Variant | Score Function | Properties |
|---|---|---|
| Dot product | $e_{ti} = s_t^\top h_i$ | Simplest, no parameters |
| General | $e_{ti} = s_t^\top W h_i$ | Learned bilinear form |
| Concat | $e_{ti} = v^\top \tanh(W[s_t; h_i])$ | Bahdanau’s original |
The dot product is the simplest and fastest. It works when the encoder and decoder hidden states have the same dimension.
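The three score functions from the table can be compared on a single state pair. The weights below are random placeholders for what would be learned parameters:

```python
import torch

torch.manual_seed(0)
d = 8
s_t = torch.randn(d)                   # decoder state
h_i = torch.randn(d)                   # one encoder hidden state

# Dot product: no parameters at all
e_dot = s_t @ h_i

# General: learned bilinear form s_t^T W h_i
W = torch.randn(d, d)
e_general = s_t @ (W @ h_i)

# Concat (Bahdanau): small feed-forward scorer v^T tanh(W [s_t; h_i])
Wc = torch.randn(d, 2 * d)
v = torch.randn(d)
e_concat = v @ torch.tanh(Wc @ torch.cat([s_t, h_i]))
```

All three return a scalar score; only their parameter count and expressiveness differ.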
For high-dimensional vectors, dot products grow in magnitude, pushing softmax into saturated regions with near-zero gradients. The fix is to scale by the square root of the dimension:
\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V\]

where $Q$, $K$, and $V$ are matrices of query, key, and value vectors, and $d_k$ is the dimension of the keys.
The scaling factor $\frac{1}{\sqrt{d_k}}$ keeps the dot products in a range where softmax has meaningful gradients. This is a small detail with large practical impact.
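The saturation effect is easy to observe numerically. The sketch below draws random queries and keys at several dimensions and compares the softmax with and without scaling (toy sizes, random data):

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
results = {}
for d_k in (4, 64, 1024):
    q = torch.randn(d_k)
    K = torch.randn(10, d_k)            # 10 candidate keys
    raw = K @ q                         # unscaled scores: spread grows ~ sqrt(d_k)
    scaled = raw / math.sqrt(d_k)       # scaled scores: spread stays near 1
    results[d_k] = (raw.std().item(),
                    F.softmax(raw, dim=0).max().item(),
                    F.softmax(scaled, dim=0).max().item())

# At d_k = 1024 the unscaled softmax is nearly one-hot (saturated, tiny
# gradients), while the scaled version keeps a softer distribution.
```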
The next leap: instead of attending from decoder to encoder, let a sequence attend to itself. Each position computes attention over all positions in the same sequence.
For a sequence of token embeddings $X$:
\[Q = XW_Q, \quad K = XW_K, \quad V = XW_V\]

Each token generates a query (“what am I looking for?”), a key (“what do I represent?”), and a value (“what information do I carry?”). The attention output for each token is a weighted combination of all values, where the weights depend on the query-key compatibility.
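A single self-attention pass fits in a few lines. The projection matrices here are random stand-ins for learned weights, with toy sizes:

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d_model, d_k = 4, 16, 16            # sequence length, model dim, key dim
X = torch.randn(n, d_model)            # one sequence of token embeddings

W_Q = torch.randn(d_model, d_k)        # learned projections (random here)
W_K = torch.randn(d_model, d_k)
W_V = torch.randn(d_model, d_k)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V    # every token gets a query, key, value
weights = F.softmax(Q @ K.T / math.sqrt(d_k), dim=-1)  # n x n attention matrix
out = weights @ V                      # each row: weighted mix of all values
```

Every row of `weights` sums to 1, so each output position is a convex combination of all value vectors in the sequence.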
| Property | RNN | Self-Attention |
|---|---|---|
| Captures long-range dependencies | Poorly (gradient decay) | Directly (any two positions interact) |
| Parallelizable | No (sequential) | Yes (all positions computed in parallel) |
| Computational complexity per layer | $O(n \cdot d^2)$ | $O(n^2 \cdot d)$ |
| Path length between positions | $O(n)$ | $O(1)$ |
The trade-off: self-attention is $O(n^2)$ in sequence length, which becomes expensive for very long sequences. This limitation would motivate later work on efficient attention (Chapter 15).
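The quadratic growth is worth making concrete. Attention computes one score per pair of positions, so the score matrix has $n^2$ entries per head (illustrative arithmetic only):

```python
# Entries in the n x n attention score matrix, per head and per layer
costs = {n: n * n for n in (1_000, 8_000, 32_000)}
# 1,000 tokens -> 1e6 entries; 32,000 tokens -> ~1e9 entries.
# Doubling the sequence length quadruples the compute and memory.
```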
Instead of computing a single attention function, compute $h$ attention functions in parallel, each with different learned projections:
\[\text{head}_i = \text{Attention}(XW_Q^i, XW_K^i, XW_V^i)\]

\[\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W_O\]
Each head can learn to attend to a different type of relationship: for example, syntactic dependencies, nearby positions, or coreference links.
In the original Transformer, $d_{\text{model}} = 512$ with $h = 8$ heads, so each head operates in $d_k = 64$ dimensions.
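A direct, loop-based sketch of multi-head attention with the original sizes ($d_{\text{model}} = 512$, $h = 8$). The per-head projections are random placeholders for learned weights:

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d_model, h = 4, 512, 8
d_k = d_model // h                     # 64 dims per head, as in the original
X = torch.randn(n, d_model)

W_Q = torch.randn(h, d_model, d_k)     # one projection triple per head
W_K = torch.randn(h, d_model, d_k)
W_V = torch.randn(h, d_model, d_k)
W_O = torch.randn(d_model, d_model)    # output projection

heads = []
for i in range(h):
    Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]
    A = F.softmax(Q @ K.T / math.sqrt(d_k), dim=-1)
    heads.append(A @ V)                # each head: n x d_k

out = torch.cat(heads, dim=-1) @ W_O   # concat back to n x d_model
```

Production implementations batch the heads into a single tensor operation rather than looping, but the computation is the same.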
In 2017, Vaswani et al. from Google published the landmark paper “Attention Is All You Need,” which replaced recurrence entirely with attention. The Transformer architecture became the foundation for virtually all modern deep learning.
Output Probabilities
↑
Linear + Softmax
↑
┌── Decoder Block ──┐ × N
│ Masked Self-Attn │
│ Cross-Attention │
│ Feed-Forward │
│ (+ Residual + LN)│
└───────────────────┘
↑
┌── Encoder Block ──┐ × N
│ Self-Attention │
│ Feed-Forward │
│ (+ Residual + LN)│
└───────────────────┘
↑
Positional Encoding + Embeddings
↑
Input Tokens
Self-attention is permutation invariant — it doesn’t know the order of tokens. Positional encodings inject position information using sine and cosine functions of different frequencies:
\[PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)\]
The sinusoidal encoding allows the model to learn relative positions and generalize to sequence lengths not seen during training (in theory).
Modern models often use learned positional embeddings or Rotary Position Embeddings (RoPE), which encode relative rather than absolute positions.
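The sinusoidal scheme above can be sketched directly from the formula (the function name and sizes here are illustrative):

```python
import torch

def sinusoidal_pe(max_len, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(same)."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    i = torch.arange(0, d, 2, dtype=torch.float32)                 # even dims
    angles = pos / (10000 ** (i / d))                              # (max_len, d/2)
    pe = torch.zeros(max_len, d)
    pe[:, 0::2] = torch.sin(angles)    # even columns get sine
    pe[:, 1::2] = torch.cos(angles)    # odd columns get cosine
    return pe

pe = sinusoidal_pe(100, 16)            # one row per position, added to embeddings
```

Each dimension pair oscillates at a different frequency, so every position receives a unique, smoothly varying fingerprint.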
Each encoder block has two sub-layers, each wrapped with a residual connection and layer normalization: multi-head self-attention, followed by a position-wise feed-forward network (FFN).
The FFN operates on each position independently — it’s the “thinking” step after the “communication” step of attention.
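A sketch of the position-wise FFN with the original Transformer sizes ($d_{\text{model}} = 512$, inner dimension $d_{ff} = 2048$); the weights are untrained:

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),          # expand
    nn.ReLU(),
    nn.Linear(d_ff, d_model),          # project back
)

x = torch.randn(4, d_model)            # 4 positions in a sequence
y = ffn(x)                             # the same FFN is applied to every row
```

Because the two linear layers act row-wise, each position is transformed independently; no information moves between positions in this step.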
The decoder adds a third sub-layer: cross-attention, in which the decoder’s queries attend to the encoder’s output keys and values.
Causal mask for autoregressive generation:
\[\text{Mask}_{ij} = \begin{cases} 0 & \text{if } j \leq i \text{ (allowed)} \\ -\infty & \text{if } j > i \text{ (blocked)} \end{cases}\]

Adding $-\infty$ before softmax ensures zero attention weight to future tokens:
\[\text{softmax}(QK^\top / \sqrt{d_k} + \text{Mask})\]

The original Transformer uses post-norm: apply layer normalization after the residual addition.
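The causal mask and masked softmax described above can be built in a few lines with `torch.triu` (toy sizes, random queries and keys):

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d_k = 5, 8
Q = torch.randn(n, d_k)
K = torch.randn(n, d_k)

# 0 where attention is allowed (j <= i), -inf where blocked (j > i)
mask = torch.triu(torch.full((n, n), float('-inf')), diagonal=1)

weights = F.softmax(Q @ K.T / math.sqrt(d_k) + mask, dim=-1)
# exp(-inf) = 0, so row i carries exactly zero weight on positions j > i;
# the first row can only attend to itself.
```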
\[x_{l+1} = \text{LN}(x_l + \text{SubLayer}(x_l))\]

Pre-norm moves normalization before the sub-layer:
\[x_{l+1} = x_l + \text{SubLayer}(\text{LN}(x_l))\]

Pre-norm creates a cleaner residual pathway and is more stable for training very deep Transformers. It is the standard in GPT-2/3/4, LLaMA, and most modern LLMs.
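The two wrappers side by side, using a linear layer as an illustrative stand-in for the attention or FFN sub-layer:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 16
ln = nn.LayerNorm(d)
sublayer = nn.Linear(d, d)             # stand-in for attention or FFN
x = torch.randn(3, d)

post_norm = ln(x + sublayer(x))        # original Transformer: LN after residual
pre_norm = x + sublayer(ln(x))         # GPT-2/LLaMA style: LN before sub-layer
```

In the pre-norm form the identity path `x + ...` is never normalized away, which is what keeps gradients well-behaved in very deep stacks.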
The Transformer’s impact on deep learning cannot be overstated.
The Transformer didn’t just improve on RNNs — it became the universal architecture for deep learning.
```python
# See code/ch08_transformer.py for full implementation
import math

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    # Query-key compatibility scores, scaled by sqrt(d_k) so the softmax
    # stays in a region with useful gradients
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 get -inf, hence zero attention weight
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, V), weights
```