Chapter 8: Attention Mechanisms and the Transformer (2014–2017)

The Bottleneck That Needed Breaking

Recall from Chapter 5: in sequence-to-sequence models, an LSTM encoder compresses an entire input sequence into a single fixed-size vector, and the decoder generates output from that vector. For short sentences, this works. For long sentences, critical information is inevitably lost.

The analogy: imagine summarizing a 500-page book into a single paragraph, and then asking someone to reconstruct the book from that paragraph. Some information will be lost, no matter how good the summarizer.

Bahdanau Attention: The First Fix (2014)

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio proposed a simple but profound idea: instead of compressing the entire input into one vector, let the decoder look back at all encoder hidden states at each decoding step, paying more attention to the relevant parts.

The Mechanism

At each decoding step $t$:

  1. Compute alignment scores between the previous decoder state $s_{t-1}$ and each encoder hidden state $h_i$:
\[e_{ti} = a(s_{t-1}, h_i)\]

where $a$ is a learned alignment function (a small neural network).

  2. Normalize to get attention weights:
\[\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_j \exp(e_{tj})} = \text{softmax}(e_t)_i\]
  3. Compute the context vector as a weighted sum of encoder states:
\[c_t = \sum_{i} \alpha_{ti} h_i\]
  4. Use the context for decoding:
\[s_t = \text{RNN}(s_{t-1}, [y_{t-1}; c_t])\]
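The four steps above can be sketched in PyTorch. This is a minimal illustration of additive (Bahdanau-style) attention; the class name and dimension arguments are assumptions, not from the original paper:

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Bahdanau-style attention: e_ti = v^T tanh(W_s s_{t-1} + W_h h_i)."""
    def __init__(self, dec_dim, enc_dim, attn_dim):
        super().__init__()
        self.W_s = nn.Linear(dec_dim, attn_dim, bias=False)
        self.W_h = nn.Linear(enc_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, s_prev, enc_states):
        # s_prev: (batch, dec_dim); enc_states: (batch, src_len, enc_dim)
        scores = self.v(torch.tanh(
            self.W_s(s_prev).unsqueeze(1) + self.W_h(enc_states)))  # (batch, src_len, 1)
        alpha = torch.softmax(scores.squeeze(-1), dim=-1)           # attention weights
        context = torch.bmm(alpha.unsqueeze(1), enc_states).squeeze(1)  # c_t
        return context, alpha
```

The attention weights `alpha` sum to 1 over the source positions, and `context` has the encoder's hidden dimension regardless of source length.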

Why This Was Revolutionary

Three things changed at once:

  1. The fixed-vector bottleneck disappeared: translation quality no longer degraded sharply with sentence length
  2. The attention weights $\alpha_{ti}$ are interpretable; visualized as a matrix, they reveal soft word alignments between source and target, learned without any alignment supervision
  3. Attention separated “where to look” from “what to compute”, a distinction that every subsequent attention mechanism builds on

Luong Attention Variants (2015)

Luong et al. explored different ways to compute alignment scores:

| Variant | Score Function | Properties |
|---|---|---|
| Dot product | $e_{ti} = s_t^\top h_i$ | Simplest, no parameters |
| General | $e_{ti} = s_t^\top W h_i$ | Learned bilinear form |
| Concat | $e_{ti} = v^\top \tanh(W[s_t; h_i])$ | Bahdanau’s original |

The dot product is the simplest and fastest. It works when the encoder and decoder hidden states have the same dimension.
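The three score functions can be written directly. This is a sketch in single-vector form; the parameter shapes are illustrative:

```python
import torch

def dot_score(s_t, h_i):
    # Simplest variant: requires matching dimensions, no learned parameters
    return torch.dot(s_t, h_i)

def general_score(s_t, h_i, W):
    # Learned bilinear form: s_t^T W h_i
    return torch.dot(s_t, W @ h_i)

def concat_score(s_t, h_i, W, v):
    # Bahdanau's original: v^T tanh(W [s_t; h_i])
    return torch.dot(v, torch.tanh(W @ torch.cat([s_t, h_i])))
```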

Scaled Dot-Product Attention

For high-dimensional vectors, dot products grow in magnitude, pushing softmax into saturated regions with near-zero gradients. The fix is to scale by the square root of the dimension:

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V\]

where:

- $Q$ (queries), $K$ (keys), and $V$ (values) are matrices packing the query, key, and value vectors as rows
- $d_k$ is the dimension of the keys (and queries)
The scaling factor $\frac{1}{\sqrt{d_k}}$ keeps the dot products in a range where softmax has meaningful gradients. This is a small detail with large practical impact.
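The effect is easy to check numerically: for random vectors with unit-variance entries, dot products have variance $d_k$, so dividing by $\sqrt{d_k}$ restores unit variance regardless of dimension. A quick sanity check (not from the original text):

```python
import torch

torch.manual_seed(0)
d_k = 512
q = torch.randn(10000, d_k)  # 10k random query vectors
k = torch.randn(10000, d_k)  # 10k random key vectors

raw = (q * k).sum(-1)        # unscaled dot products
scaled = raw / d_k ** 0.5    # scaled by sqrt(d_k)

print(raw.std())     # roughly sqrt(512) ~ 22.6: softmax would saturate
print(scaled.std())  # roughly 1.0: softmax keeps meaningful gradients
```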

Self-Attention: Attending to Yourself (2016–2017)

The next leap: instead of attending from decoder to encoder, let a sequence attend to itself. Each position computes attention over all positions in the same sequence.

For a sequence of token embeddings $X$:

\[Q = XW_Q, \quad K = XW_K, \quad V = XW_V\]

Each token generates a query (“what am I looking for?”), a key (“what do I represent?”), and a value (“what information do I carry?”). The attention output for each token is a weighted combination of all values, where the weights depend on the query-key compatibility.
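In code, the three projections are plain matrix multiplications followed by scaled dot-product attention. A sketch with illustrative sizes:

```python
import torch
import torch.nn.functional as F

def self_attention(X, W_Q, W_K, W_V):
    # X: (seq_len, d_model); each token projects to a query, key, and value
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.size(-1)
    weights = F.softmax(Q @ K.T / d_k ** 0.5, dim=-1)  # (seq_len, seq_len)
    return weights @ V  # each output is a weighted mix of all values

X = torch.randn(6, 32)                             # 6 tokens, d_model = 32
W_Q, W_K, W_V = (torch.randn(32, 16) for _ in range(3))
out = self_attention(X, W_Q, W_K, W_V)             # (6, 16)
```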

What Self-Attention Captures

Inspecting trained attention weights reveals strikingly linguistic structure: pronouns attending to their antecedents, verbs attending to their subjects and objects, and tokens attending to semantically related words elsewhere in the sentence. None of this is supervised; it emerges from the training objective alone.
Advantages Over RNNs

| Property | RNN | Self-Attention |
|---|---|---|
| Captures long-range dependencies | Poorly (gradient decay) | Directly (any two positions interact) |
| Parallelizable | No (sequential) | Yes (all positions computed in parallel) |
| Computational complexity per layer | $O(n \cdot d^2)$ | $O(n^2 \cdot d)$ |
| Path length between positions | $O(n)$ | $O(1)$ |

The trade-off: self-attention is $O(n^2)$ in sequence length, which becomes expensive for very long sequences. This limitation would motivate later work on efficient attention (Chapter 15).

Multi-Head Attention

Instead of computing a single attention function, compute $h$ attention functions in parallel, each with different learned projections:

\[\text{head}_i = \text{Attention}(XW_Q^i, XW_K^i, XW_V^i)\]
\[\text{MultiHead}(X) = \text{Concat}(\text{head}_1, ..., \text{head}_h) W_O\]

Each head can learn to attend to different types of relationships:

- Syntactic structure (e.g., a verb attending to its subject)
- Positional patterns (e.g., attending to the previous token)
- Semantic similarity (e.g., related content words across the sentence)
In the original Transformer, $d_{\text{model}} = 512$ with $h = 8$ heads, so each head operates in $d_k = 64$ dimensions.
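A compact implementation of the formulation above. This is a sketch; the fused QKV projection is a common implementation choice, not required by the math:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)  # fused W_Q, W_K, W_V
        self.W_O = nn.Linear(d_model, d_model)      # output projection

    def forward(self, x):
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (B, n_heads, T, d_k) so each head attends independently
        q, k, v = (t.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
                   for t in (q, k, v))
        w = torch.softmax(q @ k.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)
        out = (w @ v).transpose(1, 2).reshape(B, T, D)  # concatenate heads
        return self.W_O(out)
```

With $d_{\text{model}} = 512$ and $h = 8$, each head operates in $d_k = 64$ dimensions, matching the original configuration.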

The Transformer: “Attention Is All You Need” (2017)

Vaswani et al. from Google published the landmark paper that replaced recurrence entirely with attention. The Transformer architecture became the foundation for virtually all modern deep learning.

Architecture Overview

                    Output Probabilities
                          ↑
                      Linear + Softmax
                          ↑
                  ┌── Decoder Block ──┐  × N
                  │  Masked Self-Attn │
                  │  Cross-Attention  │
                  │  Feed-Forward     │
                  │  (+ Residual + LN)│
                  └───────────────────┘
                          ↑
                  ┌── Encoder Block ──┐  × N
                  │  Self-Attention   │
                  │  Feed-Forward     │
                  │  (+ Residual + LN)│
                  └───────────────────┘
                          ↑
              Positional Encoding + Embeddings
                          ↑
                    Input Tokens

Key Components

1. Positional Encoding

Self-attention is permutation invariant — it doesn’t know the order of tokens. Positional encodings inject position information using sine and cosine functions of different frequencies:

\[PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)\]
\[PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)\]

The sinusoidal encoding allows the model to learn relative positions and generalize to sequence lengths not seen during training (in theory).

Modern models often use learned positional embeddings or Rotary Position Embeddings (RoPE), which encode relative rather than absolute positions.
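The sinusoidal encoding can be generated in a few lines, following the formulas above (a sketch; assumes an even $d$):

```python
import torch

def sinusoidal_encoding(max_len, d_model):
    pos = torch.arange(max_len).unsqueeze(1).float()   # (max_len, 1)
    i = torch.arange(0, d_model, 2).float()            # even dimension indices
    freq = pos / (10000 ** (i / d_model))              # pos / 10000^(2i/d)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(freq)   # even indices: sine
    pe[:, 1::2] = torch.cos(freq)   # odd indices: cosine
    return pe

pe = sinusoidal_encoding(100, 512)   # added element-wise to token embeddings
```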

2. Encoder Block

Each encoder block has two sub-layers, each wrapped with residual connection and layer normalization:

  1. Multi-head self-attention: Each token attends to all tokens
  2. Position-wise feed-forward network: Two linear transformations with a non-linearity:
\[\text{FFN}(x) = W_2 \cdot \text{ReLU}(W_1 x + b_1) + b_2\]

The original Transformer used ReLU; most modern variants substitute GELU or gated activations.

The FFN operates on each position independently — it’s the “thinking” step after the “communication” step of attention.
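The FFN is just two linear layers applied identically at every position. A minimal sketch, using the original paper's 4× expansion (512 → 2048 → 512):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand (2048 in the original Transformer)
            nn.ReLU(),
            nn.Linear(d_ff, d_model),   # project back to d_model
        )

    def forward(self, x):
        # Applied to each position independently: no mixing across tokens
        return self.net(x)
```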

3. Decoder Block

The decoder adds a third sub-layer:

  1. Masked self-attention: Each token can only attend to previous tokens (causal masking), preventing information leakage during autoregressive generation
  2. Cross-attention: The decoder attends to the encoder output (queries from decoder, keys/values from encoder)
  3. Feed-forward network: Same as in the encoder

Masking in Detail

Causal mask for autoregressive generation:

\[\text{Mask}_{ij} = \begin{cases} 0 & \text{if } j \leq i \text{ (allowed)} \\ -\infty & \text{if } j > i \text{ (blocked)} \end{cases}\]

Adding $-\infty$ before softmax ensures zero attention weight to future tokens:

\[\text{softmax}(QK^\top / \sqrt{d_k} + \text{Mask})\]
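The mask itself is just a lower-triangular matrix. A sketch using `torch.tril`:

```python
import torch

def causal_mask(n):
    # 1 where attention is allowed (j <= i), 0 where blocked (j > i)
    return torch.tril(torch.ones(n, n))

mask = causal_mask(4)
# Positions with mask == 0 get -inf added to their scores before softmax,
# so their attention weight comes out exactly zero.
scores = torch.randn(4, 4).masked_fill(mask == 0, float('-inf'))
weights = torch.softmax(scores, dim=-1)
```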

Why the Transformer Worked So Well

  1. Parallelism: Unlike RNNs, all positions are processed simultaneously, enabling massive GPU parallelism
  2. Direct long-range connections: Any two positions interact in a single layer
  3. Composability: Stacking layers composes increasingly complex interactions
  4. Residual connections everywhere: Gradient flow is excellent at any depth
  5. Scalability: Performance improves predictably with more data, parameters, and compute

Pre-Norm vs. Post-Norm

The original Transformer uses post-norm: apply layer normalization after the residual addition.

\[x_{l+1} = \text{LN}(x_l + \text{SubLayer}(x_l))\]

Pre-norm moves normalization before the sub-layer:

\[x_{l+1} = x_l + \text{SubLayer}(\text{LN}(x_l))\]

Pre-norm creates a cleaner residual pathway and is more stable for training very deep Transformers. It is the standard in GPT-2/3/4, LLaMA, and most modern LLMs.
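The two orderings differ by a single line. A sketch, where `sublayer` stands in for attention or the FFN:

```python
import torch
import torch.nn as nn

d = 64
ln = nn.LayerNorm(d)
sublayer = nn.Linear(d, d)   # stand-in for attention or the FFN
x = torch.randn(2, 10, d)

post_norm = ln(x + sublayer(x))   # original Transformer: normalize after the residual add
pre_norm = x + sublayer(ln(x))    # GPT-style: the residual path stays untouched
```

In the pre-norm form, the identity path from input to output passes through no normalization at all, which is what makes gradient flow so stable at depth.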

The Impact

The Transformer’s impact on deep learning cannot be overstated:

- It became the backbone of BERT, the GPT series, and essentially every large language model since
- It crossed domains: Vision Transformers for images, and Transformer-based models for speech, code, and protein structure
- Its predictable scaling behavior enabled the “more data, more parameters, more compute” recipe behind modern AI

The Transformer didn’t just improve on RNNs — it became the universal architecture for deep learning.

Code Example: Scaled Dot-Product Attention

```python
# See code/ch08_transformer.py for full implementation
import torch
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))  # block disallowed positions
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, V), weights
```

Key Takeaways

- Attention removed the fixed-vector bottleneck of seq2seq models by letting the decoder look back at all encoder states
- Scaled dot-product attention divides scores by $\sqrt{d_k}$ to keep softmax gradients healthy at high dimension
- Self-attention connects any two positions in one step and parallelizes fully, at $O(n^2)$ cost in sequence length
- Multi-head attention lets the model capture several types of relationships simultaneously
- The Transformer (2017) replaced recurrence with attention entirely and became the foundation for virtually all modern deep learning

Previous Chapter: Residual Networks and Skip Connections

Next Chapter: Generative Adversarial Networks

Back to Table of Contents