Recall from Chapter 5: in sequence-to-sequence models, an LSTM encoder compresses an entire input sequence into a single fixed-size vector, and the decoder generates output from that vector. For short sentences, this works. For long sentences, critical information is inevitably lost.
The analogy: imagine summarizing a 500-page book into a single paragraph, and then asking someone to reconstruct the book from that paragraph. Some information will be lost, no matter how good the summarizer.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio proposed a simple but profound idea: instead of compressing the entire input into one vector, let the decoder look back at all encoder hidden states at each decoding step, paying more attention to the relevant parts.
At each decoding step $t$, the decoder state $s_t$ is scored against every encoder hidden state $h_i$:

\[e_{ti} = a(s_t, h_i), \qquad \alpha_{ti} = \frac{\exp(e_{ti})}{\sum_j \exp(e_{tj})}, \qquad c_t = \sum_i \alpha_{ti} h_i\]

where $a$ is a learned alignment function (a small neural network), $\alpha_{ti}$ are the attention weights, and $c_t$ is the context vector fed to the decoder.
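The steps above can be sketched with toy tensors. All dimensions and weight matrices here are illustrative stand-ins for learned parameters, not a trained model:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T_enc, d = 5, 8                        # encoder length, hidden size (toy values)
h = torch.randn(T_enc, d)              # encoder hidden states h_1 .. h_T
s_t = torch.randn(d)                   # current decoder state

# Additive alignment a(s_t, h_i) = v^T tanh(W [s_t; h_i]), with random W, v
W = torch.randn(d, 2 * d)
v = torch.randn(d)
e = torch.stack([v @ torch.tanh(W @ torch.cat([s_t, h_i])) for h_i in h])

alpha = F.softmax(e, dim=0)            # attention weights over encoder positions
c_t = (alpha.unsqueeze(1) * h).sum(0)  # context vector: weighted sum of the h_i
```

The decoder then consumes $c_t$ alongside its own state when predicting the next token.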
Luong et al. explored different ways to compute alignment scores:
| Variant | Score Function | Properties |
|---|---|---|
| Dot product | $e_{ti} = s_t^\top h_i$ | Simplest, no parameters |
| General | $e_{ti} = s_t^\top W h_i$ | Learned bilinear form |
| Concat | $e_{ti} = v^\top \tanh(W[s_t; h_i])$ | Bahdanau’s original |
The dot product is the simplest and fastest. It works when the encoder and decoder hidden states have the same dimension.
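The three score functions from the table can be compared on a single state pair. The weights below are random placeholders for what would be learned parameters:

```python
import torch

torch.manual_seed(0)
d = 8
s_t = torch.randn(d)                   # decoder state
h_i = torch.randn(d)                   # one encoder hidden state

# Dot product: no parameters at all
e_dot = s_t @ h_i

# General: learned bilinear form s_t^T W h_i
W = torch.randn(d, d)
e_general = s_t @ (W @ h_i)

# Concat (Bahdanau): small feed-forward scorer v^T tanh(W [s_t; h_i])
Wc = torch.randn(d, 2 * d)
v = torch.randn(d)
e_concat = v @ torch.tanh(Wc @ torch.cat([s_t, h_i]))
```

All three return a scalar score; only their parameter count and expressiveness differ.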
For high-dimensional vectors, dot products grow in magnitude, pushing softmax into saturated regions with near-zero gradients. The fix is to scale by the square root of the dimension:
\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V\]

where $Q$, $K$, and $V$ are matrices of query, key, and value vectors, and $d_k$ is the dimension of the keys.
The scaling factor $\frac{1}{\sqrt{d_k}}$ keeps the dot products in a range where softmax has meaningful gradients. This is a small detail with large practical impact.
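The saturation effect is easy to observe numerically. The sketch below draws random queries and keys at several dimensions and compares the softmax with and without scaling (toy sizes, random data):

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
results = {}
for d_k in (4, 64, 1024):
    q = torch.randn(d_k)
    K = torch.randn(10, d_k)            # 10 candidate keys
    raw = K @ q                         # unscaled scores: spread grows ~ sqrt(d_k)
    scaled = raw / math.sqrt(d_k)       # scaled scores: spread stays near 1
    results[d_k] = (raw.std().item(),
                    F.softmax(raw, dim=0).max().item(),
                    F.softmax(scaled, dim=0).max().item())

# At d_k = 1024 the unscaled softmax is nearly one-hot (saturated, tiny
# gradients), while the scaled version keeps a softer distribution.
```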
The next leap: instead of attending from decoder to encoder, let a sequence attend to itself. Each position computes attention over all positions in the same sequence.
For a sequence of token embeddings $X$:
\[Q = XW_Q, \quad K = XW_K, \quad V = XW_V\]

Each token generates a query (“what am I looking for?”), a key (“what do I represent?”), and a value (“what information do I carry?”). The attention output for each token is a weighted combination of all values, where the weights depend on the query-key compatibility.
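A single self-attention pass fits in a few lines. The projection matrices here are random stand-ins for learned weights, with toy sizes:

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d_model, d_k = 4, 16, 16            # sequence length, model dim, key dim
X = torch.randn(n, d_model)            # one sequence of token embeddings

W_Q = torch.randn(d_model, d_k)        # learned projections (random here)
W_K = torch.randn(d_model, d_k)
W_V = torch.randn(d_model, d_k)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V    # every token gets a query, key, value
weights = F.softmax(Q @ K.T / math.sqrt(d_k), dim=-1)  # n x n attention matrix
out = weights @ V                      # each row: weighted mix of all values
```

Every row of `weights` sums to 1, so each output position is a convex combination of all value vectors in the sequence.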
| Property | RNN | Self-Attention |
|---|---|---|
| Captures long-range dependencies | Poorly (gradient decay) | Directly (any two positions interact) |
| Parallelizable | No (sequential) | Yes (all positions computed in parallel) |
| Computational complexity per layer | $O(n \cdot d^2)$ | $O(n^2 \cdot d)$ |
| Path length between positions | $O(n)$ | $O(1)$ |
The trade-off: self-attention is $O(n^2)$ in sequence length, which becomes expensive for very long sequences. This limitation would motivate later work on efficient attention (Chapter 15).
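The quadratic growth is worth making concrete. Attention computes one score per pair of positions, so the score matrix has $n^2$ entries per head (illustrative arithmetic only):

```python
# Entries in the n x n attention score matrix, per head and per layer
costs = {n: n * n for n in (1_000, 8_000, 32_000)}
# 1,000 tokens -> 1e6 entries; 32,000 tokens -> ~1e9 entries.
# Doubling the sequence length quadruples the compute and memory.
```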
Instead of computing a single attention function, compute $h$ attention functions in parallel, each with different learned projections:
\[\text{head}_i = \text{Attention}(XW_Q^i, XW_K^i, XW_V^i)\]

\[\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W_O\]
Each head can learn to attend to a different type of relationship: for example, syntactic dependencies, nearby positions, or coreference links.
In the original Transformer, $d_{\text{model}} = 512$ with $h = 8$ heads, so each head operates in $d_k = 64$ dimensions.
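A direct, loop-based sketch of multi-head attention with the original sizes ($d_{\text{model}} = 512$, $h = 8$). The per-head projections are random placeholders for learned weights:

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d_model, h = 4, 512, 8
d_k = d_model // h                     # 64 dims per head, as in the original
X = torch.randn(n, d_model)

W_Q = torch.randn(h, d_model, d_k)     # one projection triple per head
W_K = torch.randn(h, d_model, d_k)
W_V = torch.randn(h, d_model, d_k)
W_O = torch.randn(d_model, d_model)    # output projection

heads = []
for i in range(h):
    Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]
    A = F.softmax(Q @ K.T / math.sqrt(d_k), dim=-1)
    heads.append(A @ V)                # each head: n x d_k

out = torch.cat(heads, dim=-1) @ W_O   # concat back to n x d_model
```

Production implementations batch the heads into a single tensor operation rather than looping, but the computation is the same.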
In 2017, Vaswani et al. from Google published the landmark paper “Attention Is All You Need,” which replaced recurrence entirely with attention. The Transformer architecture became the foundation for virtually all modern deep learning.
Output Probabilities
↑
Linear + Softmax
↑
┌── Decoder Block ──┐ × N
│ Masked Self-Attn │
│ Cross-Attention │
│ Feed-Forward │
│ (+ Residual + LN)│
└───────────────────┘
↑
┌── Encoder Block ──┐ × N
│ Self-Attention │
│ Feed-Forward │
│ (+ Residual + LN)│
└───────────────────┘
↑
Positional Encoding + Embeddings
↑
Input Tokens
Self-attention is permutation invariant — it doesn’t know the order of tokens. Positional encodings inject position information using sine and cosine functions of different frequencies:
\[PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)\]
The sinusoidal encoding allows the model to learn relative positions and generalize to sequence lengths not seen during training (in theory).
Modern models often use learned positional embeddings or Rotary Position Embeddings (RoPE), which encode relative rather than absolute positions.
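The sinusoidal scheme above can be sketched directly from the formula (the function name and sizes here are illustrative):

```python
import torch

def sinusoidal_pe(max_len, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(same)."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    i = torch.arange(0, d, 2, dtype=torch.float32)                 # even dims
    angles = pos / (10000 ** (i / d))                              # (max_len, d/2)
    pe = torch.zeros(max_len, d)
    pe[:, 0::2] = torch.sin(angles)    # even columns get sine
    pe[:, 1::2] = torch.cos(angles)    # odd columns get cosine
    return pe

pe = sinusoidal_pe(100, 16)            # one row per position, added to embeddings
```

Each dimension pair oscillates at a different frequency, so every position receives a unique, smoothly varying fingerprint.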
Each encoder block has two sub-layers, each wrapped with a residual connection and layer normalization: multi-head self-attention, followed by a position-wise feed-forward network (FFN).
The FFN operates on each position independently — it’s the “thinking” step after the “communication” step of attention.
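A sketch of the position-wise FFN with the original Transformer sizes ($d_{\text{model}} = 512$, inner dimension $d_{ff} = 2048$); the weights are untrained:

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),          # expand
    nn.ReLU(),
    nn.Linear(d_ff, d_model),          # project back
)

x = torch.randn(4, d_model)            # 4 positions in a sequence
y = ffn(x)                             # the same FFN is applied to every row
```

Because the two linear layers act row-wise, each position is transformed independently; no information moves between positions in this step.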
The decoder adds a third sub-layer: cross-attention, in which the decoder’s queries attend to the encoder’s output keys and values.
Causal mask for autoregressive generation:
\[\text{Mask}_{ij} = \begin{cases} 0 & \text{if } j \leq i \text{ (allowed)} \\ -\infty & \text{if } j > i \text{ (blocked)} \end{cases}\]

Adding $-\infty$ before softmax ensures zero attention weight to future tokens:
\[\text{softmax}(QK^\top / \sqrt{d_k} + \text{Mask})\]

The original Transformer uses post-norm: apply layer normalization after the residual addition.
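The causal mask and masked softmax described above can be built in a few lines with `torch.triu` (toy sizes, random queries and keys):

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d_k = 5, 8
Q = torch.randn(n, d_k)
K = torch.randn(n, d_k)

# 0 where attention is allowed (j <= i), -inf where blocked (j > i)
mask = torch.triu(torch.full((n, n), float('-inf')), diagonal=1)

weights = F.softmax(Q @ K.T / math.sqrt(d_k) + mask, dim=-1)
# exp(-inf) = 0, so row i carries exactly zero weight on positions j > i;
# the first row can only attend to itself.
```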
\[x_{l+1} = \text{LN}(x_l + \text{SubLayer}(x_l))\]

Pre-norm moves normalization before the sub-layer:
\[x_{l+1} = x_l + \text{SubLayer}(\text{LN}(x_l))\]

Pre-norm creates a cleaner residual pathway and is more stable for training very deep Transformers. It is the standard in GPT-2/3/4, LLaMA, and most modern LLMs.
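The two wrappers side by side, using a linear layer as an illustrative stand-in for the attention or FFN sub-layer:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 16
ln = nn.LayerNorm(d)
sublayer = nn.Linear(d, d)             # stand-in for attention or FFN
x = torch.randn(3, d)

post_norm = ln(x + sublayer(x))        # original Transformer: LN after residual
pre_norm = x + sublayer(ln(x))         # GPT-2/LLaMA style: LN before sub-layer
```

In the pre-norm form the identity path `x + ...` is never normalized away, which is what keeps gradients well-behaved in very deep stacks.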
The Transformer’s impact on deep learning cannot be overstated.
The Transformer didn’t just improve on RNNs — it became the universal architecture for deep learning.
```python
# See code/ch08_transformer.py for full implementation
import math

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    # Query-key compatibility scores, scaled by sqrt(d_k) so the softmax
    # stays in a region with useful gradients
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 get -inf, hence zero attention weight
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, V), weights
```