Language, music, time series, video — the world is full of sequential data where context from the past is essential for understanding the present. A fully connected or convolutional network processes each input independently, with no memory of what came before. Recurrent Neural Networks (RNNs) were designed to fix this.
An RNN maintains a hidden state $h_t$ that is updated at each time step:
\[h_t = \sigma(W_{hh} h_{t-1} + W_{xh} x_t + b_h)\]
\[y_t = W_{hy} h_t + b_y\]
The hidden state acts as a memory — it carries information from all previous inputs. The same weight matrices $W_{hh}$, $W_{xh}$, and $W_{hy}$ are shared across all time steps, making RNNs parameter-efficient regardless of sequence length.
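The recurrence above can be sketched in a few lines of numpy. This is a minimal illustration, assuming tanh as the nonlinearity $\sigma$ and small random weights; note that the same three weight matrices are reused at every step, regardless of how long the sequence is.

```python
import numpy as np

def rnn_forward(xs, W_hh, W_xh, W_hy, b_h, b_y):
    """Run a vanilla RNN over a sequence, reusing the same weights at every step."""
    h = np.zeros(W_hh.shape[0])                   # initial hidden state h_0
    ys = []
    for x in xs:                                  # one step per input vector
        h = np.tanh(W_hh @ h + W_xh @ x + b_h)    # h_t depends on h_{t-1} and x_t
        ys.append(W_hy @ h + b_y)                 # y_t is read out from h_t
    return np.array(ys), h

rng = np.random.default_rng(0)
hidden, inp, out = 4, 3, 2
params = (rng.normal(size=(hidden, hidden)) * 0.1,   # W_hh
          rng.normal(size=(hidden, inp)) * 0.1,      # W_xh
          rng.normal(size=(out, hidden)) * 0.1,      # W_hy
          np.zeros(hidden), np.zeros(out))           # b_h, b_y
xs = rng.normal(size=(5, inp))                       # a sequence of 5 input vectors
ys, h_final = rnn_forward(xs, *params)
print(ys.shape, h_final.shape)                       # one output per time step
```

Because the final hidden state passes through tanh, every element stays in $(-1, 1)$ no matter how long the sequence runs.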
Training an RNN means “unrolling” it through time and applying standard backpropagation to the unrolled graph. Gradients must flow backward through all time steps:
\[\frac{\partial L}{\partial h_t} = \frac{\partial L}{\partial h_T} \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}}\]

This product of Jacobian matrices has the same fundamental problem as in deep feedforward networks: vanishing and exploding gradients, but now through time rather than through layers.
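The effect of that product is easy to demonstrate numerically. The sketch below uses a simplified Jacobian $\partial h_k / \partial h_{k-1} \approx W_{hh}$ (ignoring the activation's diagonal factor, which only shrinks the product further): when the largest singular value of $W_{hh}$ is below 1 the gradient norm decays geometrically, and when it is above 1 the norm explodes.

```python
import numpy as np

def grad_norm_through_time(W_hh, steps):
    """Norm of the product of simplified Jacobians dh_k/dh_{k-1} ~ W_hh."""
    grad = np.eye(W_hh.shape[0])
    for _ in range(steps):
        grad = grad @ W_hh.T        # one factor per time step traversed backward
    return np.linalg.norm(grad)

small = 0.5 * np.eye(4)   # largest singular value 0.5 < 1 -> vanishing
large = 1.5 * np.eye(4)   # largest singular value 1.5 > 1 -> exploding
print(grad_norm_through_time(small, 50))   # on the order of 0.5**50
print(grad_norm_through_time(large, 50))   # on the order of 1.5**50
```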
In practice, vanilla RNNs can only remember about 10–20 time steps. For natural language, this is catastrophic:
“The cat, which had been sitting on the mat near the door that was left open by the guest who arrived yesterday, was sleeping.”
The verb “was” depends on the subject “cat” — many words earlier. A vanilla RNN forgets the subject long before reaching the verb.
Sepp Hochreiter and Jürgen Schmidhuber introduced the Long Short-Term Memory (LSTM) in 1997. The key innovation was adding explicit gating mechanisms that control information flow.
An LSTM introduces a cell state $C_t$ — a separate pathway that carries information across time steps with minimal interference. The cell state is like a conveyor belt: information flows along it with only linear interactions, making it much easier for gradients to propagate.
\[f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)\]

The forget gate looks at the previous hidden state and current input, and outputs a value between 0 and 1 for each element of the cell state. A value of 1 means “keep everything,” and 0 means “forget completely.”
Example: When processing a sentence and encountering a new subject, the forget gate might erase the old subject from memory.
\[i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)\]
\[\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)\]
The input gate has two parts: a sigmoid that decides which values to update, and a tanh that creates candidate new values. Their product determines the new information added to the cell state.
\[o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)\]
\[h_t = o_t \odot \tanh(C_t)\]
The output gate controls which parts of the cell state are exposed as the hidden state (and thus influence the output and the next time step).
The cell state update combines the forget and input gates:
\[C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\]

This is the critical equation. The cell state is updated through two terms: the forget gate scales the old cell state ($f_t \odot C_{t-1}$), and the input gate scales the new candidate values ($i_t \odot \tilde{C}_t$).
The use of addition (rather than repeated multiplication) is why LSTMs largely avoid the vanishing gradient problem. Gradients flow through the cell state via the additive update, which preserves them far better than repeated multiplication through a nonlinearity.
All three gates use sigmoid activation, outputting values in $(0, 1)$. This is deliberate: a sigmoid output acts as a soft switch, smoothly scaling each element between fully blocked (0) and fully passed (1), while remaining differentiable so the gates themselves can be learned by gradient descent.
The candidate cell state uses tanh (outputting values in $(-1, 1)$) because it can both add and subtract information.
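Putting the gate equations together, a single LSTM time step can be sketched in numpy. This is a minimal, from-scratch illustration of the equations in this section (small random weights, toy dimensions), not a replacement for an optimized library cell.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    """One LSTM time step, following the gate equations in this section."""
    hx = np.concatenate([h_prev, x])      # [h_{t-1}, x_t]
    f = sigmoid(W_f @ hx + b_f)           # forget gate: what to erase
    i = sigmoid(W_i @ hx + b_i)           # input gate: what to write
    C_tilde = np.tanh(W_C @ hx + b_C)     # candidate cell state
    o = sigmoid(W_o @ hx + b_o)           # output gate: what to expose
    C = f * C_prev + i * C_tilde          # additive cell-state update
    h = o * np.tanh(C)                    # hidden state read out from the cell
    return h, C

rng = np.random.default_rng(1)
H, X = 4, 3
Ws = [rng.normal(size=(H, H + X)) * 0.1 for _ in range(4)]
bs = [np.zeros(H) for _ in range(4)]
h, C = lstm_step(rng.normal(size=X), np.zeros(H), np.zeros(H), *Ws, *bs)
print(h.shape, C.shape)
```

Since $h_t = o_t \odot \tanh(C_t)$ with $o_t \in (0,1)$, every element of the hidden state stays in $(-1, 1)$, while the cell state $C_t$ itself is unbounded.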
Gers and Schmidhuber added “peephole” connections that let gates look directly at the cell state:
\[f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + W_{pf} \cdot C_{t-1} + b_f)\]

This gives the gates access to the current memory content, not just the hidden state.
Instead of separate forget and input gates, use a single gate for both:
\[C_t = f_t \odot C_{t-1} + (1 - f_t) \odot \tilde{C}_t\]

The intuition: if you’re forgetting something, you should be remembering something new in its place. This reduces parameters with minimal performance loss.
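The coupled update is a one-line change, shown below with hand-picked gate values. Because the two coefficients sum to 1, each element of the new cell state is a convex combination of the old value and the candidate: $f_t = 1$ keeps the old value, $f_t = 0$ replaces it entirely.

```python
import numpy as np

def coupled_cell_update(f, C_prev, C_tilde):
    """Coupled forget/input update: whatever is forgotten is replaced by new content."""
    return f * C_prev + (1.0 - f) * C_tilde

f = np.array([1.0, 0.0, 0.5])            # keep / replace / blend
C_prev = np.array([2.0, 2.0, 2.0])
C_tilde = np.array([-1.0, -1.0, -1.0])
print(coupled_cell_update(f, C_prev, C_tilde))   # [ 2.  -1.   0.5]
```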
Kyunghyun Cho and colleagues proposed the Gated Recurrent Unit, which simplifies the LSTM by merging the cell state and hidden state into a single vector and using only two gates:
\[z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)\]
\[r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)\]
\[\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t] + b)\]
\[h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t\]
The update gate $z_t$ acts as both the forget and input gate of an LSTM: it interpolates between the old state and the candidate new state.
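A GRU time step is correspondingly shorter than an LSTM's. The numpy sketch below mirrors the equations above, again with toy dimensions and small random weights for illustration; note there is only one state vector and the final line is a direct interpolation controlled by $z_t$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, W_z, W_r, W, b_z, b_r, b):
    """One GRU time step with reset gate r_t and update gate z_t."""
    hx = np.concatenate([h_prev, x])
    z = sigmoid(W_z @ hx + b_z)              # update gate
    r = sigmoid(W_r @ hx + b_r)              # reset gate
    rhx = np.concatenate([r * h_prev, x])    # reset scales the old state
    h_tilde = np.tanh(W @ rhx + b)           # candidate state
    return (1.0 - z) * h_prev + z * h_tilde  # interpolate old vs. new

rng = np.random.default_rng(2)
H, X = 4, 3
W_z, W_r, W = (rng.normal(size=(H, H + X)) * 0.1 for _ in range(3))
h = gru_step(rng.normal(size=X), np.zeros(H), W_z, W_r, W,
             np.zeros(H), np.zeros(H), np.zeros(H))
print(h.shape)
```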
| Feature | LSTM | GRU |
|---|---|---|
| Gates | 3 (forget, input, output) | 2 (reset, update) |
| State vectors | 2 (cell state, hidden state) | 1 (hidden state) |
| Parameters | More | ~25% fewer |
| Performance | Slightly better on complex tasks | Comparable, trains faster |
| Memory | More memory per cell | Less memory per cell |
In practice, the performance difference is often negligible. GRUs train faster due to fewer parameters and are preferred when computational resources are limited. LSTMs remain the default for tasks requiring fine-grained control over memory.
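The "~25% fewer parameters" figure in the table follows directly from the gate counts. A back-of-the-envelope calculation (counting one bias vector per gate; real implementations may differ in bias handling) shows the ratio is exactly 3/4, because a GRU has three weight blocks where an LSTM has four, each of the same shape:

```python
def lstm_params(embed_dim, hidden_dim):
    # 4 blocks (forget, input, candidate, output), each seeing [h, x], plus a bias
    return 4 * (hidden_dim * (hidden_dim + embed_dim) + hidden_dim)

def gru_params(embed_dim, hidden_dim):
    # 3 blocks (update, reset, candidate), same shapes
    return 3 * (hidden_dim * (hidden_dim + embed_dim) + hidden_dim)

e, h = 128, 256
print(gru_params(e, h) / lstm_params(e, h))   # 0.75 — exactly 25% fewer
```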
LSTM and GRU networks dominated sequential tasks from 2014 to 2017, powering state-of-the-art systems in machine translation, speech recognition, and language modeling.
The seq2seq framework (Sutskever et al., 2014) was a landmark: an LSTM encoder reads the input sequence and compresses it into a fixed-size context vector, then an LSTM decoder generates the output sequence from this vector.
This architecture powered the first neural machine translation systems. However, compressing an entire sentence into a single fixed-size vector was a bottleneck — it struggled with long sentences. The solution (attention) would come in Chapter 8.
A standard RNN only has access to past context. Bidirectional RNNs process the sequence in both directions:
\[\overrightarrow{h}_t = \text{RNN}(x_t, \overrightarrow{h}_{t-1})\]
\[\overleftarrow{h}_t = \text{RNN}(x_t, \overleftarrow{h}_{t+1})\]
\[h_t = [\overrightarrow{h}_t \,;\, \overleftarrow{h}_t]\]
The concatenated hidden state $h_t$ has context from both past and future, which is crucial for tasks like named entity recognition where the meaning of a word depends on the words around it.
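The bidirectional scheme amounts to running two independent RNNs and concatenating their states per time step. A minimal numpy sketch, reusing the simple tanh cell from earlier in the chapter (separate weights per direction, biases omitted for brevity):

```python
import numpy as np

def run_direction(xs, W_hh, W_xh):
    """Run a simple tanh RNN over xs, already ordered in the processing direction."""
    h = np.zeros(W_hh.shape[0])
    hs = []
    for x in xs:
        h = np.tanh(W_hh @ h + W_xh @ x)
        hs.append(h)
    return hs

rng = np.random.default_rng(3)
H, X, T = 4, 3, 5
xs = list(rng.normal(size=(T, X)))
fwd = run_direction(xs,       rng.normal(size=(H, H)) * 0.1, rng.normal(size=(H, X)) * 0.1)
bwd = run_direction(xs[::-1], rng.normal(size=(H, H)) * 0.1, rng.normal(size=(H, X)) * 0.1)
# Re-align the backward pass with the original time order, then concatenate per step
h_cat = [np.concatenate([f, b]) for f, b in zip(fwd, bwd[::-1])]
print(len(h_cat), h_cat[0].shape)   # T steps, each of size 2 * H
```

At every position $t$, the first $H$ elements summarize $x_1 \dots x_t$ and the last $H$ summarize $x_t \dots x_T$.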
Multiple RNN layers can be stacked, with the hidden state of one layer serving as input to the next:
\[h_t^{(l)} = \text{LSTM}(h_t^{(l-1)}, h_{t-1}^{(l)})\]

Google’s Neural Machine Translation (2016) used 8-layer LSTMs. However, very deep RNNs are harder to train than deep CNNs because gradients must flow both through layers and through time.
```python
# See code/ch05_lstm_gru.py for full training pipeline
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=128,
                 hidden_dim=256, num_classes=2, use_gru=False):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        RNNClass = nn.GRU if use_gru else nn.LSTM
        self.rnn = RNNClass(embed_dim, hidden_dim,
                            num_layers=2, batch_first=True,
                            bidirectional=True, dropout=0.3)
        # Bidirectional: forward and backward states are concatenated,
        # so the classifier sees 2 * hidden_dim features
        self.classifier = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, x):                      # x: (batch, seq_len) token ids
        embedded = self.embedding(x)           # (batch, seq_len, embed_dim)
        output, _ = self.rnn(embedded)         # (batch, seq_len, 2 * hidden_dim)
        return self.classifier(output[:, -1])  # classify from the last time step
```
Despite their success, gated RNNs had fundamental limitations: computation is inherently sequential, so training cannot be parallelized across the time dimension, and even gated cells struggle to preserve information over very long sequences.
These limitations motivated two key innovations: the attention mechanism and, ultimately, the Transformer architecture, which are covered in the chapters ahead.
Previous Chapter: The Vanishing Gradient Problem and Its Solutions
Next Chapter: Regularization Strategies — Taming Overfitting