Chapter 5: Recurrent Networks and Gating Mechanisms — LSTM & GRU (1997–2014)

The Challenge of Sequential Data

Language, music, time series, video — the world is full of sequential data where context from the past is essential for understanding the present. A fully connected or convolutional network processes each input independently, with no memory of what came before. Recurrent Neural Networks (RNNs) were designed to fix this.

Vanilla RNNs

An RNN maintains a hidden state $h_t$ that is updated at each time step:

\[h_t = \sigma(W_{hh} h_{t-1} + W_{xh} x_t + b_h)\]
\[y_t = W_{hy} h_t + b_y\]

The hidden state acts as a memory — it carries information from all previous inputs. The same weight matrices $W_{hh}$, $W_{xh}$, and $W_{hy}$ are shared across all time steps, making RNNs parameter-efficient regardless of sequence length.
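The recurrence above can be sketched in a few lines of NumPy. Dimensions are arbitrary, and the nonlinearity $\sigma$ is taken here to be tanh, a common choice for vanilla RNNs:

```python
import numpy as np

# Minimal vanilla RNN forward pass (illustrative sketch; sizes are arbitrary).
rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 4, 8, 3

# The same weight matrices are reused at every time step.
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

def rnn_forward(xs):
    """xs: list of input vectors, one per time step."""
    h = np.zeros(hidden_dim)  # initial hidden state
    ys = []
    for x in xs:
        h = np.tanh(W_hh @ h + W_xh @ x + b_h)  # hidden state update
        ys.append(W_hy @ h + b_y)               # per-step output
    return ys, h

xs = [rng.normal(size=input_dim) for _ in range(5)]
ys, h_final = rnn_forward(xs)
print(len(ys), h_final.shape)  # one output per time step
```

Note that the loop body is identical at every step; only the hidden state changes, which is exactly the parameter sharing described above.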

Backpropagation Through Time (BPTT)

Training an RNN means “unrolling” it through time and applying standard backpropagation to the unrolled graph. Gradients must flow backward through all time steps:

\[\frac{\partial L}{\partial h_t} = \frac{\partial L}{\partial h_T} \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}}\]

This product of Jacobian matrices has the same fundamental problem as deep feedforward networks: vanishing and exploding gradients, but now through time rather than through layers.
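A small numeric sketch makes the problem concrete. In the linear case, each backward step multiplies the gradient by the transposed recurrent matrix; whether its singular values sit below or above 1 decides the gradient's fate (the matrices and sizes here are arbitrary):

```python
import numpy as np

# Repeatedly multiplying a gradient by the same recurrent Jacobian either
# shrinks it toward zero or blows it up, depending on the matrix's scale.
rng = np.random.default_rng(0)
n, T = 8, 50
norms = {}
for scale in (0.5, 1.5):  # contracting vs. expanding recurrent weights
    W = scale * np.linalg.qr(rng.normal(size=(n, n)))[0]  # scaled orthogonal matrix
    g = np.ones(n)
    for _ in range(T):
        g = W.T @ g  # one backward step through time (linear case)
    norms[scale] = np.linalg.norm(g)
print(norms)  # gradient vanishes for scale 0.5, explodes for 1.5
```

After 50 steps the norm has been multiplied by $0.5^{50}$ or $1.5^{50}$, a swing of more than twenty orders of magnitude.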

The Long-Range Dependency Problem

In practice, vanilla RNNs can only remember about 10–20 time steps. For natural language, this is catastrophic:

“The cat, which had been sitting on the mat near the door that was left open by the guest who arrived yesterday, was sleeping.”

The verb “was” depends on the subject “cat” — many words earlier. A vanilla RNN forgets the subject long before reaching the verb.

LSTM: Learning When to Remember and Forget (1997)

Sepp Hochreiter and Jürgen Schmidhuber introduced the Long Short-Term Memory (LSTM) in 1997. The key innovation was adding explicit gating mechanisms that control information flow.

The Cell State: A Highway for Information

An LSTM introduces a cell state $C_t$ — a separate pathway that carries information across time steps with minimal interference. The cell state is like a conveyor belt: information flows along it with only linear interactions, making it much easier for gradients to propagate.

The Three Gates

1. Forget Gate — “What should I throw away?”

\[f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)\]

The forget gate looks at the previous hidden state and current input, and outputs a value between 0 and 1 for each element of the cell state. A value of 1 means “keep everything,” and 0 means “forget completely.”

Example: When processing a sentence and encountering a new subject, the forget gate might erase the old subject from memory.

2. Input Gate — “What new information should I store?”

\[i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)\]
\[\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)\]

The input gate has two parts: a sigmoid that decides which values to update, and a tanh that creates candidate new values. Their product determines the new information added to the cell state.

3. Output Gate — “What should I output?”

\[o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)\]
\[h_t = o_t \odot \tanh(C_t)\]

The output gate controls which parts of the cell state are exposed as the hidden state (and thus influence the output and the next time step).

Cell State Update

The cell state update combines the forget and input gates:

\[C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\]

This is the critical equation. The cell state is updated through two terms: the forget gate scales the old cell state $C_{t-1}$, keeping or erasing each element, and the input gate scales the candidate values $\tilde{C}_t$ before they are added.

The use of addition (not multiplication) is why LSTMs largely solve the vanishing gradient problem: gradients flow through the cell state via addition, which preserves them much better than repeated multiplication through a nonlinearity.
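Putting the three gates and the cell-state update together, one LSTM step can be written directly from the equations above (a NumPy sketch with arbitrary dimensions, not a production implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, params):
    """One LSTM step implementing the gate equations above.
    params holds W_f, W_i, W_C, W_o (each acting on [h_{t-1}, x_t]) and biases."""
    hx = np.concatenate([h_prev, x])                      # [h_{t-1}, x_t]
    f = sigmoid(params["W_f"] @ hx + params["b_f"])       # forget gate
    i = sigmoid(params["W_i"] @ hx + params["b_i"])       # input gate
    C_tilde = np.tanh(params["W_C"] @ hx + params["b_C"]) # candidate values
    o = sigmoid(params["W_o"] @ hx + params["b_o"])       # output gate
    C = f * C_prev + i * C_tilde                          # additive cell-state update
    h = o * np.tanh(C)                                    # exposed hidden state
    return h, C

# Tiny usage example with arbitrary dimensions.
rng = np.random.default_rng(0)
d_x, d_h = 3, 5
params = {}
for name in ("f", "i", "C", "o"):
    params[f"W_{name}"] = rng.normal(scale=0.1, size=(d_h, d_h + d_x))
    params[f"b_{name}"] = np.zeros(d_h)
h, C = lstm_step(rng.normal(size=d_x), np.zeros(d_h), np.zeros(d_h), params)
print(h.shape, C.shape)
```

The only place old and new information meet is the single addition in the `C` update, which is the "conveyor belt" from the previous section.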

Why the Gates Are Sigmoid

All three gates use sigmoid activation, outputting values in $(0, 1)$. This is deliberate: a gate acts as a soft mask, where 0 blocks information entirely, 1 passes it through unchanged, and intermediate values scale it smoothly.

The candidate cell state uses tanh (outputting values in $(-1, 1)$) because it carries content rather than a mask: it must be able to both add and subtract information.

LSTM Variants

Peephole Connections (2000)

Gers and Schmidhuber added “peephole” connections that let gates look directly at the cell state:

\[f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + W_{pf} \cdot C_{t-1} + b_f)\]

This gives gates access to the current memory content, not just the hidden state.

Coupled Forget-Input Gate

Instead of separate forget and input gates, use a single gate for both:

\[C_t = f_t \odot C_{t-1} + (1 - f_t) \odot \tilde{C}_t\]

The intuition: if you’re forgetting something, you should be remembering something new in its place. This reduces parameters with minimal performance loss.

GRU: A Simpler Alternative (2014)

Kyunghyun Cho and colleagues proposed the Gated Recurrent Unit, which simplifies the LSTM by merging the cell state and hidden state into a single vector and using only two gates:

Reset Gate — “How much past information to ignore”

\[r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)\]

Update Gate — “How much past information to keep”

\[z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)\]

State Update

\[\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t] + b)\]
\[h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t\]

The update gate $z_t$ acts as both the forget and input gate of an LSTM: it interpolates between the old state and the candidate new state.
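Mirroring the LSTM sketch, a single GRU step follows directly from the two gates and the interpolation above (again a NumPy illustration with arbitrary dimensions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, params):
    """One GRU step following the equations above (illustrative sketch)."""
    hx = np.concatenate([h_prev, x])
    r = sigmoid(params["W_r"] @ hx + params["b_r"])  # reset gate
    z = sigmoid(params["W_z"] @ hx + params["b_z"])  # update gate
    # Candidate state: the reset gate masks the old state before mixing with x.
    h_tilde = np.tanh(params["W"] @ np.concatenate([r * h_prev, x]) + params["b"])
    return (1 - z) * h_prev + z * h_tilde  # interpolate between old and new

# Tiny usage example.
rng = np.random.default_rng(0)
d_x, d_h = 3, 5
params = {name: rng.normal(scale=0.1, size=(d_h, d_h + d_x))
          for name in ("W_r", "W_z", "W")}
params.update({b: np.zeros(d_h) for b in ("b_r", "b_z", "b")})
h = gru_step(rng.normal(size=d_x), np.zeros(d_h), params)
print(h.shape)
```

Note there is no separate cell state: the single vector `h` plays both roles, with `z` deciding element-wise how much of it to overwrite.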

LSTM vs. GRU

| Feature       | LSTM                            | GRU                       |
| ------------- | ------------------------------- | ------------------------- |
| Gates         | 3 (forget, input, output)       | 2 (reset, update)         |
| State vectors | 2 (cell state, hidden state)    | 1 (hidden state)          |
| Parameters    | More                            | ~25% fewer                |
| Performance   | Slightly better on complex tasks | Comparable, trains faster |
| Memory        | More memory per cell            | Less memory per cell      |

In practice, the performance difference is often negligible. GRUs train faster due to fewer parameters and are preferred when computational resources are limited. LSTMs remain the default for tasks requiring fine-grained control over memory.
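The "~25% fewer parameters" claim can be checked directly: for the same input and hidden sizes, the GRU has three weight blocks where the LSTM has four (the layer sizes below are arbitrary):

```python
import torch.nn as nn

# Parameter counts for same-sized single-layer LSTM and GRU.
embed_dim, hidden_dim = 128, 256
lstm = nn.LSTM(embed_dim, hidden_dim)
gru = nn.GRU(embed_dim, hidden_dim)

n_lstm = sum(p.numel() for p in lstm.parameters())
n_gru = sum(p.numel() for p in gru.parameters())
print(n_lstm, n_gru, n_gru / n_lstm)  # the ratio is exactly 3/4
```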

The Impact of Gated RNNs

LSTM and GRU networks dominated sequential tasks from 2014 to 2017, setting the state of the art in machine translation, speech recognition, language modeling, handwriting recognition, and image captioning.

Sequence-to-Sequence (Seq2Seq) Models

The seq2seq framework (Sutskever et al., 2014) was a landmark: an LSTM encoder reads the input sequence and compresses it into a fixed-size context vector, then an LSTM decoder generates the output sequence from this vector.

This architecture powered the first neural machine translation systems. However, compressing an entire sentence into a single fixed-size vector was a bottleneck — it struggled with long sentences. The solution (attention) would come in Chapter 8.
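A minimal encoder-decoder skeleton in this spirit might look as follows. The class name, vocabulary sizes, and dimensions are illustrative choices, not the paper's exact setup:

```python
import torch
import torch.nn as nn

# Sketch of a seq2seq model: the encoder's final (h, c) pair is the fixed-size
# context vector, and the decoder is initialized from it.
class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, embed_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src, tgt):
        _, context = self.encoder(self.src_embed(src))   # fixed-size bottleneck
        dec_out, _ = self.decoder(self.tgt_embed(tgt), context)
        return self.out(dec_out)                         # logits per target step

model = Seq2Seq(src_vocab=100, tgt_vocab=120)
src = torch.randint(0, 100, (2, 7))  # batch of 2 source sequences, length 7
tgt = torch.randint(0, 120, (2, 5))  # teacher-forced target inputs, length 5
logits = model(src, tgt)
print(logits.shape)
```

Everything the decoder knows about the source passes through `context`, two tensors whose size is fixed regardless of input length; that is the bottleneck attention later removes.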

Bidirectional RNNs

A standard RNN only has access to past context. Bidirectional RNNs process the sequence in both directions:

\[\overrightarrow{h}_t = \text{RNN}(x_t, \overrightarrow{h}_{t-1})\]
\[\overleftarrow{h}_t = \text{RNN}(x_t, \overleftarrow{h}_{t+1})\]
\[h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]\]

The concatenated hidden state $h_t$ has context from both past and future, which is crucial for tasks like named entity recognition where the meaning of a word depends on the words around it.
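In PyTorch the concatenation is handled internally: setting `bidirectional=True` doubles the feature dimension of every output step (sizes below are arbitrary):

```python
import torch
import torch.nn as nn

# A bidirectional LSTM concatenates forward and backward hidden states,
# so each output step has 2 * hidden_size features.
rnn = nn.LSTM(input_size=16, hidden_size=32, batch_first=True,
              bidirectional=True)
x = torch.randn(4, 10, 16)  # (batch, time, features)
out, _ = rnn(x)
print(out.shape)  # feature dimension is 2 * 32 = 64
```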

Deep RNNs: Stacking Layers

Multiple RNN layers can be stacked, with the hidden state of one layer serving as input to the next:

\[h_t^{(l)} = \text{LSTM}(h_t^{(l-1)}, h_{t-1}^{(l)})\]

Google’s Neural Machine Translation (2016) used 8-layer LSTMs. However, very deep RNNs are harder to train than deep CNNs because gradients must flow both through layers and through time.
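Stacking is a one-argument change in PyTorch; the depth of 8 below echoes GNMT's configuration, though the other sizes here are arbitrary:

```python
import torch
import torch.nn as nn

# num_layers stacks LSTMs: layer l consumes layer l-1's hidden states
# at every time step.
rnn = nn.LSTM(input_size=32, hidden_size=64, num_layers=8, batch_first=True)
x = torch.randn(2, 20, 32)
out, (h_n, c_n) = rnn(x)
print(out.shape)  # outputs come from the top layer only
print(h_n.shape)  # one final hidden state per layer
```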

Code Example: LSTM vs. GRU for Sequence Classification

# See code/ch05_lstm_gru.py for full training pipeline
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=128,
                 hidden_dim=256, num_classes=2, use_gru=False):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        RNNClass = nn.GRU if use_gru else nn.LSTM
        self.rnn = RNNClass(embed_dim, hidden_dim,
                            num_layers=2, batch_first=True,
                            bidirectional=True, dropout=0.3)
        # Bidirectional doubles the feature dimension of the final state.
        self.classifier = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, x):
        embedded = self.embedding(x)               # (batch, seq, embed_dim)
        _, hidden = self.rnn(embedded)
        h_n = hidden[0] if isinstance(hidden, tuple) else hidden  # LSTM returns (h, c)
        # Concatenate the top layer's forward and backward final states.
        final = torch.cat([h_n[-2], h_n[-1]], dim=1)
        return self.classifier(final)

The Limitations That Led to Transformers

Despite their success, gated RNNs had fundamental limitations:

  1. Sequential computation: Each time step depends on the previous one, preventing parallelization. Training on long sequences is inherently slow.
  2. Limited memory: Even with gating, LSTMs struggle with dependencies beyond ~500 tokens in practice.
  3. Fixed bottleneck: The seq2seq context vector is a fixed-size compression of the entire input, losing information for long sequences.
  4. Difficulty with very long-range dependencies: The theoretical ability to maintain information indefinitely doesn’t hold in practice for very long sequences.

These limitations motivated two key innovations: the attention mechanism (Chapter 8), which lets the decoder consult every encoder state instead of a single fixed vector, and the Transformer architecture, which removes recurrence entirely in favor of self-attention.

Key Takeaways

- Vanilla RNNs share weights across time steps and carry context in a hidden state, but BPTT makes them vulnerable to vanishing and exploding gradients.
- The LSTM's additive cell-state update, controlled by the forget, input, and output gates, lets gradients survive across long spans.
- The GRU merges the cell and hidden states and uses two gates, matching LSTM performance on many tasks with roughly 25% fewer parameters.
- Bidirectional processing adds future context; stacking layers adds depth, at the cost of harder optimization.
- Sequential computation and the fixed-size seq2seq bottleneck remained, motivating attention and, ultimately, the Transformer.

Previous Chapter: The Vanishing Gradient Problem and Its Solutions

Next Chapter: Regularization Strategies — Taming Overfitting

Back to Table of Contents