LSTM & GRU
Vanilla RNNs struggle to learn dependencies more than ~10 steps apart because gradients vanish through their tanh recurrence. The Long Short-Term Memory cell was designed in 1997 specifically to solve this, by adding a gated additive memory cell whose gradient can flow essentially unchanged through time. The GRU is a 2014 simplification with fewer parameters that often performs comparably.
LSTM — the gated cell
Long Short-Term Memory (Hochreiter, Schmidhuber, Neural Computation 1997) introduces a separate cell state $c_t$ alongside the hidden state $h_t$, controlled by three sigmoid gates:

$$
\begin{aligned}
f_t &= \sigma(W_f\,[h_{t-1}, x_t] + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i\,[h_{t-1}, x_t] + b_i) && \text{(input gate)} \\
\tilde{c}_t &= \tanh(W_c\,[h_{t-1}, x_t] + b_c) && \text{(candidate)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell update)} \\
o_t &= \sigma(W_o\,[h_{t-1}, x_t] + b_o) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

The crucial line is the cell update $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$: it is additive, so along this path $\partial c_t / \partial c_{t-1} = \operatorname{diag}(f_t)$ rather than a repeated multiplication through a squashing recurrence, and the gradient survives across many steps as long as the forget gate stays near 1.
Forget-gate bias init: the default trick is to initialise the forget-gate bias $b_f$ to 1 (or 2) so the gate starts close to 1 and the cell remembers by default; with a zero bias the gate starts around 0.5 and the network tends to erase its memory before it learns to use it.
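A minimal single-step LSTM cell in PyTorch, written to mirror the equations above. This is an illustrative sketch: the class name, gate ordering, and the 1.0 forget-bias value are choices made here, not any library's canonical implementation.

```python
import torch
import torch.nn as nn

class LSTMCell(nn.Module):
    """One step of an LSTM, following the equations above."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        # One matrix computes all four gate pre-activations at once: f, i, o, g.
        self.W = nn.Linear(input_size + hidden_size, 4 * hidden_size)
        # Forget-gate bias init: start the forget gate near 1 (illustrative value).
        with torch.no_grad():
            self.W.bias[:hidden_size].fill_(1.0)

    def forward(self, x, state):
        h_prev, c_prev = state
        z = self.W(torch.cat([h_prev, x], dim=-1))
        f, i, o, g = z.chunk(4, dim=-1)
        f, i, o = torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o)
        g = torch.tanh(g)               # candidate cell value
        c = f * c_prev + i * g          # additive cell update (the carousel)
        h = o * torch.tanh(c)
        return h, (h, c)

# Usage: one step on a batch of 8 vectors.
cell = LSTMCell(input_size=32, hidden_size=64)
x = torch.randn(8, 32)
h = c = torch.zeros(8, 64)
out, (h, c) = cell(x, (h, c))
```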
Why it works: error carousel
Hochreiter and Schmidhuber called the additive cell-state path the constant error carousel — gradients can flow back through arbitrarily many steps if the forget gates stay near 1. This is conceptually the same trick as residual connections in ResNet: an identity-like path through the depth/time axis that backprop can use without vanishing.
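A toy autograd check of the carousel, with the gates frozen at arbitrary constants for clarity rather than computed from the inputs:

```python
import torch

# Gradient of the final cell state w.r.t. the initial one after 200 steps.
T, n = 200, 4
c0 = torch.zeros(n, requires_grad=True)
c = c0
for _ in range(T):
    f = torch.full((n,), 0.99)          # forget gate held near 1
    i = torch.full((n,), 0.10)          # input gate
    c = f * c + i * torch.tanh(torch.randn(n))
c.sum().backward()
print(c0.grad)  # each entry = 0.99**200 ≈ 0.13: shrunk, but far from vanished
# A tanh-only recurrence of the same depth typically leaves a gradient many
# orders of magnitude smaller.
```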
GRU — fewer gates
Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling (Chung, Gulcehre, Cho, Bengio, NIPS DLW 2014) proposed the Gated Recurrent Unit, which merges the cell state into the hidden state and uses two gates instead of three:
$$
\begin{aligned}
z_t &= \sigma(W_z\,[h_{t-1}, x_t] + b_z) && \text{(update gate)} \\
r_t &= \sigma(W_r\,[h_{t-1}, x_t] + b_r) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh(W_h\,[r_t \odot h_{t-1}, x_t] + b_h) && \text{(candidate)} \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
$$

GRU has ~25% fewer parameters than an LSTM of the same width (three weight matrices instead of four) and no separate cell state, so it is slightly cheaper per step; in practice the two perform comparably, with neither consistently ahead.
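The same kind of sketch for a GRU step, again illustrative rather than a library implementation:

```python
import torch
import torch.nn as nn

class GRUCell(nn.Module):
    """One step of a GRU, following the equations above."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # Update and reset gates share one matrix; the candidate needs its own
        # because it sees the reset-gated hidden state.
        self.W_gates = nn.Linear(input_size + hidden_size, 2 * hidden_size)
        self.W_cand = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x, h_prev):
        zr = torch.sigmoid(self.W_gates(torch.cat([h_prev, x], dim=-1)))
        z, r = zr.chunk(2, dim=-1)
        h_tilde = torch.tanh(self.W_cand(torch.cat([r * h_prev, x], dim=-1)))
        # Interpolate between keeping the old state and adopting the candidate.
        return (1 - z) * h_prev + z * h_tilde
```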
Variants worth mentioning
- Peephole connections (Gers, Schmidhuber, 2000) — let gates see the cell state. Modest gains, rarely used today.
- Bi-LSTM — run one LSTM forward and one backward, concatenate hidden states. The default for sequence labelling and the encoder side of pre-Transformer machine translation.
- Stacked LSTMs — multiple LSTM layers, where the hidden-state sequence of layer $\ell$ feeds layer $\ell+1$. 2–4 layers was the sweet spot before residual connections became standard; a sketch of the stacked and bidirectional variants follows this list.
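A quick sketch of the Bi-LSTM and stacked variants using PyTorch's built-in nn.LSTM; the sizes here are arbitrary illustrations:

```python
import torch
import torch.nn as nn

# Two stacked layers, each run forward and backward over the sequence.
encoder = nn.LSTM(input_size=128, hidden_size=256,
                  num_layers=2, bidirectional=True, batch_first=True)

x = torch.randn(8, 50, 128)          # (batch, time, features)
out, (h_n, c_n) = encoder(x)
print(out.shape)   # (8, 50, 512): forward and backward states concatenated
print(h_n.shape)   # (4, 8, 256): num_layers * num_directions final states
```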
Why LSTMs lost to Transformers
LSTMs ruled sequence modelling from 2014 to 2017. The Transformer's advantage is twofold: parallelism (every step can be computed simultaneously, vs LSTM's strict left-to-right) and direct long-range access (self-attention is O(1) hops between any two positions, vs LSTM's O(n)). The first matters for training throughput; the second matters for both optimisation and modelling quality, since even with gating LSTMs do degrade on very long contexts.
LSTMs persist in low-latency streaming applications (online ASR, time-series), small-data settings (classical sequence-labelling), and as the recurrent component of some Mamba/SSM-Transformer hybrids.
What to read next
- Vanilla RNNs — the failure mode this design fixes.
- Sequence-to-Sequence — encoder–decoder architectures built on LSTMs.
- Bahdanau Attention — what eventually replaced the LSTM context bottleneck.