
LSTM & GRU

Vanilla RNNs struggle to learn dependencies more than about 10 steps apart because gradients vanish through the repeated tanh recurrence. The Long Short-Term Memory cell was designed in 1997 specifically to solve this: a gated, additive memory cell whose gradient flows essentially unchanged through time. The GRU is a 2014 simplification that costs less and often performs comparably.

LSTM — the gated cell

Long Short-Term Memory (Hochreiter & Schmidhuber, Neural Computation 1997) introduces a separate cell state $c_t$ alongside the hidden state $h_t$. The cell state is updated additively, not multiplicatively, with three learned gates controlling what gets written, kept, and read:

$$
\begin{aligned}
f_t &= \sigma(W_f\,[h_{t-1}, x_t] + b_f) &&\text{(forget gate)}\\
i_t &= \sigma(W_i\,[h_{t-1}, x_t] + b_i) &&\text{(input gate)}\\
\tilde c_t &= \tanh(W_c\,[h_{t-1}, x_t] + b_c) &&\text{(candidate cell)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde c_t\\
o_t &= \sigma(W_o\,[h_{t-1}, x_t] + b_o) &&\text{(output gate)}\\
h_t &= o_t \odot \tanh(c_t).
\end{aligned}
$$

The crucial line is $c_t = f_t \odot c_{t-1} + i_t \odot \tilde c_t$. If $f_t = 1$ and $i_t = 0$, the cell state passes through unchanged, and so does its gradient. Long-range gradient flow becomes possible: training is far more stable, and dependencies spanning 100+ steps become tractable.
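To make the shapes concrete, here is a minimal NumPy sketch of a single LSTM step following the equations above. The stacked weight matrix `W`, the gate ordering, and the toy dimensions are illustrative choices, not the layout of any particular library.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps the concatenated [h_prev, x_t] to the four
    stacked gate pre-activations (f, i, candidate, o); b is the matching bias."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b   # shape (4H,)
    f = sigmoid(z[0:H])                         # forget gate
    i = sigmoid(z[H:2*H])                       # input gate
    c_tilde = np.tanh(z[2*H:3*H])               # candidate cell
    o = sigmoid(z[3*H:4*H])                     # output gate
    c_t = f * c_prev + i * c_tilde              # additive cell update
    h_t = o * np.tanh(c_t)
    return h_t, c_t

# Toy usage: input dim 8, hidden dim 16.
rng = np.random.default_rng(0)
D, H = 8, 16
W = rng.normal(scale=0.1, size=(4 * H, H + D))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=D), h, c, W, b)
```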

Forget-gate bias init: a standard trick is to initialise $b_f = 1$, so the forget gate starts close to 1 and the cell mostly remembers at the start of training (Jozefowicz et al., 2015).
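In PyTorch, for instance, the gates within each bias vector are laid out in the order input, forget, candidate, output, so the trick amounts to filling the second quarter of the bias. A minimal sketch (the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=64, hidden_size=128)   # sizes chosen for illustration
H = lstm.hidden_size

with torch.no_grad():
    for name, bias in lstm.named_parameters():
        if name.startswith("bias_ih"):
            # Forget-gate slice; PyTorch orders the gates i, f, g, o.
            bias[H:2 * H].fill_(1.0)
```

The second bias vector (`bias_hh`) keeps its small default initialisation, so the effective forget-gate bias starts near 1.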

Hochreiter and Schmidhuber called the additive cell-state path the constant error carousel — gradients can flow back through arbitrarily many steps if the forget gates stay near 1. This is conceptually the same trick as residual connections in ResNet: an identity-like path through the depth/time axis that backprop can use without vanishing.
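Differentiating the additive update makes this precise. Keeping only the direct path through the cell state (and ignoring the gates' own dependence on earlier states),

$$
\frac{\partial c_t}{\partial c_{t-1}} = \operatorname{diag}(f_t)
\qquad\Rightarrow\qquad
\frac{\partial c_t}{\partial c_{t-k}} \approx \prod_{s=t-k+1}^{t} \operatorname{diag}(f_s),
$$

which stays close to the identity as long as the forget gates stay near 1, instead of shrinking geometrically the way repeated tanh-recurrence Jacobians do.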

GRU — fewer gates

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling (Chung, Gulcehre, Cho, Bengio, NIPS DLW 2014) proposed the Gated Recurrent Unit, which merges the cell state into the hidden state and uses two gates instead of three:

$$
\begin{aligned}
r_t &= \sigma(W_r\,[h_{t-1}, x_t]) &&\text{(reset gate)}\\
z_t &= \sigma(W_z\,[h_{t-1}, x_t]) &&\text{(update gate)}\\
\tilde h_t &= \tanh(W_h\,[r_t \odot h_{t-1}, x_t])\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde h_t.
\end{aligned}
$$
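The same kind of minimal NumPy sketch for one GRU step, mirroring the equations above (again the weight names and concatenation layout are just illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_r, W_z, W_h):
    """One GRU step; each W_* maps a concatenated vector to H outputs."""
    r = sigmoid(W_r @ np.concatenate([h_prev, x_t]))             # reset gate
    z = sigmoid(W_z @ np.concatenate([h_prev, x_t]))             # update gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]))   # candidate
    return (1.0 - z) * h_prev + z * h_tilde                      # interpolate old/new
```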

A GRU has roughly 3/4 of the parameters and ~25% fewer multiply-adds per step than an LSTM of the same size. Empirically, GRUs match LSTMs on most benchmarks (Chung et al. 2014; Greff et al. 2017), with LSTMs slightly better on language modelling and GRUs slightly better on smaller-data tasks.
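The parameter ratio follows from counting the gate transforms: an LSTM has four weight matrices over $[h_{t-1}, x_t]$, a GRU has three, so (counting weights and biases the same way, and ignoring library-specific bias layouts) the GRU carries about 3/4 of the weights:

```python
def lstm_params(d_in, d_h):
    # 4 gates, each a linear map over [h, x] plus a bias
    return 4 * (d_h * (d_in + d_h) + d_h)

def gru_params(d_in, d_h):
    # 3 transforms: reset, update, candidate
    return 3 * (d_h * (d_in + d_h) + d_h)

print(gru_params(256, 512) / lstm_params(256, 512))   # 0.75
```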

Variants worth mentioning

  • Peephole connections (Gers, Schmidhuber, 2000) — let gates see the cell state. Modest gains, rarely used today.
  • Bi-LSTM — run one LSTM forward and one backward, concatenate hidden states. The default for sequence labelling and the encoder side of pre-Transformer machine translation.
  • Stacked LSTMs — multiple LSTM layers, with the output of layer $\ell$ feeding layer $\ell+1$. 2–4 layers was the sweet spot before residual connections became standard (both stacking and bidirectionality are shown in the sketch after this list).
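Both stacking and bidirectionality are single constructor flags in PyTorch; a minimal sketch with arbitrary sizes:

```python
import torch
import torch.nn as nn

# Two stacked layers, run forward and backward; the per-step outputs of the
# two directions are concatenated, so the feature dimension is 2 * hidden_size.
bilstm = nn.LSTM(input_size=128, hidden_size=256,
                 num_layers=2, bidirectional=True, batch_first=True)

x = torch.randn(8, 50, 128)       # (batch, seq_len, input_size)
out, (h_n, c_n) = bilstm(x)
print(out.shape)                  # torch.Size([8, 50, 512])
```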

Why LSTMs lost to Transformers

LSTMs ruled sequence modelling from 2014 to 2017. The Transformer's advantage is twofold: parallelism (every step can be computed simultaneously, vs LSTM's strict left-to-right) and direct long-range access (self-attention is O(1) hops between any two positions, vs LSTM's O(n)). Both matter for training throughput, and the second matters for modelling — even with gating, LSTMs do degrade on very long contexts.

LSTMs persist in low-latency streaming applications (online ASR, time-series), small-data settings (classical sequence-labelling), and as the recurrent component of some Mamba/SSM-Transformer hybrids.
