Vanilla RNNs
A recurrent neural network processes a sequence one element at a time, maintaining a hidden state that carries information forward. The vanilla RNN is the simplest version: the same weight matrices applied at every step. Conceptually clean, painful to train at long horizons, and the gateway concept for everything from LSTMs and seq2seq to the original attention mechanism.
The recurrence
Given an input sequence $x_1, \dots, x_T$, the hidden state evolves as

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h),$$

with output

$$y_t = W_{hy} h_t + b_y.$$

The same parameters $W_{xh}$, $W_{hh}$, $W_{hy}$, $b_h$, $b_y$ are reused at every time step.
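To make the recurrence concrete, here is a minimal NumPy sketch of the forward pass; the dimensions, initialisation scale, and variable names are illustrative choices, not prescribed by the text above.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y, h0):
    """Run the tanh recurrence over a list of input vectors x_1..x_T."""
    h = h0
    hs, ys = [], []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)  # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
        y = W_hy @ h + b_y                      # y_t = W_hy h_t + b_y
        hs.append(h)
        ys.append(y)
    return hs, ys

# Example: 3-dim inputs, 5-dim hidden state, 2-dim outputs, T = 4 steps.
rng = np.random.default_rng(0)
D, H, O, T = 3, 5, 2, 4
params = dict(
    W_xh=rng.normal(size=(H, D)) * 0.1,
    W_hh=rng.normal(size=(H, H)) * 0.1,
    W_hy=rng.normal(size=(O, H)) * 0.1,
    b_h=np.zeros(H),
    b_y=np.zeros(O),
)
xs = [rng.normal(size=D) for _ in range(T)]
hs, ys = rnn_forward(xs, h0=np.zeros(H), **params)
```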
Backpropagation through time (BPTT)
The gradient with respect to a weight that participates at every step is the sum of contributions from every step:

$$\frac{\partial \mathcal{L}}{\partial W_{hh}} = \sum_{t=1}^{T} \sum_{k=1}^{t} \frac{\partial \mathcal{L}_t}{\partial h_t} \left( \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} \right) \frac{\partial^{+} h_k}{\partial W_{hh}},$$

where $\partial^{+} h_k / \partial W_{hh}$ is the immediate derivative at step $k$, holding $h_{k-1}$ fixed. The product of Jacobians inside the parentheses is the entire problem. For long sequences, that product of $t - k$ factors either shrinks toward zero or grows without bound, depending on the norms of the individual Jacobians.
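Using the same assumed setup as the forward-pass sketch and a per-step squared-error loss (an illustrative choice, not from the text), a hand-rolled BPTT pass makes the sum-over-steps structure and the repeated $W_{hh}^\top$ factor explicit.

```python
import numpy as np

def bptt(xs, targets, hs, ys, W_xh, W_hh, W_hy, b_h, b_y, h0):
    """Gradients of L = sum_t 0.5 * ||y_t - target_t||^2 for the tanh recurrence."""
    dW_xh, dW_hh, dW_hy = np.zeros_like(W_xh), np.zeros_like(W_hh), np.zeros_like(W_hy)
    db_h, db_y = np.zeros_like(b_h), np.zeros_like(b_y)
    dh_next = np.zeros_like(hs[0])           # gradient flowing back from step t+1
    for t in reversed(range(len(xs))):
        dy = ys[t] - targets[t]              # dL_t/dy_t for squared error
        dW_hy += np.outer(dy, hs[t])
        db_y += dy
        dh = W_hy.T @ dy + dh_next           # output gradient plus gradient from the future
        dpre = (1.0 - hs[t] ** 2) * dh       # backprop through tanh
        dW_xh += np.outer(dpre, xs[t])
        h_prev = hs[t - 1] if t > 0 else h0
        dW_hh += np.outer(dpre, h_prev)      # one per-step contribution of the outer sum
        db_h += dpre
        dh_next = W_hh.T @ dpre              # one more factor of the Jacobian product
    return dW_xh, dW_hh, dW_hy, db_h, db_y
```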
Vanishing and exploding gradients
For the tanh recurrence, the one-step Jacobian is

$$\frac{\partial h_t}{\partial h_{t-1}} = \mathrm{diag}\!\left(1 - h_t^2\right) W_{hh}.$$

- If the norm of this Jacobian stays below 1, gradients vanish exponentially in $t - k$. Long-range dependencies become invisible.
- If the product is repeatedly larger than 1 in norm, gradients explode. Training diverges.
For tanh, $|\tanh'(z)| = 1 - \tanh^2(z) \le 1$, so each step's Jacobian has spectral norm at most $\lVert W_{hh} \rVert_2$; unless $W_{hh}$ has singular values well above 1, the product contracts and vanishing is the default failure mode.
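One way to see the contraction is to accumulate the product of step Jacobians from a forward run and track its spectral norm; this sketch assumes the hs and W_hh produced by the forward-pass example above.

```python
import numpy as np

def jacobian_product_norms(hs, W_hh):
    """Spectral norm of prod_{i=k+1}^{T} dh_i/dh_{i-1} as k moves backward from T."""
    J = np.eye(W_hh.shape[0])
    norms = []
    for h in reversed(hs[1:]):                   # h_T, h_{T-1}, ..., h_2
        J = J @ (np.diag(1.0 - h ** 2) @ W_hh)   # append one tanh-step Jacobian
        norms.append(np.linalg.norm(J, 2))       # largest singular value
    return norms  # typically decays toward 0 (vanishing) unless W_hh is large
```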
Gradient clipping is the standard runtime patch for explosions: clip the global gradient norm to a threshold $\theta$, rescaling $g \leftarrow \theta\, g / \lVert g \rVert$ whenever $\lVert g \rVert > \theta$, before each parameter update. It does nothing for vanishing gradients.
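A minimal global-norm clipping helper, assuming gradients arrive as a list of arrays; the threshold of 5.0 is an arbitrary illustrative value, and PyTorch's torch.nn.utils.clip_grad_norm_ performs the equivalent operation in place.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale all gradients jointly if their combined L2 norm exceeds max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads
```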
Truncated BPTT
Backpropagating through 1000+ steps is prohibitive in memory (every hidden state must be stored) and poor in gradient quality. The standard hack is truncated BPTT: back-propagate through only the last $k$ steps of each window, carrying the hidden state forward between windows but treating it as a constant at each window boundary.
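Truncated BPTT is easiest to show with an autograd framework. The PyTorch sketch below is an assumed setup (layer sizes, window length, learning rate, and synthetic data are all illustrative); the key move is the detach() that stops gradients at each window boundary while the hidden state itself is still carried forward.

```python
import torch
import torch.nn as nn

k = 35                                            # truncation window (illustrative)
model = nn.RNN(input_size=8, hidden_size=32, batch_first=True)
readout = nn.Linear(32, 8)
opt = torch.optim.SGD(list(model.parameters()) + list(readout.parameters()), lr=1e-2)

seq = torch.randn(1, 1000, 8)                     # one long synthetic sequence
target = torch.randn(1, 1000, 8)
h = torch.zeros(1, 1, 32)                         # (num_layers, batch, hidden)

for start in range(0, seq.size(1), k):
    chunk = seq[:, start:start + k]
    tgt = target[:, start:start + k]
    h = h.detach()                                # cut the graph: no gradient past this window
    out, h = model(chunk, h)
    loss = ((readout(out) - tgt) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)  # also guard against explosions
    opt.step()
```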
What vanilla RNNs are useful for
Despite the training problems, vanilla RNNs:
- Are expressive: Turing-complete given unbounded precision (Siegelmann & Sontag, 1991).
- Are useful for short sequences (text < 20 tokens, audio frames within a phoneme).
- Are conceptually the right starting point for understanding all recurrent architectures.
- Serve as the baseline against which LSTMs/GRUs/Transformers are measured.
In modern practice, almost no production system uses a vanilla RNN — but the recurrence is the foundation for the gated variants and Mamba/SSM that bring recurrence back at scale.
What to read next
- LSTM & GRU — gated recurrences that solve the vanishing-gradient problem.
- Sequence-to-Sequence — encoder–decoder architectures built on RNNs.
- Backpropagation — the underlying gradient algorithm BPTT specialises.