
Vanilla RNNs

A recurrent neural network processes a sequence one element at a time, maintaining a hidden state that carries information forward. The vanilla RNN is the simplest version — one tied weight matrix applied at every step. Conceptually clean, painful to train at long horizons, and the gateway concept for everything from LSTMs and seq2seq to the original attention mechanism.

The recurrence

Given an input sequence $x_1, x_2, \ldots, x_T$, a vanilla RNN computes hidden states

$$h_t = \phi(W_{xh} x_t + W_{hh} h_{t-1} + b_h),$$

with output $y_t = W_{hy} h_t + b_y$ and $\phi$ usually $\tanh$. The same weights $W_{xh}, W_{hh}, W_{hy}$ are used at every step — parameter sharing across time, the structural analogue of weight sharing across space in a CNN.
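To make the recurrence concrete, here is a minimal NumPy sketch of the forward pass. The sizes (input dim 4, hidden dim 8), the random weights, and the weight scale are illustrative assumptions, not canonical choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 8  # assumed toy sizes

W_xh = rng.normal(scale=0.1, size=(d_h, d_in))  # input -> hidden
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))   # hidden -> hidden, tied across time
W_hy = rng.normal(scale=0.1, size=(d_in, d_h))  # hidden -> output
b_h, b_y = np.zeros(d_h), np.zeros(d_in)

def rnn_forward(xs):
    """h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h); y_t = W_hy h_t + b_y."""
    h = np.zeros(d_h)  # h_0 = 0
    ys = []
    for x in xs:       # the same three weight matrices at every step
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        ys.append(W_hy @ h + b_y)
    return np.stack(ys), h

ys, h_T = rnn_forward(rng.normal(size=(10, d_in)))  # length-10 toy sequence
```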

Backpropagation through time (BPTT)

The gradient with respect to a weight that participates at every step is the sum of contributions from every step:

$$\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T} \sum_{k=1}^{t} \frac{\partial L_t}{\partial h_t} \left( \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \right) \frac{\partial h_k}{\partial W_{hh}}.$$

The product of Jacobians inside the parentheses is the entire problem. For long sequences $T$ is large, and an iterated matrix product either collapses toward zero (vanishing gradient) or blows up (exploding gradient).
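To see where the product comes from, consider a loss $L_3$ incurred only at step $t = 3$; unrolling the chain rule over the three applications of $W_{hh}$ gives

$$\frac{\partial L_3}{\partial W_{hh}} = \frac{\partial L_3}{\partial h_3}\left(\frac{\partial h_3}{\partial W_{hh}} + \frac{\partial h_3}{\partial h_2}\frac{\partial h_2}{\partial W_{hh}} + \frac{\partial h_3}{\partial h_2}\frac{\partial h_2}{\partial h_1}\frac{\partial h_1}{\partial W_{hh}}\right),$$

where $\partial h_k / \partial W_{hh}$ denotes the direct single-step dependence. The $k = 1$ term already carries a two-factor Jacobian product; with $T = 100$, the earliest term would carry ninety-nine.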

Vanishing and exploding gradients

The Jacobian is $\partial h_j / \partial h_{j-1} = W_{hh}^{\top}\,\mathrm{diag}(\phi'(\cdot))$. In an iterated product of $t$ such matrices:

  • If $\|W_{hh}\|_2 \cdot \max|\phi'| < 1$, gradients vanish exponentially in $t$. Long-range dependencies become invisible.
  • If the product grows beyond $1$, gradients explode. Training diverges.

For $\tanh$, $|\phi'| \le 1$ everywhere, so vanishing dominates — vanilla RNNs cannot reliably learn dependencies more than ~10 steps apart. This was the central practical limitation that motivated LSTMs.
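A quick numeric sketch makes the decay visible (toy sizes, random weights, and the spectral norm of $0.9$ are all assumptions): accumulate the Jacobian product for a $\tanh$ RNN and watch its norm shrink.

```python
import numpy as np

# Sketch: accumulate the BPTT Jacobian product for a tanh RNN whose
# W_hh is rescaled to spectral norm 0.9 (a deliberately contractive value).
rng = np.random.default_rng(0)
d_h = 8
W = rng.normal(size=(d_h, d_h))
W_hh = 0.9 * W / np.linalg.norm(W, 2)    # spectral norm exactly 0.9

h = np.zeros(d_h)
J = np.eye(d_h)                          # will hold prod_j dh_j/dh_{j-1}
xs = rng.normal(scale=0.5, size=(50, d_h))
for t, x in enumerate(xs, start=1):
    h = np.tanh(W_hh @ h + x)
    # One factor: diag(tanh'(a_t)) @ W_hh, using tanh'(a) = 1 - tanh(a)^2.
    # (Same spectral norm as the transposed form in the text.)
    J = ((1 - h**2)[:, None] * W_hh) @ J
    if t in (1, 10, 25, 50):
        print(t, np.linalg.norm(J, 2))   # decays like 0.9^t or faster
```

Since $|\tanh'| \le 1$, each factor here has spectral norm at most $0.9$, so the product norm is bounded by $0.9^t$; rescaling $W_{hh}$ well above spectral norm $1$ tends to flip the same loop into the exploding regime.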

Gradient clipping is the standard runtime patch for explosions: rescale the gradient to a maximum norm before each optimizer step. It does nothing for vanishing.
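A minimal PyTorch sketch of that patch, assuming a toy model, a placeholder loss, and an arbitrary maximum norm of 1.0:

```python
import torch

model = torch.nn.RNN(input_size=4, hidden_size=8, batch_first=True)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(2, 20, 4)   # (batch, time, features) toy batch
y, _ = model(x)
loss = y.pow(2).mean()      # placeholder loss

opt.zero_grad()
loss.backward()
# Rescale the global gradient norm down to 1.0 if it exceeds 1.0;
# this caps explosions but cannot restore vanished gradients.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```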

Truncated BPTT

Backpropagating through 1000+ steps is prohibitive both in memory (every hidden state must be stored) and in gradient quality. The standard hack is truncated BPTT: backpropagate through only the last $K$ steps (typically $K = 32$–$200$), even if the forward pass uses much longer histories. This trades a biased gradient for a tractable one and is the workhorse for language-model training with RNNs.
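A sketch of the pattern in PyTorch, with a toy model and an assumed window of $K = 50$: `detach()` carries the hidden state forward as a value while cutting the autograd graph, so each backward pass spans at most $K$ steps.

```python
import torch

K = 50  # truncation window (assumed)
model = torch.nn.RNN(input_size=4, hidden_size=8, batch_first=True)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

seq = torch.randn(2, 1000, 4)        # one long sequence: (batch, T, features)
h = None                             # initial hidden state (zeros by default)
for start in range(0, seq.size(1), K):
    y, h = model(seq[:, start:start + K], h)
    loss = y.pow(2).mean()           # placeholder loss
    opt.zero_grad()
    loss.backward()                  # gradients flow through this chunk only
    opt.step()
    h = h.detach()                   # keep the state, drop its history
```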

What vanilla RNNs are useful for

Despite the training problems, vanilla RNNs:

  • Are expressive — Turing-complete given infinite precision (Siegelmann & Sontag, 1991).
  • Are useful for short sequences (text < 20 tokens, audio frames within a phoneme).
  • Are conceptually the right starting point for understanding all recurrent architectures.
  • Serve as the baseline against which LSTMs/GRUs/Transformers are measured.

In modern practice, almost no production system uses a vanilla RNN — but the recurrence is the foundation for the gated variants and for state-space models such as Mamba that bring recurrence back at scale.
