Bahdanau Attention

The single fixed context vector c in a seq2seq translator is an information bottleneck — long source sentences cannot be encoded into one 1024-d vector without information loss. Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau, Cho, Bengio, ICLR 2015) introduced the attention mechanism that fixed this, and in doing so set the conceptual foundation for the Transformer.

The mechanism

Replace the single context $c$ with a per-step context $c_t$, computed as a weighted sum over all encoder hidden states. Given encoder outputs $h_1^{\text{enc}}, \dots, h_T^{\text{enc}}$ and the decoder state $s_{t-1}$ from the previous step, compute:

$$e_{t,j} = v_a^\top \tanh\!\left(W_a s_{t-1} + U_a h_j^{\text{enc}}\right), \qquad \alpha_{t,j} = \frac{\exp(e_{t,j})}{\sum_{k=1}^{T} \exp(e_{t,k})}, \qquad c_t = \sum_{j=1}^{T} \alpha_{t,j} h_j^{\text{enc}}.$$

The decoder then generates token $y_t$ conditioned on $s_{t-1}$, $y_{t-1}$, and $c_t$. The scoring function $e_{t,j}$ is a small MLP (what would later be called additive attention, as distinct from the dot-product attention used in the Transformer).
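To make the step concrete, here is a minimal NumPy sketch of one decoder step of additive attention. The sizes (T, d_enc, d_dec, d_att) are illustrative assumptions, and the random matrices stand in for the learned parameters $W_a$, $U_a$, $v_a$:

```python
import numpy as np

# One decoder step of Bahdanau (additive) attention.
# Toy sizes; random weights are stand-ins for learned parameters.
rng = np.random.default_rng(0)
T, d_enc, d_dec, d_att = 7, 16, 16, 8

h_enc = rng.standard_normal((T, d_enc))   # encoder states h_1 .. h_T
s_prev = rng.standard_normal(d_dec)       # previous decoder state s_{t-1}

W_a = rng.standard_normal((d_att, d_dec))
U_a = rng.standard_normal((d_att, d_enc))
v_a = rng.standard_normal(d_att)

# MLP score per source position: e_{t,j} = v_a^T tanh(W_a s_{t-1} + U_a h_j)
e = np.tanh(s_prev @ W_a.T + h_enc @ U_a.T) @ v_a   # shape (T,)

# Softmax over source positions gives the soft alignment alpha_{t,j}
alpha = np.exp(e - e.max())
alpha /= alpha.sum()

# Per-step context: weighted sum of encoder states
c_t = alpha @ h_enc                                 # shape (d_enc,)
```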

What it bought

Three immediate gains over fixed-context seq2seq:

  • Long-sentence quality. BLEU on long sentences stopped degrading. The decoder can attend back to a source word even after producing dozens of output tokens.
  • Interpretability. Plotting $\alpha_{t,j}$ as a heatmap shows soft alignments between target and source tokens; for the first time, neural translators were inspectable.
  • Generality. The mechanism is not specific to translation. Any task with a sequence query and a sequence memory can use it: image captioning (attend over image regions; Show, Attend and Tell, Xu et al., ICML 2015), question answering, summarisation.

Bahdanau attention was the first time a network could dynamically pick which inputs to look at based on its current state, rather than having to encode everything into a fixed bottleneck. That's the conceptual lever the rest of the field then pulled on for a decade.

Luong attention — the dot-product variant

Effective Approaches to Attention-based Neural Machine Translation (Luong, Pham, Manning, EMNLP 2015) followed up with three refinements:

  1. Score with a dot product $e_{t,j} = s_t^\top h_j^{\text{enc}}$ or a bilinear ("general") form $s_t^\top W_a h_j^{\text{enc}}$, instead of an MLP (both are sketched after this list).
  2. Use $s_t$ (the current decoder state) rather than $s_{t-1}$.
  3. Add local attention: restrict the attention window around a learned alignment position.
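Under the same toy setup as the Bahdanau sketch above (illustrative sizes, random stand-in weights), the dot and general scores each reduce to one line:

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 7, 16
h_enc = rng.standard_normal((T, d))   # encoder states h_1 .. h_T
s_t = rng.standard_normal(d)          # current decoder state s_t (not s_{t-1})
W_a = rng.standard_normal((d, d))     # learned in practice; random stand-in here

e_dot = h_enc @ s_t                   # dot:     e_{t,j} = s_t . h_j  (no parameters)
e_general = (h_enc @ W_a.T) @ s_t     # general: e_{t,j} = s_t^T W_a h_j

# From here the computation is identical to Bahdanau attention:
alpha = np.exp(e_dot - e_dot.max())
alpha /= alpha.sum()
c_t = alpha @ h_enc
```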

Luong's dot-product attention is structurally what the Transformer's self-attention uses (with the $\sqrt{d_k}$ scaling and multi-head projections added on top). The "scaled dot-product attention" of Attention Is All You Need is essentially Luong attention applied to itself.
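To make that continuity visible, here is a minimal single-head sketch of scaled dot-product self-attention, again with toy sizes and random stand-in projections:

```python
import numpy as np

rng = np.random.default_rng(2)
T, d_model, d_k = 7, 16, 8
x = rng.standard_normal((T, d_model))     # one sequence attending to itself

W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_k))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Luong-style dot scores between every pair of positions, scaled by sqrt(d_k)
scores = Q @ K.T / np.sqrt(d_k)           # shape (T, T)

# Row-wise softmax: position i's alignment over all positions j
alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
alpha /= alpha.sum(axis=-1, keepdims=True)

out = alpha @ V                           # per-position context vectors
```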

From RNN+attention to Transformer

For three years (2014–2017), the dominant architecture was RNN encoder + attention + RNN decoder. The Transformer (Vaswani et al., NeurIPS 2017) made the radical move of removing the RNNs entirely — keeping only attention, layered repeatedly within both encoder and decoder. The argument: attention already provides direct long-range access, and removing the recurrence makes the architecture trivially parallelisable.

Bahdanau attention is the conceptual origin of self-attention. Reading the 2015 paper alongside the 2017 Transformer paper makes the architectural lineage clear: every attention head in a Transformer is a Bahdanau-style soft-alignment computation, applied between every pair of positions instead of just between decoder and encoder.
