Bahdanau Attention
The single fixed context vector of encoder–decoder seq2seq has to summarise the entire source sentence, and translation quality drops as sentences grow longer. Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau, Cho, Bengio, ICLR 2015) removed that bottleneck by letting the decoder consult the full encoder output at every step.
The mechanism
Replace the single context vector with a per-step context $c_i$. At decoder step $i$, score each encoder annotation $h_j$ against the previous decoder state $s_{i-1}$ with a small MLP, $e_{ij} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$; normalise the scores with a softmax, $\alpha_{ij} = \exp(e_{ij}) / \sum_k \exp(e_{ik})$; and take the weighted sum $c_i = \sum_j \alpha_{ij} h_j$.
The decoder then generates token $y_i$ from its updated state $s_i = f(s_{i-1}, y_{i-1}, c_i)$, so the context is recomputed fresh for every output token.
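A minimal NumPy sketch of one such step may make the computation concrete. The toy dimensions and usage are illustrative; $W_a$, $U_a$, $v_a$ follow the paper's notation for the alignment-model parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())        # shift for numerical stability
    return e / e.sum()

def bahdanau_step(s_prev, H, W_a, U_a, v_a):
    """One decoder step of additive (Bahdanau) attention.

    s_prev : (d_s,)   previous decoder state s_{i-1}
    H      : (T, d_h) encoder annotations h_1..h_T
    W_a    : (d_a, d_s), U_a : (d_a, d_h), v_a : (d_a,) learned parameters
    """
    # e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j): one score per source position
    scores = np.tanh(W_a @ s_prev + H @ U_a.T) @ v_a   # (T,)
    alpha = softmax(scores)                            # soft alignment weights
    context = alpha @ H                                # c_i = sum_j alpha_ij h_j
    return context, alpha

# Toy usage with random parameters.
rng = np.random.default_rng(0)
T, d_h, d_s, d_a = 6, 8, 8, 10
H = rng.normal(size=(T, d_h))
s_prev = rng.normal(size=d_s)
W_a = rng.normal(size=(d_a, d_s))
U_a = rng.normal(size=(d_a, d_h))
v_a = rng.normal(size=d_a)
context, alpha = bahdanau_step(s_prev, H, W_a, U_a, v_a)
assert np.isclose(alpha.sum(), 1.0)
```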
What it bought
Three immediate gains over fixed-context seq2seq:
- Long-sentence quality. BLEU on long sentences stopped degrading. The decoder can attend back to a source word even after producing dozens of output tokens.
- Interpretability. Plotting the weights $\alpha_{ij}$ as a heatmap shows soft alignments between target and source tokens; for the first time, neural translators were inspectable (a plotting sketch follows this list).
- Generality. The mechanism is not specific to translation. Any task with a sequence query and a sequence memory can use it: image captioning (attend over image regions; Show, Attend and Tell, Xu et al., ICML 2015), question answering, summarisation.
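A quick matplotlib sketch of such an alignment heatmap, assuming an `alpha` matrix of attention weights collected during decoding (the function and argument names here are illustrative):

```python
import matplotlib.pyplot as plt

def plot_alignment(alpha, src_tokens, tgt_tokens):
    """alpha: (len(tgt_tokens), len(src_tokens)) attention weights,
    one row per generated target token."""
    fig, ax = plt.subplots()
    ax.imshow(alpha, cmap="gray_r")            # darker cell = more attention mass
    ax.set_xticks(range(len(src_tokens)))
    ax.set_xticklabels(src_tokens, rotation=90)
    ax.set_yticks(range(len(tgt_tokens)))
    ax.set_yticklabels(tgt_tokens)
    ax.set_xlabel("source")
    ax.set_ylabel("target")
    plt.show()
```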
Bahdanau attention was the first time a network could dynamically pick which inputs to look at based on its current state, rather than having to encode everything into a fixed bottleneck. That's the conceptual lever the rest of the field then pulled on for a decade.
Luong attention — the dot-product variant
Effective Approaches to Attention-based Neural Machine Translation (Luong, Pham, Manning, EMNLP 2015) followed up with three modifications:
- Score with a dot product $s_t^\top h_s$, or a bilinear "general" form $s_t^\top W h_s$, instead of an MLP (both sketched after this list).
- Use $s_t$ (the current decoder state) rather than $s_{t-1}$.
- Add local attention: restrict the attention window around a learned alignment position $p_t$.
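A NumPy sketch of the two global score variants, assuming the encoder and decoder dimensions match so the plain dot product is defined (names are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def luong_weights(s_t, H, W=None):
    """Attention weights over source annotations H (T, d) for the *current*
    decoder state s_t (d,). W is None for the dot score s_t^T h_s, or a
    learned (d, d) matrix for the bilinear "general" score s_t^T W h_s."""
    scores = H @ s_t if W is None else H @ (s_t @ W)   # one score per source row
    return softmax(scores)
```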
Luong's dot-product attention is structurally what the Transformer's self-attention uses (with the $1/\sqrt{d_k}$ scaling added); the sketch below makes the correspondence explicit.
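Batching the same dot score over all query/key pairs and adding the scaling gives, in miniature, the Transformer's attention kernel. A minimal NumPy sketch:

```python
import numpy as np

def scaled_dot_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for queries Q (n, d_k),
    keys K (m, d_k) and values V (m, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # all pairwise dot scores
    scores -= scores.max(axis=-1, keepdims=True)      # stabilise the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # one context per query
```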
From RNN+attention to Transformer
For three years (2014–2017), the dominant architecture was RNN encoder + attention + RNN decoder. The Transformer (Vaswani et al., NeurIPS 2017) made the radical move of removing the RNNs entirely — keeping only attention, layered repeatedly within both encoder and decoder. The argument: attention already provides direct long-range access, and removing the recurrence makes the architecture trivially parallelisable.
Bahdanau attention is the conceptual origin of self-attention. Reading the 2015 paper alongside the 2017 Transformer paper makes the architectural lineage clear: every attention head in a Transformer is a Bahdanau-style soft-alignment computation, applied between every pair of positions instead of just between decoder and encoder.
What to read next
- Sequence-to-Sequence — the bottleneck this attention mechanism removed.
- Transformer (LLM) — the modern descendant where attention is the only operation.
- LSTM & GRU — the recurrent cells the original Bahdanau encoder/decoder used.