Self-Attention, Multi-Head, Positional Encodings
The Transformer architecture from Attention Is All You Need rests on three primitives: self-attention (each position attends to all others), multi-head projection (run several attentions in parallel), and positional encoding (inject order). This page works through the math and the design choices.
Self-attention
Given a sequence of token embeddings $X \in \mathbb{R}^{n \times d_{\text{model}}}$, compute queries, keys, and values with learned projections:

$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$
The output is scaled dot-product attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Reading the formula: each row of the output is a convex combination of the rows of $V$, weighted by the softmax-normalised dot products between that position's query and every key.
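A minimal NumPy sketch of the computation (shapes and names are illustrative, not from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention. Q, K: (n, d_k); V: (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n, n) pairwise similarity logits
    weights = softmax(scores, axis=-1)  # each row is a probability distribution
    return weights @ V                  # (n, d_v): convex combinations of V's rows
```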
Three properties:
- Permutation-equivariant — without positional information, attention treats the input as a set. Order must be added explicitly.
- $O(n^2)$ in sequence length — every pair of positions interacts. The quadratic cost is the central scaling bottleneck (see efficient attention).
- Direct long-range access — any two positions interact in one attention layer. Compare to RNNs' $O(n)$ gradient path length.
Why $\sqrt{d_k}$?
For $q, k \in \mathbb{R}^{d_k}$ with independent zero-mean, unit-variance components, the dot product $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$ has mean $0$ and variance $d_k$, so raw logits grow with dimension. Without rescaling, the softmax saturates: one weight approaches $1$, the rest approach $0$, and the gradients through the softmax vanish. Dividing by $\sqrt{d_k}$ keeps the logit variance at $1$ regardless of dimension.
This is the kind of detail that's easy to miss when reading the paper but critical when implementing. Skipping the $\sqrt{d_k}$ division is a classic bug: training still runs, but the saturated softmax slows or stalls learning at realistic head dimensions.
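A quick check of the variance argument (a sketch; the numbers come from random draws, not the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, trials = 512, 10_000
q = rng.standard_normal((trials, d_k))  # zero-mean, unit-variance entries
k = rng.standard_normal((trials, d_k))
dots = (q * k).sum(axis=1)              # one dot product per trial

print(dots.var())                       # ~512: variance grows with d_k
print((dots / np.sqrt(d_k)).var())      # ~1: scaling restores unit variance
```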
Multi-head attention
Instead of one attention computation in $\mathbb{R}^{d_{\text{model}}}$, project into $h$ lower-dimensional subspaces and attend in parallel:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

with $d_k = d_v = d_{\text{model}}/h$ (the original paper uses $h = 8$ heads of width $64$ for $d_{\text{model}} = 512$), so the total compute is comparable to single-head attention at full width.
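A NumPy sketch of the multi-head computation, folding the per-head projections $W_i^Q, W_i^K, W_i^V$ into one $d_{\text{model}} \times d_{\text{model}}$ matrix per role, as most implementations do (names and shapes are illustrative):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return np.exp(x) / np.exp(x).sum(axis=-1, keepdims=True)

def multi_head(X, Wq, Wk, Wv, Wo, h):
    """X: (n, d_model); Wq, Wk, Wv, Wo: (d_model, d_model); h: number of heads."""
    n, d_model = X.shape
    d_k = d_model // h

    def project(W):  # project, then split features into h heads of width d_k
        return (X @ W).reshape(n, h, d_k).transpose(1, 0, 2)  # (h, n, d_k)

    Q, K, V = project(Wq), project(Wk), project(Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)       # (h, n, n): one map per head
    heads = softmax(scores) @ V                            # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # re-join the heads
    return concat @ Wo                                     # final output projection
```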
Why split? Different heads can specialise:
- Syntactic — track subject-verb agreement.
- Positional — track relative offsets.
- Semantic — track entity relations.
Probing studies (Clark et al., What Does BERT Look At?, 2019) show clean specialisation in some heads. In well-trained large Transformers, though, many heads are partially redundant: Michel et al. (NeurIPS 2019) showed that a large fraction can be pruned after training with little loss in accuracy.
Causal masking
For autoregressive generation (decoder-only LMs), each position must attend only to earlier positions. Implement this with a causal mask added to the pre-softmax logits:

$$\mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V, \qquad M_{ij} = \begin{cases} 0 & \text{if } j \le i \\ -\infty & \text{if } j > i \end{cases}$$

The $-\infty$ entries become zero attention weights after the softmax, so no probability mass flows from future positions.
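A sketch of the mask construction in NumPy (`np.triu` keeps entries on and above the k-th diagonal and zeroes the rest):

```python
import numpy as np

n = 5
# -inf strictly above the diagonal: position i may attend to positions j <= i only.
mask = np.triu(np.full((n, n), -np.inf), k=1)
print(mask)
# Applied as: softmax(Q @ K.T / np.sqrt(d_k) + mask) @ V
# Every row keeps at least one finite entry, so each row remains a valid distribution.
```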
Encoder-decoder attention
In the original encoder-decoder Transformer, the decoder has a third attention sub-layer where queries come from the decoder and keys/values from the encoder output. This is the Bahdanau-attention generalisation in self-attention dress.
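A minimal sketch of the query/key-value split (learned projections omitted for brevity; names are illustrative):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return np.exp(x) / np.exp(x).sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_src, n_tgt, d = 7, 4, 16
enc_out = rng.standard_normal((n_src, d))  # encoder output: supplies keys and values
dec_in  = rng.standard_normal((n_tgt, d))  # decoder stream: supplies queries

scores = dec_in @ enc_out.T / np.sqrt(d)   # (n_tgt, n_src): target attends to source
out = softmax(scores) @ enc_out            # each decoder position mixes source states
print(out.shape)                           # (4, 16)
```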
Positional encodings
The vanilla self-attention layer is permutation-equivariant — no positional information. The original paper adds sinusoidal positional encodings to the token embeddings:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
Each position gets a deterministic vector that varies smoothly with position and dimension. Linear functions of these encodings can express relative offsets, so attention can learn relative-position relations.
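A direct transcription of the formula (a sketch; the resulting matrix is added to the embedding matrix):

```python
import numpy as np

def sinusoidal_pe(n, d_model):
    """PE[pos, 2i] = sin(pos / 10000**(2i/d_model)); PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(n)[:, None]              # (n, 1) positions
    i = np.arange(0, d_model, 2)[None, :]    # (1, d_model/2) even dimension indices
    angles = pos / 10000.0 ** (i / d_model)  # (n, d_model/2)
    pe = np.empty((n, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions
    pe[:, 1::2] = np.cos(angles)             # odd dimensions
    return pe
```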
Modern variants:
- Learned absolute — train one position vector per slot. Used in BERT/GPT-2.
- Relative position bias — T5, ALiBi: bias the attention score for positions $(i, j)$ by a learned or fixed function of the offset $i - j$.
- Rotary Position Embeddings (RoPE) — apply position-dependent rotations to queries and keys in 2D subspaces. Used in LLaMA, Qwen, and most modern open LLMs. See position encodings; a minimal sketch follows this list.
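A minimal RoPE sketch in the interleaved-pairs formulation, applied to queries and keys before the dot product (production implementations such as LLaMA's typically use a split-halves layout instead):

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate consecutive feature pairs of x: (n, d) by position-dependent angles.

    Pair (2i, 2i+1) at position pos is rotated by pos * base**(-2i/d), so the
    dot product of rotated q and k depends only on their relative offset.
    """
    n, d = x.shape
    theta = np.arange(n)[:, None] * base ** (-np.arange(0, d, 2) / d)  # (n, d/2)
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin  # standard 2D rotation, one per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```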
What to read next
- Attention Is All You Need — the broader paper.
- Efficient Attention — how to break the $O(n^2)$ cost.
- Position Encodings (RoPE etc.) — the modern positional schemes.