Self-Attention, Multi-Head, Positional Encodings

The Transformer architecture from Attention Is All You Need rests on three primitives: self-attention (each position attends to all others), multi-head attention (run several attention computations in parallel), and positional encoding (inject order). This page works through the math and the design choices.

Self-attention

Given a sequence of token embeddings $X \in \mathbb{R}^{T \times d}$, project to queries, keys, and values:

$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V, \qquad W^Q, W^K, W^V \in \mathbb{R}^{d \times d_k}.$$

The output is scaled dot-product attention:

$$\operatorname{Attn}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V \in \mathbb{R}^{T \times d_k}.$$

Reading the formula: each row $q_i$ is compared (via dot product) to every key $k_j$, the scores are softmax-normalised across $j$, and the output is the weighted sum of the values $v_j$.
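
The formula maps directly to a few lines of code. A minimal NumPy sketch (function names and shapes are illustrative, not from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention.

    Q, K: (T, d_k), V: (T, d_v). Returns (T, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (T, T) pairwise similarities
    weights = softmax(scores, axis=-1)   # each row sums to 1 over the keys
    return weights @ V                   # weighted sum of value vectors
```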

Three properties:

  • Permutation-equivariant — without positional information, attention treats the input as a set. Order must be added explicitly.
  • $O(T^2)$ in sequence length — every pair of positions interacts. The quadratic cost is the central scaling bottleneck (see efficient attention).
  • Direct long-range access — any two positions interact in one attention layer. Compare to RNNs' $O(T)$ gradient path length.

Why $\sqrt{d_k}$?

For $q, k \in \mathbb{R}^{d_k}$ with i.i.d. zero-mean, unit-variance components, the dot product $q^\top k$ has variance $d_k$. As $d_k$ grows, dot products become large, the softmax becomes sharply peaked, and its gradients collapse towards zero. Dividing by $\sqrt{d_k}$ keeps the dot-product variance at 1 regardless of $d_k$.

This is the kind of detail that's easy to miss when reading the paper but critical when implementing. Skipping the $\sqrt{d_k}$ scaling is a routine source of bugs.
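
A quick numerical check of the variance argument (a standalone sketch, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (16, 64, 256, 1024):
    # Zero-mean, unit-variance components, 10,000 samples per dimensionality.
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))
    dots = (q * k).sum(axis=-1)
    # Unscaled dot products have variance ~ d_k; scaled ones stay near 1.
    print(d_k, round(dots.var(), 1), round((dots / np.sqrt(d_k)).var(), 2))
```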

Multi-head attention

Instead of one attention computation in $\mathbb{R}^d$, run $h$ heads in parallel, each in $\mathbb{R}^{d/h}$:

$$\operatorname{MHA}(X) = \operatorname{Concat}(\operatorname{head}_1, \dots, \operatorname{head}_h)\, W^O,$$

with $\operatorname{head}_i = \operatorname{Attn}(XW_i^Q, XW_i^K, XW_i^V)$. Total parameters and FLOPs are roughly the same as single-head attention with the full $d$, but the representational capacity is split across heads.
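
A minimal sketch of the head split, again in NumPy (the random projection weights are placeholders, not trained parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """X: (T, d); W_q, W_k, W_v, W_o: (d, d); h: number of heads."""
    T, d = X.shape
    d_h = d // h
    # Project once in R^d, then reshape to (h, T, d_h): one subspace per head.
    Q = (X @ W_q).reshape(T, h, d_h).transpose(1, 0, 2)
    K = (X @ W_k).reshape(T, h, d_h).transpose(1, 0, 2)
    V = (X @ W_v).reshape(T, h, d_h).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_h)   # (h, T, T)
    out = softmax(scores) @ V                          # (h, T, d_h)
    # Concatenate heads back to (T, d) and mix with the output projection.
    return out.transpose(1, 0, 2).reshape(T, d) @ W_o

# Example: T=5 tokens, d=16, h=4 heads.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 16))
Ws = [rng.standard_normal((16, 16)) / 4 for _ in range(4)]
print(multi_head_attention(X, *Ws, h=4).shape)  # (5, 16)
```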

Why split? Different heads can specialise:

  • Syntactic — track subject-verb agreement.
  • Positional — track relative offsets.
  • Semantic — track entity relations.

Probing studies (Clark et al., What Does BERT Look At?, 2019) show clean specialisation in some heads. In well-trained large Transformers, heads are also partially redundant — Michel et al. (NeurIPS 2019) showed that many of them can be pruned post-training.

Causal masking

For autoregressive generation (decoder-only LMs), each position must attend only to earlier positions. Implement with a causal mask added to the pre-softmax logits:

$$M_{ij} = \begin{cases} 0 & j \le i \\ -\infty & j > i \end{cases}, \qquad \operatorname{Attn}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right) V.$$

The $-\infty$ becomes 0 after the softmax, eliminating attention to future positions. This is what makes GPT-style models autoregressive.
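
A sketch of building the mask and adding it to the logits (shapes and names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(T):
    # 0 on and below the diagonal (j <= i), -inf strictly above it (j > i).
    mask = np.zeros((T, T))
    mask[np.triu_indices(T, k=1)] = -np.inf
    return mask

def causal_attention(Q, K, V):
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k) + causal_mask(Q.shape[0])
    return softmax(logits) @ V   # row i mixes only values at positions <= i
```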

Encoder-decoder attention

In the original encoder-decoder Transformer, the decoder has a third attention sub-layer where queries come from the decoder and keys/values from the encoder output. This is the Bahdanau-attention generalisation in self-attention dress.
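
The only change from self-attention is where the queries and the keys/values come from. A hedged sketch (decoder_X and encoder_out are illustrative names):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_X, encoder_out, W_q, W_k, W_v):
    """Queries from the decoder states, keys/values from the encoder output."""
    Q = decoder_X @ W_q       # (T_dec, d_k)
    K = encoder_out @ W_k     # (T_enc, d_k)
    V = encoder_out @ W_v     # (T_enc, d_k)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (T_dec, T_enc)
    return softmax(scores) @ V                # (T_dec, d_k)
```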

Positional encodings

The vanilla self-attention layer is permutation-equivariant — no positional information. The original paper adds sinusoidal positional encodings:

$$\operatorname{PE}(p, 2i) = \sin\!\left(\frac{p}{10000^{2i/d}}\right), \qquad \operatorname{PE}(p, 2i+1) = \cos\!\left(\frac{p}{10000^{2i/d}}\right).$$

Each position gets a deterministic vector that varies smoothly with position and dimension. Linear functions of these encodings can express relative offsets, so attention can learn relative-position relations.
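
A compact way to generate the full table of encodings (a sketch; the (T, d) shape and names are illustrative, and d is assumed even):

```python
import numpy as np

def sinusoidal_positional_encoding(T, d):
    """Return a (T, d) matrix whose row p is the encoding for position p."""
    positions = np.arange(T)[:, None]            # (T, 1)
    dims = np.arange(0, d, 2)[None, :]           # (1, d/2) even indices 2i
    angles = positions / (10000 ** (dims / d))   # (T, d/2)
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe

# The encodings are added to the token embeddings before the first layer:
# X = token_embeddings + sinusoidal_positional_encoding(T, d)
```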

Modern variants:

  • Learned absolute — train one position vector per slot. Used in BERT/GPT-2.
  • Relative position bias — T5, ALiBi: bias attention scores by a function of the offset $i - j$.
  • Rotary Position Embeddings (RoPE) — apply position-dependent rotations in the $Q$, $K$ subspaces (sketched after this list). Used in LLaMA, Qwen, and most modern open LLMs. See position encodings.
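
A minimal RoPE sketch. The interleaved-pair convention below is one common choice, not the only one; names and shapes are illustrative:

```python
import numpy as np

def rope(x, positions, base=10000):
    """Rotate consecutive dimension pairs of x (T, d) by position-dependent angles."""
    T, d = x.shape
    theta = base ** (-np.arange(0, d, 2) / d)      # (d/2,) per-pair frequencies
    angles = positions[:, None] * theta[None, :]   # (T, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                # split each pair (x1, x2)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin             # 2-D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Applied to queries and keys (not values) before the dot product, so that
# q_p . k_q depends on the content and the relative offset p - q.
```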
