Self-Attention, Multi-Head, Positional Encodings

The Transformer architecture from Attention Is All You Need rests on three primitives: self-attention (each position attends to all others), multi-head attention (run several attention computations in parallel), and positional encoding (inject order). This page works through the math and the design choices.

Self-attention

Given a sequence of token embeddings $X \in \mathbb{R}^{T \times d}$, project to queries, keys, and values:

$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V, \qquad W^Q, W^K, W^V \in \mathbb{R}^{d \times d_k}.$$

The output is scaled dot-product attention:

$$\operatorname{Attn}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V \in \mathbb{R}^{T \times d_k}.$$

Reading the formula: each row $q_i$ is compared (via dot product) to every key $k_j$, the scores are softmax-normalised across $j$, and the output is the weighted sum of the values $v_j$.
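
The formula maps directly to a few lines of code. A minimal NumPy sketch (function names and shapes are illustrative, not from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention.

    Q, K: (T, d_k), V: (T, d_v). Returns (T, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (T, T) pairwise similarities
    weights = softmax(scores, axis=-1)   # each row sums to 1 over the keys
    return weights @ V                   # weighted sum of value vectors
```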

Three properties:

  • Permutation-equivariant — without positional information, attention treats the input as a set. Order must be added explicitly.
  • $O(T^2)$ in sequence length — every pair of positions interacts. The quadratic cost is the central scaling bottleneck (see efficient attention).
  • Direct long-range access — any two positions interact in one attention layer. Compare to RNNs' $O(T)$ gradient path length.

Why $\sqrt{d_k}$?

For $q, k \in \mathbb{R}^{d_k}$ with i.i.d. zero-mean, unit-variance components, the dot product $q^\top k$ has variance $d_k$. As $d_k$ grows, dot products become large, the softmax becomes sharply peaked, and its gradients collapse towards zero. Dividing by $\sqrt{d_k}$ keeps the dot-product variance at 1 regardless of $d_k$.

This is the kind of detail that's easy to miss when reading the paper but critical when implementing. Skipping the $\sqrt{d_k}$ scaling is a routine source of bugs.
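
A quick numerical check of the variance argument (a standalone sketch, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (16, 64, 256, 1024):
    # Zero-mean, unit-variance components, 10,000 samples per dimensionality.
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))
    dots = (q * k).sum(axis=-1)
    # Unscaled dot products have variance ~ d_k; scaled ones stay near 1.
    print(d_k, round(dots.var(), 1), round((dots / np.sqrt(d_k)).var(), 2))
```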

Multi-head attention

Instead of one attention computation in $\mathbb{R}^d$, run $h$ heads in parallel, each in $\mathbb{R}^{d/h}$:

$$\operatorname{MHA}(X) = \operatorname{Concat}(\operatorname{head}_1, \dots, \operatorname{head}_h)\, W^O,$$

with $\operatorname{head}_i = \operatorname{Attn}(XW_i^Q, XW_i^K, XW_i^V)$. Total parameters and FLOPs are roughly the same as single-head attention with the full $d$, but the representational capacity is split across heads.
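
A minimal sketch of the head split, again in NumPy (the random projection weights are placeholders, not trained parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """X: (T, d); W_q, W_k, W_v, W_o: (d, d); h: number of heads."""
    T, d = X.shape
    d_h = d // h
    # Project once in R^d, then reshape to (h, T, d_h): one subspace per head.
    Q = (X @ W_q).reshape(T, h, d_h).transpose(1, 0, 2)
    K = (X @ W_k).reshape(T, h, d_h).transpose(1, 0, 2)
    V = (X @ W_v).reshape(T, h, d_h).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_h)   # (h, T, T)
    out = softmax(scores) @ V                          # (h, T, d_h)
    # Concatenate heads back to (T, d) and mix with the output projection.
    return out.transpose(1, 0, 2).reshape(T, d) @ W_o

# Example: T=5 tokens, d=16, h=4 heads.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 16))
Ws = [rng.standard_normal((16, 16)) / 4 for _ in range(4)]
print(multi_head_attention(X, *Ws, h=4).shape)  # (5, 16)
```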

Why split? Different heads can specialise:

  • Syntactic — track subject-verb agreement.
  • Positional — track relative offsets.
  • Semantic — track entity relations.

Probing studies (Clark et al., What Does BERT Look At?, 2019) show clean specialisation in some heads. In well-trained large Transformers, heads are also partially redundant — Michel et al. (NeurIPS 2019) showed that many of them can be pruned post-training.

Causal masking

For autoregressive generation (decoder-only LMs), each position must attend only to earlier positions. Implement with a causal mask added to the pre-softmax logits:

$$M_{ij} = \begin{cases} 0 & j \le i \\ -\infty & j > i \end{cases}, \qquad \operatorname{Attn}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right) V.$$

The $-\infty$ becomes 0 after the softmax, eliminating attention to future positions. This is what makes GPT-style models autoregressive.
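
A sketch of building the mask and adding it to the logits (shapes and names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(T):
    # 0 on and below the diagonal (j <= i), -inf strictly above it (j > i).
    mask = np.zeros((T, T))
    mask[np.triu_indices(T, k=1)] = -np.inf
    return mask

def causal_attention(Q, K, V):
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k) + causal_mask(Q.shape[0])
    return softmax(logits) @ V   # row i mixes only values at positions <= i
```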

Encoder-decoder attention

In the original encoder-decoder Transformer, the decoder has a third attention sub-layer where queries come from the decoder and keys/values from the encoder output. This is the Bahdanau-attention generalisation in self-attention dress.
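
The only change from self-attention is where the queries and the keys/values come from. A hedged sketch (decoder_X and encoder_out are illustrative names):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_X, encoder_out, W_q, W_k, W_v):
    """Queries from the decoder states, keys/values from the encoder output."""
    Q = decoder_X @ W_q       # (T_dec, d_k)
    K = encoder_out @ W_k     # (T_enc, d_k)
    V = encoder_out @ W_v     # (T_enc, d_k)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (T_dec, T_enc)
    return softmax(scores) @ V                # (T_dec, d_k)
```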

Positional encodings

The vanilla self-attention layer is permutation-equivariant — no positional information. The original paper adds sinusoidal positional encodings:

$$\operatorname{PE}(p, 2i) = \sin\!\left(\frac{p}{10000^{2i/d}}\right), \qquad \operatorname{PE}(p, 2i+1) = \cos\!\left(\frac{p}{10000^{2i/d}}\right).$$

Each position gets a deterministic vector that varies smoothly with position and dimension. Linear functions of these encodings can express relative offsets, so attention can learn relative-position relations.
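
A compact way to generate the full table of encodings (a sketch; the (T, d) shape and names are illustrative, and d is assumed even):

```python
import numpy as np

def sinusoidal_positional_encoding(T, d):
    """Return a (T, d) matrix whose row p is the encoding for position p."""
    positions = np.arange(T)[:, None]            # (T, 1)
    dims = np.arange(0, d, 2)[None, :]           # (1, d/2) even indices 2i
    angles = positions / (10000 ** (dims / d))   # (T, d/2)
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe

# The encodings are added to the token embeddings before the first layer:
# X = token_embeddings + sinusoidal_positional_encoding(T, d)
```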

Modern variants:

  • Learned absolute — train one position vector per slot. Used in BERT/GPT-2.
  • Relative position bias — T5, ALiBi: bias attention scores by a function of the offset $i - j$.
  • Rotary Position Embeddings (RoPE) — apply position-dependent rotations in the $Q$, $K$ subspaces (sketched after this list). Used in LLaMA, Qwen, and most modern open LLMs. See position encodings.
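
A minimal RoPE sketch. The interleaved-pair convention below is one common choice, not the only one; names and shapes are illustrative:

```python
import numpy as np

def rope(x, positions, base=10000):
    """Rotate consecutive dimension pairs of x (T, d) by position-dependent angles."""
    T, d = x.shape
    theta = base ** (-np.arange(0, d, 2) / d)      # (d/2,) per-pair frequencies
    angles = positions[:, None] * theta[None, :]   # (T, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                # split each pair (x1, x2)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin             # 2-D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Applied to queries and keys (not values) before the dot product, so that
# q_p . k_q depends on the content and the relative offset p - q.
```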
