The Transformer

The 2017 architecture that replaced both convolutions and recurrence in NLP, then in vision, then in essentially everything. The whole rest of this site is a chronicle of what people built on top of it.

The core idea

A self-attention layer turns a sequence $X \in \mathbb{R}^{n \times d}$ into a new sequence by letting every position attend to every other position. Concretely:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V,$$

where $Q = XW^Q$, $K = XW^K$, $V = XW^V$ are linear projections of the input. The softmax row for position $i$ is a probability distribution over all positions, used to take a weighted sum of the value vectors.

The $1/\sqrt{d_k}$ scaling keeps the dot products from growing too large in high dimensions and saturating the softmax.
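To make the shapes concrete, here is a minimal NumPy sketch of the formula above; the function name, the toy dimensions, and the randomly initialised projection matrices are illustrative placeholders rather than anything taken from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (n, d_k); V: (n, d_v). Returns an (n, d_v) array."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (n, n) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # weighted sum of value vectors

# Toy usage: a sequence of 4 positions with model width 8.
n, d = 4, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
out = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape)  # (4, 8)
```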

Multi-head attention

Run h attention operations in parallel with different projections, concatenate, and project once more:

$$\mathrm{MHA}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(XW_i^Q, XW_i^K, XW_i^V).$$

Each head can specialise (e.g. one tracks subject–verb agreement, another tracks coreference); empirically the model exploits this without supervision.
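A minimal sketch, assuming each role uses one stacked $(d, d)$ projection matrix that is reshaped into $h$ heads (equivalent to placing the per-head matrices $W_i^Q$, $W_i^K$, $W_i^V$ side by side); all names and shapes here are illustrative.

```python
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """X: (n, d); W_Q/W_K/W_V/W_O: (d, d); h must divide d."""
    n, d = X.shape
    d_k = d // h
    # Project once, then split the width into h heads of size d_k.
    Q = (X @ W_Q).reshape(n, h, d_k).transpose(1, 0, 2)   # (h, n, d_k)
    K = (X @ W_K).reshape(n, h, d_k).transpose(1, 0, 2)
    V = (X @ W_V).reshape(n, h, d_k).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)      # (h, n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax per head, per row
    heads = weights @ V                                    # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d)        # Concat(head_1, ..., head_h)
    return concat @ W_O
```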

A Transformer block

Each block is

```text
x ← x + MHA(LayerNorm(x))
x ← x + FFN(LayerNorm(x))
```

with a position-wise feed-forward network $\mathrm{FFN}(x) = W_2\,\mathrm{GELU}(W_1 x + b_1) + b_2$. Pre-LN (placing LayerNorm before the sub-layer) has become the standard because it trains stably without warmup tricks.
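A pre-LN block as described above, sketched in NumPy under simplifying assumptions: LayerNorm's learnable scale and bias are omitted, GELU uses the common tanh approximation, and `attn` stands in for any attention function mapping (n, d) to (n, d).

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each position over the feature dimension (no learnable scale/bias here).
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def transformer_block(x, attn, W1, b1, W2, b2):
    x = x + attn(layer_norm(x))                           # attention sub-layer + residual
    x = x + gelu(layer_norm(x) @ W1 + b1) @ W2 + b2       # position-wise FFN + residual
    return x
```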

Positional information

Self-attention is permutation-equivariant: without help, the model can't tell "the dog bit the man" from "the man bit the dog". The original paper added sinusoidal positional encodings; later variants (RoPE, ALiBi; see long-context) inject position directly into Q and K.
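The sinusoidal encoding from the original paper is easy to write down directly; this sketch assumes an even model width d, and the resulting matrix is simply added to the token embeddings before the first block.

```python
import numpy as np

def sinusoidal_positions(n, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(n)[:, None]                 # (n, 1) positions
    i = np.arange(0, d, 2)[None, :]             # (1, d/2) even feature indices
    angles = pos / np.power(10000.0, i / d)     # (n, d/2)
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Added to the token embeddings before the first block:
# X = token_embeddings + sinusoidal_positions(n, d)
```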

Why it won

  1. Parallelism. Unlike RNNs, every position can be processed in parallel — a perfect fit for GPU hardware.
  2. Long-range dependencies. Information moves between any two positions in O(1) layers, not O(n).
  3. Scaling. The architecture scales smoothly to billions of parameters and trillions of tokens — see Scaling Laws.

Reading list

  • Attention Is All You Need — Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin, NeurIPS 2017.
