Efficient Attention (Linformer, Performer, Reformer)
Self-attention is O(n²) in sequence length; the methods on this page are the main attempts to bring that cost down to linear or near-linear.

Why O(n²) is the problem

Attention's score matrix QKᵀ is n × n: doubling the context length quadruples both the compute and the memory needed to materialise it.
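To make the quadratic term concrete, here is a minimal NumPy sketch of dense softmax attention; the names and shapes are illustrative, not taken from any of the papers below. The (n, n) score matrix is exactly what every method that follows tries to avoid building.

```python
import numpy as np

def dense_attention(Q, K, V):
    """Standard softmax attention. Q, K, V: (n, d) arrays."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)          # (n, n) -- quadratic in sequence length
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                      # (n, d)

n, d = 4096, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = dense_attention(Q, K, V)
# The (n, n) score matrix alone holds 4096 * 4096 floats; doubling n quadruples it.
```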
Two attack strategies:
- Sparse attention — restrict each query to attend to only a subset of keys.
- Low-rank / kernel attention — approximate the full softmax by something computable in O(n) time.
Reformer — locality-sensitive hashing
Reformer: The Efficient Transformer (Kitaev, Kaiser, Levskaya, ICLR 2020) used LSH (locality-sensitive hashing) to bucket queries and keys into hash bins; each query attends only to keys in the same bin. The attention complexity drops to O(n log n).
Quality on WikiText was competitive but not better than dense attention at the same parameter count. Reformer remains a clean illustration of the LSH-attention idea.
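A toy sketch of that idea follows — single-round random-hyperplane hashing rather than Reformer's actual scheme (which shares Q and K, uses multiple hash rounds, and sorts into fixed-size chunks); the function name and bucket count are my own.

```python
import numpy as np

def lsh_attention(Q, K, V, n_buckets=8, seed=0):
    """Toy LSH attention: each query attends only to keys in the same hash bucket."""
    n, d = Q.shape
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((d, n_buckets))      # random projections
    q_bucket = np.argmax(Q @ planes, axis=-1)          # bucket id per query
    k_bucket = np.argmax(K @ planes, axis=-1)          # bucket id per key

    out = np.zeros_like(Q)
    for b in range(n_buckets):
        q_idx = np.where(q_bucket == b)[0]
        k_idx = np.where(k_bucket == b)[0]
        if len(q_idx) == 0 or len(k_idx) == 0:
            continue                                    # queries with empty buckets stay zero here
        scores = Q[q_idx] @ K[k_idx].T / np.sqrt(d)     # small block, never n x n
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[q_idx] = w @ V[k_idx]
    return out
```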
Linformer — low-rank projection
Linformer: Self-Attention with Linear Complexity (Wang, Li, Khabsa, Fang, Ma, 2020) noted that the n × n attention matrix is approximately low-rank. It projects the keys and values along the sequence axis from length n down to a fixed dimension k with learned projection matrices E and F, so the score matrix shrinks to n × k and the cost drops to O(nk) — linear in n when k is held constant.
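A sketch of the projection trick, with random matrices standing in for Linformer's learned E and F (the function name and scaling are assumptions, not the paper's code):

```python
import numpy as np

def linformer_attention(Q, K, V, k=64, seed=0):
    """Low-rank attention: project the length-n axis of K and V down to k."""
    n, d = Q.shape
    rng = np.random.default_rng(seed)
    E = rng.standard_normal((k, n)) / np.sqrt(n)   # projects keys   (k, n); learned in the real model
    F = rng.standard_normal((k, n)) / np.sqrt(n)   # projects values (k, n); learned in the real model

    K_proj = E @ K                                  # (k, d)
    V_proj = F @ V                                  # (k, d)
    scores = Q @ K_proj.T / np.sqrt(d)              # (n, k) instead of (n, n)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V_proj                               # (n, d), cost O(n * k)
```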
Performer — random Fourier features
Performer: Rethinking Attention with Performers (Choromanski et al., ICLR 2021) approximated the softmax kernel with random feature maps (FAVOR+, a positive variant of random Fourier features), allowing the attention computation to be re-associated:

softmax(QKᵀ)V ≈ φ(Q) (φ(K)ᵀ V)

The right factor, φ(K)ᵀV, is an r × d matrix (r = number of random features) computed once in O(nrd); multiplying by φ(Q) costs another O(nrd), so the whole attention is linear in n instead of quadratic.
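A sketch of that factorisation using the positive feature map φ(x) = exp(xW − ‖x‖²/2)/√r, without FAVOR+'s orthogonal projections or numerical-stability refinements; the function and variable names are mine.

```python
import numpy as np

def performer_attention(Q, K, V, n_features=256, seed=0):
    """Kernel attention: softmax(QK^T)V ~= phi(Q) (phi(K)^T V) / row-normaliser."""
    n, d = Q.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((d, n_features))         # random projection (not orthogonalised here)

    def phi(X):
        # exp(xW - |x|^2 / 2) / sqrt(r): positive random features for the softmax kernel
        return np.exp(X @ W - 0.5 * (X ** 2).sum(-1, keepdims=True)) / np.sqrt(n_features)

    Qf = phi(Q / d ** 0.25)                           # fold the 1/sqrt(d) temperature into Q and K
    Kf = phi(K / d ** 0.25)
    KV = Kf.T @ V                                     # (r, d), computed once in O(n r d)
    normaliser = Qf @ Kf.sum(axis=0)                  # (n,), approximates the softmax row sums
    return (Qf @ KV) / normaliser[:, None]            # (n, d), linear in n
```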
Longformer and BigBird — sparse + global
Longformer (Beltagy, Peters, Cohan, 2020) and BigBird (Zaheer et al., NeurIPS 2020) used structured sparse attention patterns: each token attends to a local window plus a small set of "global" tokens (e.g., a [CLS] token that attends to and from everything). Complexity is O(n·w) for window size w — linear in n for a fixed window.
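A sketch of the mask such a pattern produces. It materialises the mask densely just to show the structure; the real implementations never build the n × n matrix, and the function name and defaults are illustrative.

```python
import numpy as np

def window_plus_global_mask(n, window=2, global_idx=(0,)):
    """Boolean (n, n) mask: True where query i may attend to key j.
    Each token sees a local window of +/- `window`; tokens in `global_idx`
    (e.g. a [CLS] position) attend to and are attended by everyone."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    mask = np.abs(i - j) <= window                    # local sliding window
    for g in global_idx:
        mask[g, :] = True                              # global token attends everywhere
        mask[:, g] = True                              # everyone attends to the global token
    return mask

print(window_plus_global_mask(8, window=1).astype(int))
# Each row has only O(window + #globals) ones, so a sparse kernel runs in O(n), not O(n^2).
```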
Why none of them won at the frontier
Three reasons frontier LLMs (GPT-4, Claude, Gemini) use mostly dense attention despite all this efficient-attention work:
- Quality gap. Approximate attention loses 1–3 perplexity points vs dense at the same scale. For a frontier model, that's a non-starter.
- Engineering complexity. Custom kernels for each scheme, awkward interactions with KV caching, harder to debug.
- FlashAttention (Dao et al., 2022) — the IO-aware kernel that runs exact attention faster than most of the approximations, by reorganising memory access rather than changing the math (a blockwise sketch of the trick follows this list). Once FlashAttention existed, the practical motivation for approximate attention largely disappeared.
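A minimal NumPy sketch of the online-softmax idea underneath FlashAttention, assuming a single unbatched head: keys are processed in blocks while running maxima and running sums keep the result exact, so the full n × n score matrix never exists at once. The real kernel fuses this into GPU SRAM tiles, which this sketch does not attempt.

```python
import numpy as np

def blockwise_attention(Q, K, V, block=512):
    """Exact softmax attention over key blocks via an online softmax.
    Same output as dense attention, but only (n, block) scores exist at a time."""
    n, d = Q.shape
    out = np.zeros_like(Q)
    running_max = np.full(n, -np.inf)
    running_sum = np.zeros(n)

    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = Q @ Kb.T / np.sqrt(d)                  # (n, block) only
        block_max = scores.max(axis=-1)
        new_max = np.maximum(running_max, block_max)
        correction = np.exp(running_max - new_max)      # rescale earlier partial results
        p = np.exp(scores - new_max[:, None])
        out = out * correction[:, None] + p @ Vb
        running_sum = running_sum * correction + p.sum(axis=-1)
        running_max = new_max

    return out / running_sum[:, None]
```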
The exception: state-space models (Mamba, RWKV) and linear attention revived the linear-complexity idea with new architectures and now compete with Transformers at long contexts. See linear attention and Mamba.
What survived
- Sparse-window attention in long-document encoders (Longformer-style) — still standard for processing huge inputs efficiently.
- The conceptual taxonomy — sparse vs low-rank vs kernel approximations — is the right way to organise modern long-context literature.
- The O(n²)-is-a-problem framing — it drove FlashAttention, Mamba, and the long-context engineering of frontier LLMs.
What to read next
- Long-Context Transformers — the modern engineering answer.
- Mamba — state-space models as attention alternatives.
- Linear Attention — the modern revival of kernel-style attention.