Sliding-Window & Sparse Attention
The 2023-2024 generation of sparse-attention techniques restricts attention to a structured subset of (query, key) pairs to keep cost sub-quadratic. Sliding-window attention (Mistral 7B, Longformer, hybrid Mamba-Transformer models' local-attention layers) is the practical default; sparse and structured attention (Native Sparse Attention, Sinkhorn attention, BigBird-style) round out the literature. Together they're the third leg of long-context modeling — alongside linear attention / SSMs and position-encoding extensions.
Sliding-window attention
The simplest sparse pattern: each query attends only to the previous $w$ tokens.
- Causal sliding window — query at position $i$ attends to keys in $[i-w,\, i]$. Linear $O(nw)$ cost.
- Bidirectional sliding window — for non-causal models, attend to $[i-w,\, i+w]$.
- Dilated sliding window — attend to every $d$-th key in the window. Increases effective receptive field at fixed cost.
Mistral 7B made sliding-window attention a frontier-LLM staple. With a window of $w = 4096$ and 32 layers, information propagates one window further per layer, giving a theoretical attention span of roughly $32 \times 4096 \approx 131\text{K}$ tokens.
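As a concrete illustration, here is a minimal PyTorch sketch of the causal sliding-window mask with optional dilation (the helper name and its arguments are mine, not any library's API); the final comment repeats the window-times-layers receptive-field arithmetic:

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int, dilation: int = 1) -> torch.Tensor:
    """Boolean mask: True where query i may attend to key j."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions (column vector)
    j = torch.arange(seq_len).unsqueeze(0)   # key positions (row vector)
    causal = j <= i                           # never attend to the future
    in_window = (i - j) < window * dilation   # stay within the (dilated) span
    on_stride = (i - j) % dilation == 0       # dilation: keep every d-th key
    return causal & in_window & on_stride

# Each layer only looks `window` tokens back, but stacking layers widens reach:
# with a Mistral-style window of 4096 and 32 layers, information can propagate
# across roughly 32 * 4096 ≈ 131K positions.
mask = sliding_window_causal_mask(seq_len=16, window=4)
print(mask.int())
```

Per layer, each query touches at most `window` keys, which is why the cost stays linear in sequence length for a fixed window.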
Local + global hybrid: Longformer & BigBird
Longformer (Beltagy, Peters, Cohan, 2020) and BigBird (Zaheer et al., NeurIPS 2020) combined three patterns:
- Local sliding window for nearby context.
- Global tokens that attend to and from every other position (e.g., a [CLS] token).
- Random attention over a small subset of additional (query, key) pairs.
The combination preserves the theoretical universality of full attention (BigBird is provably a universal approximator of sequence functions and Turing complete) while keeping cost linear in sequence length.
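A rough sketch of how the three patterns compose into a single boolean mask; the window size, number of global tokens, and random keys per query below are illustrative parameters, not the papers' exact configurations:

```python
import torch

def bigbird_style_mask(seq_len: int, window: int, num_global: int,
                       num_random: int, seed: int = 0) -> torch.Tensor:
    """Local window + global tokens + random pairs, as one boolean mask."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)

    # 1. Bidirectional local window around each query.
    mask = (i - j).abs() <= window

    # 2. Global tokens (here: the first num_global positions, e.g. [CLS])
    #    attend everywhere and are attended to from every position.
    mask[:num_global, :] = True
    mask[:, :num_global] = True

    # 3. A few random extra keys per query.
    gen = torch.Generator().manual_seed(seed)
    rand_keys = torch.randint(0, seq_len, (seq_len, num_random), generator=gen)
    rows = torch.arange(seq_len).unsqueeze(1).expand(-1, num_random)
    mask[rows, rand_keys] = True
    return mask

mask = bigbird_style_mask(seq_len=64, window=3, num_global=2, num_random=2)
print(f"{mask.float().mean().item():.2%} of (query, key) pairs kept")
```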
Native Sparse Attention (DeepSeek, 2025)
Native Sparse Attention (Yuan, Tang, Zhao et al., DeepSeek 2025) is the current frontier of sparse attention. NSA decomposes each query's attention into three branches:
- Compressed coarse attention — attend to every $d$-th token's compressed K/V representation. Captures long-range gist at low cost.
- Selected fine attention — attend to the top-$n$ blocks chosen via a learned scorer. Captures the locally-relevant retrieval pattern.
- Sliding window — attend to recent context exactly.
Each branch is hardware-efficient; the combination is trainable end-to-end. NSA reports matching dense-attention quality at substantially lower compute, particularly on long-context retrieval and reasoning benchmarks.
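A heavily simplified, unofficial sketch of the three-branch idea: each branch runs ordinary scaled dot-product attention over a different key/value subset, and a gate mixes the outputs. The block size, top-n selection, pooling-based compression, and fixed gate weights below are illustrative stand-ins, not the paper's trained modules or fused kernels:

```python
import torch
import torch.nn.functional as F

def attend(q, k, v):
    """Standard scaled dot-product attention (single head, no mask)."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

def nsa_like(q, k, v, block=8, top_n=2, window=16, gate=None):
    """Toy three-branch combination for ONE query vector q of shape (d,)."""
    seq_len, d = k.shape

    # Branch 1: compressed coarse attention -- mean-pool K/V per block.
    kc = k[: seq_len // block * block].reshape(-1, block, d).mean(dim=1)
    vc = v[: seq_len // block * block].reshape(-1, block, d).mean(dim=1)
    out_cmp = attend(q.unsqueeze(0), kc, vc)

    # Branch 2: selected fine attention -- score blocks via the compressed keys,
    # then attend to the original (uncompressed) tokens of the top-n blocks.
    block_scores = (q.unsqueeze(0) @ kc.transpose(-2, -1)).squeeze(0)
    top_blocks = block_scores.topk(min(top_n, kc.shape[0])).indices
    idx = torch.cat([torch.arange(b * block, (b + 1) * block)
                     for b in top_blocks.tolist()])
    out_sel = attend(q.unsqueeze(0), k[idx], v[idx])

    # Branch 3: sliding window -- exact attention over the most recent tokens.
    out_win = attend(q.unsqueeze(0), k[-window:], v[-window:])

    # Gate combining the branches (fixed equal weights here; learned in NSA).
    g = gate if gate is not None else torch.full((3,), 1 / 3)
    return g[0] * out_cmp + g[1] * out_sel + g[2] * out_win

q = torch.randn(64); k = torch.randn(256, 64); v = torch.randn(256, 64)
print(nsa_like(q, k, v).shape)  # torch.Size([1, 64])
```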
NSA is the first sparse-attention work to be deployed at frontier scale — DeepSeek V3.5+ uses it natively. Earlier sparse-attention research mostly involved retrofitting sparsity onto pretrained dense models, which degraded quality.
Sliding-window in practice
Modern open-LLM architectures use sliding-window attention selectively:
- Mistral 7B / Mixtral — sliding window 4096 in all layers.
- Llama 3 — full attention up to 8K (then RoPE-extended).
- Llama 4 (MoE) — interleaves sliding-window and full-attention layers, similar to the hybrid Mamba-Transformer pattern.
- Gemini 1.5 — proprietary mix of local and global attention layers.
The trade-off: pure full attention preserves capability but at quadratic cost; pure sliding window is linear but limits long-range exact retrieval. Hybrid stacks — alternating sliding-window and full-attention layers — are the modern default.
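One way to express such a hybrid stack is a per-layer schedule; the 1-in-4 full-attention ratio below is a made-up example, not any specific model's configuration:

```python
# Hypothetical layer schedule: one full-attention layer every 4 layers,
# sliding-window attention everywhere else.
NUM_LAYERS = 32
FULL_EVERY = 4
WINDOW = 4096

layer_schedule = [
    {"layer": i,
     "attention": "full" if (i + 1) % FULL_EVERY == 0 else "sliding_window",
     "window": None if (i + 1) % FULL_EVERY == 0 else WINDOW}
    for i in range(NUM_LAYERS)
]

# Most layers stay linear-cost; a few full-attention layers preserve
# exact long-range retrieval.
print(sum(l["attention"] == "full" for l in layer_schedule), "full-attention layers")
```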
Why sparse attention is "the third option"
For long context, the field has three architectural levers:
- Linear attention / SSMs — change the math: $O(n)$ training and $O(1)$ per-token state, at some cost to in-context learning.
- Sparse / windowed attention — keep the math, restrict the support set. Predictable and stable.
- Engineering — FlashAttention + KV-cache compression make exact attention faster, but it remains quadratic.
Sparse attention is the least disruptive choice — it slots into existing Transformer infrastructure without changing inference semantics. That's why it's the default in production long-context models.
Limitations
- Random / structured patterns sacrifice some retrieval. Truly long-range exact matches must fall on the local-or-global path; positions off that path are invisible to the query.
- Pattern-design overhead. Choosing the right window/global/random mix requires per-task tuning.
- Quality loss vs. full attention. Even NSA, the strongest variant, gives up some quality for the compute savings.
For most production deployments, the verdict is: use sliding-window for the bulk of attention layers, full attention for a few critical layers, and structured-sparse mechanisms (NSA, MoBA) when training a new model from scratch.
What to read next
- Long-Context — the broader engineering story.
- Mamba — the linear-recurrent alternative.
- Efficient Attention — the predecessor wave.