Linear Attention — Hyena, RetNet
The 2023 wave of "linear attention" architectures — Hyena, RetNet, GLA, and others — revisited the efficient-attention idea with a sharper pitch: Transformer-quality at sub-quadratic cost in sequence length.
Linear attention, from the kernel view
Recall that scaled dot-product attention computes, for each query position $i$ (causally masked),

$$\mathrm{Attn}(Q, K, V)_i = \frac{\sum_{j \le i} \exp(q_i^\top k_j / \sqrt{d})\, v_j}{\sum_{j \le i} \exp(q_i^\top k_j / \sqrt{d})},$$

which costs $O(N^2)$ in sequence length $N$ because every query is compared against every key.

If we replace the softmax-of-dot-product kernel with any kernel that factors through a feature map, $\mathrm{sim}(q_i, k_j) = \phi(q_i)^\top \phi(k_j)$, the output becomes

$$\mathrm{LinAttn}(Q, K, V)_i = \frac{\phi(q_i)^\top \sum_{j \le i} \phi(k_j)\, v_j^\top}{\phi(q_i)^\top \sum_{j \le i} \phi(k_j)}.$$

By associativity, the inner sums $S_i = \sum_{j \le i} \phi(k_j)\, v_j^\top$ and $z_i = \sum_{j \le i} \phi(k_j)$ can be carried as running state and updated once per token, so the whole sequence is processed in $O(N)$ time with constant-size state, instead of $O(N^2)$.
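To make the associativity trick concrete, here is a minimal NumPy sketch (not taken from any of the papers above; all names are illustrative) that computes causal linear attention two ways: the quadratic way, materialising the full similarity matrix, and the recurrent way, carrying the running sums $S_i$ and $z_i$. The feature map is ELU(x)+1, one common choice in the linear-attention literature.

```python
import numpy as np

def phi(x):
    # A simple positive feature map, ELU(x) + 1.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_quadratic(Q, K, V):
    # Materialise the full (N, N) similarity matrix, causally masked: O(N^2).
    N = Q.shape[0]
    sim = phi(Q) @ phi(K).T * np.tril(np.ones((N, N)))
    return (sim @ V) / (sim.sum(axis=1, keepdims=True) + 1e-9)

def linear_attention_recurrent(Q, K, V):
    # Same computation, carrying running sums S and z: O(N) time, O(1) state.
    S = np.zeros((K.shape[1], V.shape[1]))  # running sum of phi(k_j) v_j^T
    z = np.zeros(K.shape[1])                # running sum of phi(k_j)
    out = np.zeros_like(V)
    for i in range(Q.shape[0]):
        qi, ki = phi(Q[i]), phi(K[i])
        S += np.outer(ki, V[i])
        z += ki
        out[i] = (qi @ S) / (qi @ z + 1e-9)
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
assert np.allclose(linear_attention_quadratic(Q, K, V),
                   linear_attention_recurrent(Q, K, V), atol=1e-6)
```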
The catch: linear attention is generally less expressive than softmax attention. The softmax's hard selection between strongly- and weakly-matching keys is structurally lost, and naive linear attention underperforms.
Hyena (2023)
Hyena Hierarchy: Towards Larger Convolutional Language Models (Poli, Massaroli, Nguyen et al., ICML 2023). Hyena replaces attention with a stack of long-range convolutions parameterised in the frequency domain — closely related to S4 but with a different parametrisation.
The Hyena operator, simplified:
- For each layer, compute three projections $q$, $k$, $v$ of the input, as in Transformers.
- Replace the $q_i^\top k_j$ similarity with element-wise multiplication plus a long convolution with an implicitly-parameterised kernel.
- Output is $y = q \odot (h * (k \odot v))$, where $h$ is the long-range convolution kernel and $*$ denotes causal convolution over the sequence.
The long convolutions are parameterised implicitly: a small MLP indexed by position produces the filter values, which are then evaluated efficiently in the frequency domain via the FFT. The result is sub-quadratic ($O(N \log N)$) scaling in sequence length.
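A rough sketch of the FFT long convolution at the core of Hyena, under simplifying assumptions: a single channel, an explicit filter array standing in for the MLP-parameterised filter, and the simplified order-2 operator from the list above. Zero-padding to length $2N$ keeps the circular FFT convolution causal.

```python
import numpy as np

def fft_long_conv(u, h):
    # Causal convolution of a length-N signal u with a length-N filter h
    # in O(N log N) via the FFT, zero-padded to avoid circular wrap-around.
    N = u.shape[0]
    U = np.fft.rfft(u, n=2 * N)
    H = np.fft.rfft(h, n=2 * N)
    return np.fft.irfft(U * H, n=2 * N)[:N]

def hyena_order2(q, k, v, h):
    # Simplified order-2 Hyena operator: y = q * (h conv (k * v)),
    # element-wise gating plus one long convolution in place of q_i^T k_j.
    return q * fft_long_conv(k * v, h)

rng = np.random.default_rng(0)
N = 1024
q, k, v = (rng.standard_normal(N) for _ in range(3))
h = np.exp(-0.01 * np.arange(N)) * rng.standard_normal(N)  # a decaying long filter
print(hyena_order2(q, k, v, h).shape)  # (1024,)
```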
Hyena models match Transformer perplexity on sequence modelling at small-to-medium scale (155M-1B parameters). At frontier scale, the architecture has been less explored.
RetNet — Retentive Network
Retentive Network: A Successor to Transformer for Large Language Models (Sun, Dong et al., Microsoft 2023). RetNet's central contribution is three computational views of the same recurrent operation:
- Parallel view — for training, similar cost to attention.
- Recurrent view — for inference, $O(1)$ per-token state, like an RNN.
- Chunkwise view — for long-context training, parallel within chunks and recurrent across them, giving linear scaling in sequence length.
The retention operator, in its parallel form, is

$$\mathrm{Retention}(X) = (Q K^\top \odot D)\, V.$$

The matrix $D$ fuses the causal mask with an exponential decay: $D_{nm} = \gamma^{n-m}$ for $n \ge m$ and $0$ otherwise, with $\gamma$ a fixed per-head decay rate. Because the decay is a fixed exponential, the same computation unrolls into the recurrence $S_n = \gamma S_{n-1} + k_n^\top v_n$, $o_n = q_n S_n$, which is the recurrent view above.
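A minimal single-head sketch of all three views (illustrative; it omits RetNet's multi-head retention with per-head decay rates, group normalisation, and query/key rotations), verifying that they produce the same output.

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    # Parallel (training) view: (Q K^T ⊙ D) V with decay-plus-causal mask D.
    N = Q.shape[0]
    n, m = np.arange(N)[:, None], np.arange(N)[None, :]
    D = np.where(n >= m, gamma ** (n - m), 0.0)
    return (Q @ K.T * D) @ V

def retention_recurrent(Q, K, V, gamma):
    # Recurrent (inference) view: S_n = gamma * S_{n-1} + k_n^T v_n, o_n = q_n S_n.
    S = np.zeros((K.shape[1], V.shape[1]))
    out = np.zeros((Q.shape[0], V.shape[1]))
    for n in range(Q.shape[0]):
        S = gamma * S + np.outer(K[n], V[n])
        out[n] = Q[n] @ S
    return out

def retention_chunkwise(Q, K, V, gamma, chunk):
    # Chunkwise view: parallel inside each chunk, recurrent across chunks.
    N, d_v = Q.shape[0], V.shape[1]
    S = np.zeros((K.shape[1], d_v))          # state carried across chunks
    out = np.zeros((N, d_v))
    for s in range(0, N, chunk):
        q, k, v = Q[s:s+chunk], K[s:s+chunk], V[s:s+chunk]
        L = q.shape[0]
        n, m = np.arange(L)[:, None], np.arange(L)[None, :]
        D = np.where(n >= m, gamma ** (n - m), 0.0)
        inner = (q @ k.T * D) @ v                                  # within-chunk part
        cross = (gamma ** (np.arange(L) + 1))[:, None] * (q @ S)   # decayed carry-over
        out[s:s+L] = inner + cross
        decay = (gamma ** (L - 1 - np.arange(L)))[:, None]
        S = gamma ** L * S + (decay * k).T @ v                     # update carried state
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((12, 4)) for _ in range(3))
ref = retention_parallel(Q, K, V, 0.9)
assert np.allclose(ref, retention_recurrent(Q, K, V, 0.9))
assert np.allclose(ref, retention_chunkwise(Q, K, V, 0.9, chunk=5))
```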
Empirically, RetNet matched Transformers at modest scale and offered substantial inference-throughput improvements due to the constant-state recurrent view. Its production deployment has been smaller than Mamba's, but the multi-view-of-the-same-op idea recurred in later work.
GLA — Gated Linear Attention
Gated Linear Attention Transformers with Hardware-Efficient Training (Yang, Wang, et al., ICML 2024). GLA adds input-dependent gating to linear attention — the same selectivity idea that made Mamba competitive, applied to the linear-attention state update. At each step, gates control how much of the running state is retained.
GLA's recurrence is roughly

$$S_t = G_t \odot S_{t-1} + k_t^\top v_t, \qquad o_t = q_t S_t,$$

with $q_t, k_t, v_t$ the usual per-token projections and $G_t \in (0,1)^{d_k \times d_v}$ an input-dependent forget gate (structured as an outer product in practice, so it stays cheap to compute and parallelise). Setting $G_t$ to all-ones recovers vanilla linear attention; setting it to a constant $\gamma$ recovers RetNet's retention.
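A purely recurrent sketch of gated linear attention, assuming a simple sigmoid gate projected from the current token (the projection names are made up for illustration). The GLA paper's main contribution, a hardware-efficient chunkwise-parallel training form of this recurrence, is omitted here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gla_recurrent(X, Wq, Wk, Wv, Wa):
    # Gated linear attention, recurrent form:
    #   S_t = (alpha_t 1^T) ⊙ S_{t-1} + k_t^T v_t,   o_t = q_t S_t
    # with an input-dependent forget gate alpha_t = sigmoid(x_t Wa) per key dimension.
    d_k, d_v = Wk.shape[1], Wv.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.zeros((X.shape[0], d_v))
    for t, x in enumerate(X):
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        alpha = sigmoid(x @ Wa)                  # forget gate in (0, 1)^{d_k}
        S = alpha[:, None] * S + np.outer(k, v)  # decay old state, write new association
        out[t] = q @ S
    return out

rng = np.random.default_rng(0)
d_model, d_k, d_v, N = 8, 4, 4, 16
X = rng.standard_normal((N, d_model))
Wq, Wk, Wv, Wa = (0.3 * rng.standard_normal((d_model, d)) for d in (d_k, d_k, d_v, d_k))
print(gla_recurrent(X, Wq, Wk, Wv, Wa).shape)  # (16, 4)
```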
The gated linear-attention formulation is the basis of DeltaNet, RWKV-7, and several frontier hybrid models in 2024-25.
What linear attention is for
The linear-attention family is winning on:
- Long-sequence efficiency. $O(N)$ training and $O(1)$ per-token inference are real wins past 100K-token contexts.
- Streaming inference. Constant-state recurrent inference suits voice agents, real-time analytics, and low-latency agents.
- Hybrid architectures. Modern hybrid stacks (Jamba, Zamba, Granite Mamba) interleave Transformer attention with linear-attention or SSM blocks.
It's not yet winning on:
- Frontier-scale pure architectures. Pure linear-attention models at 100B+ are still rare.
- In-context learning. Linear-attention ICL lags Transformer ICL at comparable scale, though closing.
- Exact retrieval. Constant-size state cannot precisely recall arbitrary earlier tokens.
Linear attention vs SSMs vs Transformers
The 2024-25 architectural landscape has three main contenders:
- Transformers — expressive, ICL-strong, ecosystem-mature, but quadratic ($O(N^2)$) in sequence length.
- SSMs (Mamba/Mamba-2) — efficient, content-aware, state-of-the-art for long sequences, but ICL-weaker.
- Linear attention (GLA, RWKV-7) — efficient, similar trade-offs to SSMs, mathematically dual to selective SSMs in some forms.
The dominant production answer in 2025 is hybrids: Transformer attention layers (for retrieval and ICL) plus linear-recurrent layers (for efficiency on long contexts).
What to read next
- Mamba — selective SSMs, mathematically dual to selective linear attention.
- RWKV — another linear-recurrent Transformer alternative.
- Efficient Attention — the predecessor wave.