Exact Linear Attention

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

Exact Linear Attention (ELA) is a Transformer attention mechanism whose cost grows linearly with sequence length L, i.e. O(L), without approximation error. It relies on kernel functions that decompose exactly, and it addresses the gradient explosion and token attention dilution of earlier linear attention through three kernel constraints: non-negativity, discriminability, and geometric interpretability. The paper introduces kernels such as the Hadamard Exp Kernel, along with three engineering additions: a Hyper-Link structure to mitigate gradient degradation, a Memory Lobe module for qualitative memory and implicit reinforcement learning, and a routing-score-based bias for Mixture-of-Experts. Compared to full attention, ELA is reported to decode up to 6× faster and to cut KV cache memory by 75%, while maintaining comparable or superior training performance. It enables scaling Transformers to ultra-long sequences, exemplified by MiniMax's 4 million token context window.

Key takeaway

Machine learning engineers and AI architects who are scaling Transformer models to ultra-long sequences should evaluate ELA's kernel-based approach together with its Hyper-Link and Memory Lobe components. The reported gains are up to 6× faster decoding and a 75% reduction in KV cache memory relative to full attention. Gains of that size would make context windows of millions of tokens practical to serve and would lower infrastructure costs for large language models.

Key insights

Exact Linear Attention (ELA) uses kernel decomposition to achieve O(L) complexity without approximation, enhancing Transformer efficiency and scalability.
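
One way to see where the linear cost comes from at decode time is the sketch below. It assumes a generic decomposable kernel with an elementwise-exponential feature map; `phi`, `LinearAttentionState`, and `causal_reference` are hypothetical names chosen for illustration, and the feature map is not claimed to be one of the paper's kernels. Each new token updates two running sums of fixed size, so per-token work and memory do not grow with position, yet the output matches a reference that rescans every earlier key and value.

```python
import math


def phi(x):
    # Hypothetical feature map (assumption): elementwise exp, always positive.
    return [math.exp(v) for v in x]


class LinearAttentionState:
    """Fixed-size running sums that stand in for a per-token key/value cache."""

    def __init__(self, d, dv):
        self.S = [[0.0] * dv for _ in range(d)]  # sum over j of phi(k_j)^T v_j
        self.z = [0.0] * d                       # sum over j of phi(k_j)

    def step(self, q, k, v):
        # Fold the new key/value pair into the running sums.
        pk = phi(k)
        for a, pka in enumerate(pk):
            self.z[a] += pka
            for c, vc in enumerate(v):
                self.S[a][c] += pka * vc
        # Read out the attention result for the current query.
        pq = phi(q)
        denom = sum(p * z for p, z in zip(pq, self.z))
        return [
            sum(pq[a] * self.S[a][c] for a in range(len(pq))) / denom
            for c in range(len(v))
        ]


def causal_reference(Q, K, V, t):
    """Position t computed the quadratic way, from every key/value up to t."""
    pq = phi(Q[t])
    w = [sum(a * b for a, b in zip(pq, phi(K[j]))) for j in range(t + 1)]
    total = sum(w)
    return [
        sum(w[j] * V[j][c] for j in range(t + 1)) / total
        for c in range(len(V[0]))
    ]


Q = [[0.1, -0.2], [0.3, 0.0], [-0.5, 0.4], [0.2, 0.2]]
K = [[0.2, 0.1], [-0.1, 0.3], [0.0, -0.4], [0.5, -0.3]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [-1.0, 0.5]]

state = LinearAttentionState(d=2, dv=2)
streamed = [state.step(q, k, v) for q, k, v in zip(Q, K, V)]

# Largest difference between the streamed and the rescanned outputs.
max_gap = max(
    abs(a - b)
    for t in range(len(Q))
    for a, b in zip(causal_reference(Q, K, V, t), streamed[t])
)

# 2 + 2*2 = 6 floats, regardless of how many tokens were processed.
state_floats = len(state.z) + sum(len(row) for row in state.S)
```

The state size depends only on the feature and value dimensions, which is the general reason a decomposable kernel can shrink decode-time memory. The sketch does not attempt to reproduce the 6× and 75% figures reported for ELA.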

Principles

Kernel functions for linear attention must be exactly decomposable.
Ideal kernels require discriminability, non-negativity, and geometric interpretability.
Transformation flow captures layer-wise semantic evolution for qualitative memory.
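
Two of the kernel requirements above, exact decomposability and non-negativity, can be checked numerically. The sketch below does so for an illustrative kernel, k(a, b) = Σ_d exp(a_d + b_d), which is an assumption made for this example and not necessarily the Hadamard Exp Kernel from the paper; `kernel`, `phi`, `psi`, and `check_kernel` are hypothetical names. Discriminability and geometric interpretability are not tested here.

```python
import math
import random


def kernel(a, b):
    # Illustrative kernel (assumption): k(a, b) = sum_d exp(a_d + b_d).
    return sum(math.exp(x + y) for x, y in zip(a, b))


def phi(a):
    # Query-side feature map of the factorisation.
    return [math.exp(x) for x in a]


def psi(b):
    # Key-side feature map of the factorisation.
    return [math.exp(y) for y in b]


def check_kernel(trials=200, dim=4, seed=0):
    """Return (worst relative gap between k and phi.psi, smallest k seen)."""
    rng = random.Random(seed)
    worst_gap = 0.0
    smallest = float("inf")
    for _ in range(trials):
        a = [rng.uniform(-2.0, 2.0) for _ in range(dim)]
        b = [rng.uniform(-2.0, 2.0) for _ in range(dim)]
        direct = kernel(a, b)
        factored = sum(p * s for p, s in zip(phi(a), psi(b)))
        worst_gap = max(worst_gap, abs(direct - factored) / direct)
        smallest = min(smallest, direct)
    return worst_gap, smallest


rel_err, min_value = check_kernel()
```

A relative gap at the level of floating-point rounding indicates that the factorisation is exact and not an approximation, and a strictly positive minimum confirms that no pair of inputs receives a negative weight.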

Method

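ELA decomposes the kernel k(A_i, B_j) into φ(A_i)ψ(B_j)⊤. Because the query-side and key-side factors separate, the order of summation can be swapped: the sums over keys and values are computed once and reused for every query, so both the attention output and its normalization cost O(L), with no softmax.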
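
The summation swap can be sketched in a few lines. The code below assumes the same map on both sides of the factorisation (elementwise exp), which is an illustrative choice and not confirmed to be the paper's kernel; `quadratic_attention` and `linear_attention` are hypothetical names. The first function scores every query-key pair; the second builds the key-side sums once and reuses them, and the two agree up to floating-point rounding.

```python
import math


def phi(x):
    # Hypothetical feature map (assumption): elementwise exp, always positive.
    return [math.exp(v) for v in x]


psi = phi  # assumption: the key side uses the same map as the query side


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def quadratic_attention(Q, K, V):
    """Reference: evaluate the kernel for every (i, j) pair, O(L^2) work."""
    out = []
    for q in Q:
        w = [dot(phi(q), psi(k)) for k in K]
        total = sum(w)  # normalisation: sum of kernel weights, no softmax
        out.append([
            sum(wj * v[c] for wj, v in zip(w, V)) / total
            for c in range(len(V[0]))
        ])
    return out


def linear_attention(Q, K, V):
    """Same result with the sums swapped: one pass over keys, one over queries."""
    d, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]  # S = sum over j of psi(k_j)^T v_j
    z = [0.0] * d                       # z = sum over j of psi(k_j)
    for k, v in zip(K, V):
        pk = psi(k)
        for a in range(d):
            z[a] += pk[a]
            for c in range(dv):
                S[a][c] += pk[a] * v[c]
    out = []
    for q in Q:
        pq = phi(q)
        denom = dot(pq, z)
        out.append([
            sum(pq[a] * S[a][c] for a in range(d)) / denom
            for c in range(dv)
        ])
    return out


Q = [[0.1, -0.2], [0.3, 0.0], [-0.5, 0.4]]
K = [[0.2, 0.1], [-0.1, 0.3], [0.0, -0.4]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]

ref = quadratic_attention(Q, K, V)
fast = linear_attention(Q, K, V)

# Largest elementwise difference between the two computations.
max_gap = max(abs(a - b) for r, f in zip(ref, fast) for a, b in zip(r, f))
```

The linear version touches each key once and each query once, so its cost grows with L instead of L², while the feature dimension contributes only a constant factor.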

In practice

Employ Hadamard Exp Kernel for multimodal feature co-activation.
Replace residual connections with Hyper-Link for gradient stability.
Integrate Memory Lobe to accelerate convergence and improve generalization.

Topics

Exact Linear Attention
Transformer Architecture
Kernel Functions
Long-Context LLMs
Mixture-of-Experts
Computational Efficiency

Code references

jingyaogong/minimind

Best for: MLOps Engineer, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.