When Good Enough Is Optimal: Multiplication-Only Matrix Inversion Approximation for Quantized Gated DeltaNet
Summary
A new MatMul-based algorithm addresses the matrix inversion bottleneck in chunk-wise parallel linear attention, which is critical for long-context modeling on NPUs. The method, detailed in "When Good Enough Is Optimal," uses a truncated Neumann expansion with structural masking and parallel residual correction to eliminate sequential dependencies. It extends to low-bit integer (INT) arithmetic by mitigating the dynamic range expansion caused by repeated matrix power operations. The algorithm adapts the approximation order and the number of residual correction steps to the chunk size, minimizing computational cost while preserving accuracy. Experiments on Qwen3.5-family models show up to 5x kernel-level speedup and a 20% reduction in decode-layer overhead, with accuracy maintained under both floating-point and low-precision inference.
Key takeaway
For Machine Learning Engineers optimizing long-context models on NPUs, this multiplication-only approximation replaces the sequential matrix inversion in chunk-wise linear attention with matrix multiplications alone. Consider integrating the MatMul-based approach into your linear attention implementations, especially when targeting low-precision inference: the reported gains are up to 5x kernel-level speedup and a 20% reduction in decode-layer overhead.
Key insights
Multiplication-only matrix inversion approximation significantly speeds up linear attention on NPUs by eliminating sequential dependencies.
Principles
- Neumann-series terms grow rapidly.
- The inverse matrix is concentrated near the diagonal.
- Adapt approximation order to chunk size.
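The link between approximation order and chunk size can be made concrete with a short derivation. The symbols below (N for a strictly lower-triangular matrix, C for the chunk size, K for the truncation order, X_K for the truncated sum) are notation assumed here for illustration and are not taken from the paper.

```latex
% N is strictly lower-triangular of size C x C, hence nilpotent: N^C = 0.
(I - N)^{-1} = \sum_{k=0}^{C-1} N^{k}

% Truncating at order K < C - 1 leaves a residual that is a single matrix power:
X_K = \sum_{k=0}^{K} N^{k}, \qquad I - (I - N)\,X_K = N^{K+1}
```

Because the series terminates after C terms, a larger chunk needs a higher order or more correction steps to reach the same accuracy, which is consistent with adapting the order to the chunk size.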
Method
The method combines a truncated Neumann expansion with structural masking and parallel residual correction, and extends to low-bit integer (INT) inference by mitigating the dynamic range expansion caused by repeated matrix power operations.
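The following is a minimal NumPy sketch of the general idea, not the paper's kernel. It assumes the matrix to invert has the form I + A with A strictly lower-triangular, interprets "structural masking" as simply zeroing everything on and above the diagonal, and uses a Newton-style update X <- X + X(I - MX) as a stand-in for the residual correction. The function name `neumann_inverse` and its parameters are hypothetical, and no quantization is modeled.

```python
import numpy as np

def neumann_inverse(A, order=3, residual_steps=1):
    """Approximate (I + A)^{-1} using matrix multiplications only.

    Assumption: A is (or is masked to be) strictly lower-triangular.
    """
    n = A.shape[-1]
    eye = np.eye(n, dtype=A.dtype)
    # Assumed form of structural masking: keep only the strictly lower part.
    N = -np.tril(A, k=-1)
    M = eye - N  # the matrix being inverted, equal to I + A after masking

    # Truncated Neumann series: X = I + N + N^2 + ... + N^order
    X = eye.copy()
    term = eye.copy()
    for _ in range(order):
        term = term @ N
        X = X + term

    # Residual correction (assumed Newton-style update).
    # Each step uses only matmuls and doubles the number of series terms covered.
    for _ in range(residual_steps):
        R = eye - M @ X
        X = X + X @ R
    return X

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    chunk = 16
    A = np.tril(rng.uniform(-0.3, 0.3, size=(chunk, chunk)), k=-1)
    exact = np.linalg.inv(np.eye(chunk) + A)
    for steps in (0, 1, 2):
        approx = neumann_inverse(A, order=3, residual_steps=steps)
        print(steps, np.max(np.abs(approx - exact)))
```

In this sketch, an order-K series followed by r correction steps covers the first (K + 1) * 2^r powers of N, so for a chunk of size C the result is exact (up to rounding) once (K + 1) * 2^r >= C. Smaller settings trade accuracy for fewer matmuls.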
In practice
- Apply to strictly lower-triangular matrices.
- Optimize for chunk-wise linear attention.
- Mitigate dynamic range expansion for INT.
Topics
- Matrix Inversion
- Linear Attention
- Quantization
- NPUs
- Neumann Expansion
- Qwen3.5 Models
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.