When Good Enough Is Optimal: Multiplication-Only Matrix Inversion Approximation for Quantized Gated DeltaNet
Summary
A new MatMul-based algorithm addresses the matrix inversion bottleneck in chunk-wise parallel linear attention, a critical issue for long-context modeling on NPUs, particularly in GatedDeltaNet. The method employs a truncated Neumann expansion with structural masking and parallel residual correction to eliminate sequential dependencies and mitigate dynamic range expansion in low-bit integer quantization. Experiments on Qwen3.5-family models demonstrate up to 5× kernel-level speedup and a 20% reduction in decode-layer overhead. This approach preserves model accuracy under both floating-point and low-precision inference, offering an efficient and hardware-friendly solution for scalable linear attention. For a chunk size of k=64, a Neumann series order N=3 and S=8 residual correction steps were used.
Key takeaway
AI scientists and machine learning engineers optimizing LLMs for NPU deployment should consider this MatMul-based approximation of matrix inversion. The reported results on Qwen3.5-family models are up to 5× kernel-level speedup and a 20% reduction in decode-layer overhead, with accuracy preserved even under INT16 quantization. This makes long-context model deployment on resource-constrained edge devices more efficient.
Key insights
Approximating matrix inversion with MatMul-only operations is optimal for hardware efficiency and low-bit quantization in linear attention.
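To make the quantization point concrete, here is a minimal sketch of a single matrix product routed through INT16 operands. It is an illustration only: the per-tensor symmetric scaling, the function names `quantize_int16` and `int16_matmul`, and the int64 accumulator are assumptions made for this example, not the scheme used in the paper.

```python
import numpy as np

def quantize_int16(x):
    """Symmetric per-tensor quantization to int16 (assumed scheme). Returns (q, scale)."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 32767.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -32767, 32767).astype(np.int16)
    return q, scale

def int16_matmul(a, b):
    """Multiply two float matrices using int16 operands and a wide integer accumulator."""
    qa, sa = quantize_int16(a)
    qb, sb = quantize_int16(b)
    # int64 accumulation is chosen here only to rule out overflow in the sketch.
    acc = qa.astype(np.int64) @ qb.astype(np.int64)
    # Undo both scales to return to the floating-point domain.
    return acc.astype(np.float64) * (sa * sb)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.uniform(-1.0, 1.0, size=(8, 8))
    b = rng.uniform(-1.0, 1.0, size=(8, 8))
    print("max abs error:", np.max(np.abs(int16_matmul(a, b) - a @ b)))
```

Because a MatMul-only inversion approximation is a chain of such products, every step could in principle take this integer path, whereas an exact solver built on sequential substitution has no comparable mapping in this sketch.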
Principles
- Exact matrix inversion is often unnecessary for practical accuracy.
- Inverse energy concentrates near the main diagonal of the matrix.
- Low-order Neumann truncation requires residual correction for stability.
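The second principle can be checked on any given matrix with a small helper. The sketch below measures how much of a matrix's squared Frobenius norm lies within a band below the main diagonal. The names `band_mask` and `band_energy_fraction`, the lower-triangular band shape, and the synthetic test matrix are all assumptions for illustration; the paper's actual mask is not specified in this summary.

```python
import numpy as np

def band_mask(k, width):
    """Boolean k x k mask keeping the main diagonal and `width` sub-diagonals below it."""
    rows, cols = np.indices((k, k))
    offset = rows - cols
    return (offset >= 0) & (offset <= width)

def band_energy_fraction(M, width):
    """Share of the squared Frobenius norm of M that lies inside the band."""
    total = float(np.sum(M ** 2))
    if total == 0.0:
        return 1.0
    kept = float(np.sum((M * band_mask(M.shape[0], width)) ** 2))
    return kept / total

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k = 64
    # Synthetic stand-in for a chunk matrix: strictly lower triangular, small entries.
    A = np.tril(rng.standard_normal((k, k)) * 0.05, k=-1)
    inv = np.linalg.inv(np.eye(k) + A)
    for width in (0, 4, 16, 63):
        print(width, band_energy_fraction(inv, width))
```

Running the same measurement on inverses taken from a real model would show whether the concentration holds for that workload; the synthetic matrix here only demonstrates the measurement.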
Method
The proposed method uses a truncated Neumann expansion, applies a diagonal mask to control outliers, and refines the approximation with parallel residual correction to eliminate sequential dependencies.
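A minimal sketch of the first and third ingredients follows, assuming the matrix to invert has the form I + A with A strictly lower triangular. The residual correction shown is a Newton-Schulz style update, swapped in here as a stand-in because this summary does not spell out the paper's exact correction rule; the function name and parameters are hypothetical, and the masking step is omitted.

```python
import numpy as np

def neumann_inverse(A, order=3, correction_steps=2):
    """Approximate inv(I + A) for strictly lower-triangular A using only matmuls and adds."""
    k = A.shape[0]
    I = np.eye(k, dtype=A.dtype)
    M = I + A

    # Truncated Neumann series: X = sum_{i=0}^{order} (-A)^i
    X = I.copy()
    term = I.copy()
    for _ in range(order):
        term = term @ (-A)
        X = X + term

    # Residual correction (Newton-Schulz style, an assumption of this sketch):
    # each step replaces the residual R = I - M X with R @ R.
    for _ in range(correction_steps):
        R = I - M @ X
        X = X + X @ R
    return X

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k = 64
    A = np.tril(rng.standard_normal((k, k)) * 0.05, k=-1)  # synthetic input
    X = neumann_inverse(A, order=3, correction_steps=2)
    print("max abs residual:", np.max(np.abs((np.eye(k) + A) @ X - np.eye(k))))
```

Under this scheme the residual after the series is (-A)^(order+1), and every correction step squares it. With order=3 and two steps the residual is A^16, which vanishes for a 16×16 strictly lower-triangular A, so that case is inverted exactly up to rounding; larger chunks may leave a small residual. No step depends on a row-by-row substitution, which is what removes the sequential dependency.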
In practice
- Limit Neumann truncation order (e.g., N=3 for 64×64 matrices).
- Use 2-3 residual correction iterations for convergence.
- Adapt approximation order to chunk size for cost/accuracy balance.
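One way to explore the cost/accuracy balance in the list above is to sweep the truncation order and the number of correction steps on a representative matrix. The sketch below does that under the same assumptions as before: the input has the form I + A with A strictly lower triangular, the correction is a Newton-Schulz style update standing in for the paper's rule, and all function names are hypothetical.

```python
import numpy as np

def approx_inverse(A, order, steps):
    """Truncated Neumann series for inv(I + A), then `steps` residual corrections."""
    k = A.shape[0]
    I = np.eye(k)
    X = I.copy()
    term = I
    for i in range(order):
        # The first power is -A itself, so it costs no matmul.
        term = -A if i == 0 else term @ (-A)
        X = X + term
    M = I + A
    for _ in range(steps):
        X = X + X @ (I - M @ X)  # two matmuls per correction step
    return X

def matmul_count(order, steps):
    """Matmuls issued by approx_inverse for the given settings."""
    return max(order - 1, 0) + 2 * steps

def sweep(A, orders=(1, 2, 3), steps_options=(0, 1, 2, 3)):
    """Return (order, steps, matmuls, max abs residual) for each combination."""
    I = np.eye(A.shape[0])
    M = I + A
    rows = []
    for order in orders:
        for steps in steps_options:
            X = approx_inverse(A, order, steps)
            err = float(np.max(np.abs(M @ X - I)))
            rows.append((order, steps, matmul_count(order, steps), err))
    return rows

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = np.tril(rng.standard_normal((64, 64)) * 0.05, k=-1)  # synthetic input
    for row in sweep(A):
        print(row)
```

The residuals printed for a synthetic matrix say nothing about model accuracy; the sweep is only useful once A is taken from the target model at the intended chunk size.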
Topics
- Matrix Inversion
- Linear Attention
- Quantization
- GatedDeltaNet
- NPU Optimization
- Neumann Series
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.