When Good Enough Is Optimal: Multiplication-Only Matrix Inversion Approximation for Quantized Gated DeltaNet

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

A new MatMul-based algorithm addresses the matrix inversion bottleneck in chunk-wise parallel linear attention, a critical obstacle for long-context modeling on NPUs, particularly in Gated DeltaNet. The method combines a truncated Neumann expansion with structural masking and parallel residual correction, which eliminates sequential dependencies and mitigates dynamic range expansion under low-bit integer quantization. Experiments on Qwen3.5-family models demonstrate up to 5× kernel-level speedup and a 20% reduction in decode-layer overhead. The approach preserves model accuracy under both floating-point and low-precision inference, offering an efficient, hardware-friendly solution for scalable linear attention. The reported configuration for a chunk size of k=64 uses Neumann series order N=3 and S=8 residual correction steps.

Key takeaway

AI scientists and machine learning engineers optimizing LLMs for NPU deployment should consider adopting this MatMul-based matrix inversion approximation. The reported results show up to 5× kernel-level speedup and a 20% reduction in decode-layer overhead on Qwen3.5-family models, with accuracy preserved even under INT16 quantization. This makes long-context model deployment more efficient on resource-constrained edge devices.

Key insights

An approximate inverse built only from MatMul operations can be the better choice for linear attention: it maps well to NPU hardware and low-bit quantization while preserving model accuracy.

Principles

Exact matrix inversion is often unnecessary for practical accuracy.
Inverse energy concentrates near the main diagonal of the matrix.
Low-order Neumann truncation requires residual correction for stability.
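
The third principle can be made concrete with a short derivation. This is a sketch under an assumption the summary does not spell out: the chunk-local system matrix is taken to have the form I + A with A strictly lower triangular and of size k×k, as in the usual chunk-wise delta-rule formulation. The symbols A and X_N are introduced here for illustration only.

```latex
\begin{align}
(I + A)^{-1} &= \sum_{n=0}^{k-1} (-A)^{n}, \qquad \text{since } A^{k} = 0, \\
X_N &= \sum_{n=0}^{N} (-A)^{n}, \\
I - (I + A)\,X_N &= (-A)^{N+1}.
\end{align}
```

Under that assumption the series is exact at order k-1, and truncating at a low order N leaves a residual of exactly (-A)^{N+1}. That leftover term is what a correction step has to remove.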

Method

The proposed method approximates the inverse with a truncated Neumann expansion, applies a diagonal mask to control outliers, and refines the result with parallel residual correction. Every step is a matrix multiplication, so the sequential dependencies of exact inversion are eliminated.
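
The three stages can be sketched in NumPy as follows. This is a minimal illustration, not the paper's kernel. Several pieces are assumptions of this sketch: the system matrix is taken to be I + A with A strictly lower triangular, the mask is a lower-triangular band of hypothetical width `width`, and the correction uses a Newton-Schulz style update as a stand-in for the paper's residual correction, whose exact form the summary does not give. The names `band_mask` and `approx_inverse` and the default arguments are illustrative.

```python
import numpy as np


def band_mask(k, width):
    """Lower-triangular band mask: keep entries with 0 <= i - j <= width."""
    offsets = np.arange(k)[:, None] - np.arange(k)[None, :]
    return ((offsets >= 0) & (offsets <= width)).astype(np.float64)


def approx_inverse(A, order=3, steps=3, width=None):
    """Approximate inv(I + A) for strictly lower-triangular A using matmuls only."""
    k = A.shape[0]
    eye = np.eye(k)
    # width=None keeps the full lower triangle (no band truncation).
    mask = band_mask(k, k if width is None else width)

    # Stage 1 and 2: truncated Neumann sum of (-A)^n for n = 0..order,
    # with each new term masked before it is accumulated (assumed masking scheme).
    term = eye.copy()
    X = eye.copy()
    for _ in range(order):
        term = (term @ (-A)) * mask
        X = X + term

    # Stage 3: residual correction. Each step is two dense matmuls rather than
    # a row-by-row substitution. Newton-Schulz form is an assumption of this sketch.
    M = eye + A
    for _ in range(steps):
        R = eye - M @ X   # residual of the current estimate
        X = X + X @ R     # same as X @ (2I - M @ X); squares the residual
    return X


# Synthetic chunk, not data from the paper.
rng = np.random.default_rng(0)
k = 64
A = np.tril(rng.uniform(-0.05, 0.05, size=(k, k)), k=-1)
exact = np.linalg.inv(np.eye(k) + A)
for steps in (0, 1, 2, 4):
    err = np.abs(approx_inverse(A, order=3, steps=steps) - exact).max()
    print(f"steps={steps} max_abs_err={err:.2e}")
```

With this particular update the residual squares at every step, so on a nilpotent residual the estimate becomes exact after a handful of steps. The paper's own correction scheme may converge differently.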

In practice

Limit Neumann truncation order (e.g., N=3 for 64×64 matrices).
Use 2-3 residual correction iterations for convergence.
Adapt approximation order to chunk size for cost/accuracy balance.
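
To get a feel for the cost/accuracy trade-off behind these recommendations, the sketch below evaluates the truncated series in Horner form, so that order N costs exactly N matrix multiplications, and measures the truncation error for a few chunk sizes. It assumes the system matrix is I + A with A strictly lower triangular. The test matrices are synthetic random data, so the printed errors say nothing about the paper's models. `neumann_horner` and `truncation_error` are hypothetical helper names.

```python
import numpy as np


def neumann_horner(A, order):
    """Evaluate sum_{n=0}^{order} (-A)^n with exactly `order` matmuls (Horner form)."""
    eye = np.eye(A.shape[0])
    X = eye.copy()
    for _ in range(order):
        X = eye - A @ X
    return X


def truncation_error(k, order, scale=0.05, seed=0):
    """Max-abs error of the truncated series against np.linalg.inv on a random chunk."""
    rng = np.random.default_rng(seed)
    A = np.tril(rng.uniform(-scale, scale, size=(k, k)), k=-1)
    exact = np.linalg.inv(np.eye(k) + A)
    return np.abs(neumann_horner(A, order) - exact).max()


# Sweep chunk size and order on synthetic data (illustrative only).
for k in (16, 32, 64):
    for order in (1, 3, 7):
        err = truncation_error(k, order)
        print(f"k={k:3d} N={order} matmuls={order} max_abs_err={err:.2e}")
```

A sweep like this, run on real chunk matrices rather than random ones, is one way to choose the smallest order that meets an accuracy target for a given chunk size.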

Topics

Matrix Inversion
Linear Attention
Quantization
GatedDeltaNet
NPU Optimization
Neumann Series

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.