When Good Enough Is Optimal: Multiplication-Only Matrix Inversion Approximation for Quantized Gated DeltaNet

2026-06-04 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, AI Hardware & Efficiency · Depth: Expert, quick

Summary

A new MatMul-based algorithm targets the matrix inversion bottleneck in chunk-wise parallel linear attention, which is critical for long-context modeling on NPUs. The method, detailed in "When Good Enough Is Optimal," uses a truncated Neumann expansion with structural masking and parallel residual correction to eliminate sequential dependencies. It extends to low-bit integer (INT) inference by mitigating the dynamic range expansion caused by repeated matrix power operations. The algorithm adapts the approximation order and the number of residual steps to the chunk size, minimizing computational cost while preserving accuracy. Experiments on Qwen3.5-family models show up to a 5x kernel-level speedup and a 20% reduction in decode-layer overhead, with accuracy maintained under both floating-point and low-precision inference.

Key takeaway

For machine learning engineers optimizing long-context models on NPUs, this multiplication-only matrix inversion approximation removes a sequential bottleneck in linear attention. Consider integrating the MatMul-based approach into your linear attention implementations: the reported gains are up to a 5x kernel-level speedup and a 20% reduction in decode-layer overhead, and the method is reported to maintain accuracy under low-precision inference. It is a hardware-friendly route to scalable linear attention.

Key insights

Multiplication-only matrix inversion approximation speeds up linear attention on NPUs (up to 5x at the kernel level) by eliminating sequential dependencies.

Principles

Neumann-series terms grow rapidly.
Inverse matrix shows diagonal concentration.
Adapt approximation order to chunk size.
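
The first and third principles can be read off a standard identity for nilpotent matrices. The notation here (N for the strictly lower-triangular block, C for the chunk size, m for the truncation order) is introduced for illustration and is not taken from the source; if the matrix to invert is written as I + A, substitute N = -A.

```latex
% N is a strictly lower-triangular C x C matrix, hence nilpotent: N^C = 0.
(I - N)^{-1} = \sum_{k=0}^{C-1} N^{k},
\qquad
(I - N)\sum_{k=0}^{m} N^{k} = I - N^{m+1}.
```

Truncating at order m therefore leaves a residual of exactly N^{m+1}, which vanishes once m + 1 >= C. This is why the useful order depends on chunk size, and why the magnitude of the higher powers matters for numerical range.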

Method

The method is a truncated Neumann expansion with structural masking and parallel residual correction, extended to low-bit integer (INT) inference by mitigating the dynamic range expansion caused by repeated matrix power operations.
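
A minimal NumPy sketch of the idea, under stated assumptions: the summary does not give the exact form of the masking or of the residual correction, so this example assumes a lower-triangular mask re-applied after each product and a Newton-Schulz style correction step. The function name `neumann_inverse` and the parameters `order` and `steps` are hypothetical, and the paper's kernel may differ.

```python
import numpy as np


def neumann_inverse(N, order, steps):
    """Approximate inv(I - N) for strictly lower-triangular N using only matmuls.

    order: highest power kept in the truncated Neumann sum (hypothetical name).
    steps: number of residual-correction passes (hypothetical name).
    """
    C = N.shape[0]
    I = np.eye(C, dtype=N.dtype)
    # Assumed form of "structural masking": keep results lower-triangular.
    mask = np.tril(np.ones((C, C), dtype=N.dtype))
    M = I - N

    # Truncated Neumann sum: X = I + N + ... + N^order.
    X = I.copy()
    P = I.copy()
    for _ in range(order):
        P = (P @ N) * mask
        X = X + P

    # Assumed residual correction (Newton-Schulz form): X <- X + X @ (I - M @ X).
    # Each pass squares the residual, so it goes N^(order+1) -> N^(2*(order+1)) -> ...
    for _ in range(steps):
        R = (I - M @ X) * mask
        X = (X + X @ R) * mask
    return X


rng = np.random.default_rng(0)
C = 16
N = np.tril(rng.normal(scale=0.3, size=(C, C)), k=-1)

X = neumann_inverse(N, order=3, steps=2)
err = np.max(np.abs(X - np.linalg.inv(np.eye(C) - N)))
print(f"max abs error vs. direct inverse: {err:.2e}")
```

Under this assumed correction, the residual after `steps` passes is N raised to the power (order + 1) * 2**steps, so the result matches the true inverse up to rounding once that exponent reaches the chunk size (here 4 * 4 = 16). That gives one concrete way to trade approximation order against residual steps for a given chunk size.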

In practice

Apply to strictly lower-triangular matrices.
Optimize for chunk-wise linear attention.
Mitigate dynamic range expansion for INT.
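
To illustrate why dynamic range matters for integer inference, the toy sketch below tracks the largest entry of successive powers of a strictly lower-triangular matrix and then quantizes one power to int8. The all-ones matrix is a deliberately extreme example, and `quantize_int8` is a generic symmetric per-tensor quantizer written for this illustration; neither is the paper's data nor its mitigation scheme.

```python
import numpy as np


def quantize_int8(x):
    """Symmetric per-tensor int8 quantization; returns (codes, scale)."""
    amax = float(np.max(np.abs(x)))
    scale = amax / 127.0 if amax > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


C = 8
# Toy worst case: every entry below the diagonal is 1.
N = np.tril(np.ones((C, C)), k=-1)

# Largest absolute entry of N^1, N^2, ..., N^C.
peaks = []
P = np.eye(C)
for k in range(1, C + 1):
    P = P @ N
    peaks.append(int(np.max(np.abs(P))))
print("peak magnitude per power:", peaks)

# Quantize N^4 with one shared scale: the step size is set by the peak entry,
# so small entries are represented on a grid sized for the largest one.
N4 = np.linalg.matrix_power(N, 4)
q, scale = quantize_int8(N4)
max_err = float(np.max(np.abs(q * scale - N4)))
print(f"int8 scale: {scale:.4f}, max abs rounding error: {max_err:.4f}")
```

In this toy case the peak entry climbs from 1 to 20 by the fourth power before nilpotency drives it back to zero at the eighth. A per-tensor scale chosen for one power is a poor fit for another, which is the kind of range expansion a low-bit implementation has to control.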

Topics

Matrix Inversion
Linear Attention
Quantization
NPUs
Neumann Expansion
Qwen3.5 Models

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.