MAGE: All-[MASK] Block Already Knows Where to Look in Block Diffusion LLM

2026-06-08 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

MAGE is a dynamic sparse attention method for block diffusion LLMs that targets the KV cache memory bottleneck in long-context inference. It rests on one observation: the attention pattern of the initial All-[MASK] block reliably predicts both which KV entries matter and how large a KV budget the subsequent denoising steps need. MAGE therefore runs a single exact attention pass per block and reuses its result for training-free sparse denoising. On LongBench and Needle-in-a-Haystack, MAGE achieves near-lossless accuracy with a reduced KV budget, delivers up to 3–4x end-to-end speedup over dense attention, and outperforms AR-oriented sparse attention baselines such as Quest and Tidal. A lightweight fine-tuning strategy, requiring only 100–200 steps on a single NVIDIA H100 GPU for 1.5B and 7B models, further improves accuracy, often beyond that of exact attention.

Key takeaway

For MLOps engineers deploying block diffusion LLMs in long-context applications, MAGE speeds up inference by reducing KV cache memory access: up to 3–4x end-to-end over dense attention, with accuracy that matches or sometimes exceeds it. Evaluate the training-free version first, then try the lightweight fine-tuning for further gains; it needs only 100–200 steps on a single H100.

Key insights

MAGE uses the attention of the initial All-[MASK] block to guide sparse KV selection, making block diffusion LLM inference more efficient.

Principles

All-[MASK] block attention reliably predicts critical KV entries (84–90% recall).
Layer-wise attention score skewness remains stable across denoising steps.
Overlapping index selection with attention computation hides overhead.
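
The recall figure in the first principle can be read as top-k overlap: of the KV entries that matter most at a later denoising step, what fraction was already picked out by the All-[MASK] block? The sketch below is a minimal illustration of that metric on synthetic scores. The helper names (`topk_indices`, `topk_recall`), the value of k, and the noise model are assumptions made for illustration, not the paper's evaluation code.

```python
import numpy as np

def topk_indices(scores, k):
    """Indices of the k largest entries of a 1-D score vector (unordered)."""
    return np.argpartition(scores, -k)[-k:]

def topk_recall(predicted_scores, reference_scores, k):
    """Fraction of the reference top-k that also appears in the predicted top-k."""
    pred = set(topk_indices(predicted_scores, k).tolist())
    ref = set(topk_indices(reference_scores, k).tolist())
    return len(pred & ref) / k

# Synthetic stand-ins (assumption): per-KV attention mass seen by the
# All-[MASK] block, and a perturbed copy standing in for a later denoising step.
rng = np.random.default_rng(0)
mask_block_scores = rng.random(1024)
later_step_scores = mask_block_scores + 0.05 * rng.standard_normal(1024)

recall = topk_recall(mask_block_scores, later_step_scores, k=128)
print(f"top-128 recall: {recall:.2f}")
```

In a real measurement the two score vectors would come from the model's attention maps for a given layer, aggregated over heads and block queries in whatever way the evaluation defines.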

Method

MAGE computes exact attention once per block, on the initial All-[MASK] block, to identify critical KV indices under layer-adaptive budgets, then reuses those indices for all subsequent sparse denoising steps of that block. The optional fine-tuning uses self-distillation, with exact attention as the teacher.
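
The select-once, reuse-many pattern described above can be sketched for a single head in NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the summary does not say how the layer-adaptive budget is derived, so the code assumes a cumulative attention-mass threshold (`mass`), which gives sharply peaked layers a smaller budget than flat ones. The function names, shapes, and the perturbed queries standing in for denoising steps are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def select_kv_indices(q_mask_block, k_cache, mass=0.95):
    """One exact attention pass with All-[MASK] block queries.

    Returns the sorted KV indices that cover `mass` of the block-averaged
    attention. The mass threshold is an assumed budget rule.
    """
    d = q_mask_block.shape[-1]
    probs = softmax(q_mask_block @ k_cache.T / np.sqrt(d))  # (block, n_kv)
    importance = probs.mean(axis=0)                         # (n_kv,), sums to 1
    order = np.argsort(importance)[::-1]                    # most important first
    cum = np.cumsum(importance[order])
    budget = min(int(np.searchsorted(cum, mass)) + 1, len(order))
    return np.sort(order[:budget])

def sparse_attention(q, k_cache, v_cache, idx):
    """Attention restricted to the selected KV rows."""
    d = q.shape[-1]
    return softmax(q @ k_cache[idx].T / np.sqrt(d)) @ v_cache[idx]

rng = np.random.default_rng(0)
n_kv, block, d = 256, 8, 16
k_cache = rng.standard_normal((n_kv, d))
v_cache = rng.standard_normal((n_kv, d))
q_mask = rng.standard_normal((block, d))

# Exact pass once per block, then reuse the same indices at every step.
idx = select_kv_indices(q_mask, k_cache, mass=0.95)
for step in range(4):
    # Perturbed queries stand in for partially denoised block states (assumption).
    q_step = q_mask + 0.1 * rng.standard_normal((block, d))
    out = sparse_attention(q_step, k_cache, v_cache, idx)
```

A production version would gather KV pages on the GPU and call a fused sparse attention kernel rather than indexing dense arrays, but the control flow is the same: the expensive selection happens once, and every denoising step only touches the selected entries.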

In practice

Apply MAGE for 3–4x speedup in block diffusion LLMs.
Fine-tune MAGE for 100–200 steps on a single H100 for accuracy gains.
Utilize FlashInfer for efficient sparse attention kernels.

Topics

Block Diffusion LLMs
Sparse Attention
KV Cache Optimization
Long-Context Inference
Model Fine-tuning
NVIDIA H100

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.