MAGE: All-[MASK] Block Already Knows Where to Look in Block Diffusion LLM

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

MAGE is a dynamic sparse attention method for block diffusion LLMs that targets the KV cache memory bottleneck in long-context inference. It builds on the observation that the attention pattern of the initial All-[MASK] block reliably predicts which KV entries matter, and how large a budget is needed, in the subsequent denoising steps. MAGE therefore runs a single exact attention pass per block and reuses the selected KV entries for training-free sparse denoising. Evaluated on LongBench and Needle-in-a-Haystack, MAGE achieves near-lossless accuracy with a reduced KV budget, delivers up to 3–4x end-to-end speedup over dense attention, and outperforms AR-oriented sparse attention baselines such as Quest and Tidal. A lightweight fine-tuning strategy, requiring only 100–200 steps on a single NVIDIA H100 GPU for 1.5B and 7B models, further improves MAGE's accuracy, often beyond that of exact attention.

Key takeaway

For MLOps engineers deploying block diffusion LLMs in long-context applications, MAGE improves inference performance by reducing KV cache memory access. Consider integrating MAGE to obtain up to 3–4x inference speedup while matching, or in some cases exceeding, the accuracy of dense attention. Evaluate the training-free version first, then try the lightweight fine-tuning for further gains; it needs only 100–200 steps on a single GPU.

Key insights

MAGE uses the attention of the initial All-[MASK] block to guide sparse KV selection, making block diffusion LLM inference more efficient.

Method

MAGE computes exact attention on the first All-[MASK] block to identify the critical KV indices under layer-adaptive budgets, then reuses those indices for all subsequent sparse denoising steps of that block. The optional fine-tuning uses self-distillation, with exact attention as the teacher.
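
The select-once, reuse-many idea described above can be sketched in NumPy for a single attention head. This is an illustrative sketch, not the authors' implementation: the function names, the averaging of attention over the block's queries, and the cumulative-mass rule used to size the budget (`mass`, `max_budget`) are all assumptions introduced here, since the summary does not specify how budgets are chosen.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def exact_attention(q, k, v):
    """Dense attention. q: (block, dim); k, v: (n_ctx, dim)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    probs = softmax(scores, axis=-1)
    return probs @ v, probs

def select_kv_indices(probs, mass=0.9, max_budget=None):
    """Hypothetical selection rule: keep the smallest set of KV entries
    whose block-averaged attention mass reaches `mass`.
    Applied per layer, this yields a different budget for each layer."""
    importance = probs.mean(axis=0)           # (n_ctx,), sums to 1
    order = np.argsort(-importance)           # most important first
    cum = np.cumsum(importance[order])
    budget = int(np.searchsorted(cum, mass)) + 1
    budget = min(budget, len(order))
    if max_budget is not None:
        budget = min(budget, max_budget)
    return np.sort(order[:budget])

def sparse_attention(q, k, v, idx):
    """Attention restricted to the pre-selected KV entries."""
    out, _ = exact_attention(q, k[idx], v[idx])
    return out

# Toy data standing in for one layer's KV cache and block queries.
rng = np.random.default_rng(0)
n_ctx, block, dim = 256, 8, 32
k = rng.normal(size=(n_ctx, dim))
v = rng.normal(size=(n_ctx, dim))
q_mask = rng.normal(size=(block, dim))        # queries of the All-[MASK] block

# One exact pass on the All-[MASK] block selects the indices ...
_, probs = exact_attention(q_mask, k, v)
idx = select_kv_indices(probs, mass=0.9)

# ... which every later denoising step of the same block reuses.
for step in range(3):
    q_step = q_mask + 0.1 * rng.normal(size=(block, dim))  # stand-in for updated queries
    out = sparse_attention(q_step, k, v, idx)

print(f"kept {len(idx)} of {n_ctx} KV entries")
```

In a real model the selection would run once per layer (and per head or head group), so layers with more diffuse attention would receive larger budgets under this rule. Random toy data has no concentrated attention pattern, so the number of entries kept here says nothing about the savings reported for actual models.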

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.