MiniMax Sparse Attention

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

MiniMax Sparse Attention (MSA) is a novel blockwise sparse attention mechanism built upon Grouped Query Attention (GQA), designed to enable ultra-long-context capabilities for frontier LLMs while mitigating the quadratic cost of softmax attention. MSA features a lightweight Index Branch that scores key-value blocks and selects a Top-$k$ subset for each GQA group, with a Main Branch then performing exact block-sparse attention over only these selected blocks. Co-designed with a GPU execution path, MSA achieves practical speedups, including $14.2\times$ prefill and $7.6\times$ decoding wall-clock speedups on H800 GPUs at 1M context. On a 109B-parameter model with native multimodal training, MSA performs on par with GQA, reducing per-token attention compute by $28.4\times$ at 1M context.

Key takeaway

For AI engineers building or deploying LLMs with ultra-long context requirements, MiniMax Sparse Attention offers a way around quadratic attention cost. It is worth evaluating for agentic workflows or repository-scale code reasoning, given its reported $14.2\times$ prefill and $7.6\times$ decoding speedups on H800 GPUs at 1M context, with quality on par with GQA on a 109B-parameter model. Consider starting from the open-source kernel and pretrained model.

Key insights

MSA enables ultra-long LLM contexts by efficiently sparsifying attention with a two-branch, block-level selection mechanism.

Principles

Simplicity and scalability are key for efficient GPU deployment.
Co-designing algorithms with GPU execution paths translates theoretical sparsity into practical speedups.
KL alignment loss and warmup stabilize sparse attention training.
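
The KL alignment principle can be illustrated with a minimal sketch. It assumes (this summary does not spell it out) that the Index Branch's block scores are trained to match the block-level attention mass of a dense attention distribution. The function name `block_kl_alignment`, the sum-pooling of attention into blocks, and the direction of the KL term are all illustrative assumptions, and the warmup schedule is left out because the summary does not describe it.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_kl_alignment(attn_probs, index_scores, block_size, eps=1e-9):
    """KL(target || predicted) over KV blocks for a single query.

    attn_probs:   (n_keys,) dense attention probabilities (assumed teacher signal)
    index_scores: (n_blocks,) raw block scores from a hypothetical index branch
    """
    n_blocks = index_scores.shape[0]
    # Sum the token-level attention mass inside each block to get a block-level target.
    target = attn_probs.reshape(n_blocks, block_size).sum(axis=1)
    # Turn the index branch's raw scores into a distribution over blocks.
    pred = softmax(index_scores)
    return float(np.sum(target * (np.log(target + eps) - np.log(pred + eps))))

# Toy example: 16 keys grouped into 4 blocks of 4.
rng = np.random.default_rng(0)
attn = softmax(rng.normal(size=16))
loss_uniform = block_kl_alignment(attn, np.zeros(4), block_size=4)
print(loss_uniform >= 0.0)
```

The loss shrinks toward zero as the block scores reproduce the dense attention mass per block, which is the property a Top-$k$ selector needs to keep the blocks that matter.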

Method

MSA uses an Index Branch to select a Top-$k$ set of key-value blocks for each GQA group and a Main Branch to run exact block-sparse attention over only those blocks. For GPU efficiency, it employs exp-free Top-$k$ selection and KV-outer sparse attention.
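
The two-branch flow can be sketched in NumPy for a single decoding step of one GQA group. This is an illustration under stated assumptions, not the paper's kernel: the names `select_blocks` and `block_sparse_attention`, the separate index projections `q_idx`/`k_idx`, and mean-pooling of token scores into block scores are hypothetical. The "exp-free" selection is read here as ranking raw scores directly, which is valid because softmax preserves ordering. The KV-outer loop order of the GPU kernel is not modeled.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def select_blocks(q_idx, k_idx, block_size, top_k):
    """Index branch (illustrative): score each KV block, return Top-k block ids.

    q_idx: (d_idx,) one index query shared by the GQA group (assumption)
    k_idx: (n_keys, d_idx) index keys (assumption)
    """
    n_blocks = k_idx.shape[0] // block_size
    token_scores = k_idx @ q_idx                               # (n_keys,)
    # Mean-pool token scores into one score per block (pooling choice is assumed).
    block_scores = token_scores.reshape(n_blocks, block_size).mean(axis=1)
    # Softmax is monotonic, so ranking raw scores equals ranking probabilities:
    # no exp is needed to pick the Top-k.
    top = np.argpartition(-block_scores, top_k - 1)[:top_k]
    return np.sort(top)

def block_sparse_attention(q, k, v, block_ids, block_size):
    """Main branch: exact softmax attention restricted to the selected blocks.

    q: (heads_in_group, d); k, v: (n_keys, d) shared across the group, as in GQA.
    No causal mask: for a decode step every cached key is already in the past.
    """
    token_ids = (block_ids[:, None] * block_size + np.arange(block_size)).ravel()
    k_sel, v_sel = k[token_ids], v[token_ids]
    logits = q @ k_sel.T / np.sqrt(q.shape[-1])
    return softmax(logits) @ v_sel

# Toy sizes: 64 cached keys in 8 blocks of 8; keep 3 blocks for the group.
rng = np.random.default_rng(0)
n_keys, d, d_idx, block_size, top_k = 64, 8, 4, 8, 3
q = rng.normal(size=(4, d))
k = rng.normal(size=(n_keys, d))
v = rng.normal(size=(n_keys, d))
q_idx = rng.normal(size=d_idx)
k_idx = rng.normal(size=(n_keys, d_idx))

ids = select_blocks(q_idx, k_idx, block_size, top_k)
out = block_sparse_attention(q, k, v, ids, block_size)
print(ids.shape, out.shape)  # (3,) (4, 8)
```

Because every query head in the group shares the same selected blocks, the gathered keys and values are loaded once per group. With `top_k` equal to the number of blocks, the sketch reduces to dense attention.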

In practice

Deploy MSA for LLMs requiring contexts up to 1M tokens.
Utilize the provided inference kernel for H800 GPU speedups.
Consider native sparse pretraining for optimal multimodal performance.

Topics

Sparse Attention
Long-Context LLMs
Grouped Query Attention
GPU Kernels
Multimodal Models
Inference Optimization

Best for: Research Scientist, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.