MiniMax Sparse Attention

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering) · Depth: Expert, extended

Summary

MiniMax Sparse Attention (MSA) is a novel blockwise sparse attention mechanism built upon Grouped Query Attention (GQA), designed to enable ultra-long-context capabilities for frontier LLMs while mitigating the quadratic cost of softmax attention. MSA features a lightweight Index Branch that scores key-value blocks and selects a Top-$k$ subset for each GQA group, with a Main Branch then performing exact block-sparse attention over only these selected blocks. Co-designed with a GPU execution path, MSA achieves practical speedups, including $14.2\times$ prefill and $7.6\times$ decoding wall-clock speedups on H800 GPUs at 1M context. On a 109B-parameter model with native multimodal training, MSA performs on par with GQA, reducing per-token attention compute by $28.4\times$ at 1M context.
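The two-branch flow described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the block summary (mean key), the group score (sum of per-head dot products), and the names `select_blocks` and `sparse_group_attention` are all hypothetical, and causal masking is omitted. It only shows the shape of the idea: one Top-$k$ block selection shared by all query heads in a GQA group, followed by exact softmax attention restricted to the selected blocks.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(q, keys, values):
    """Exact softmax attention of one query over the given keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, k) * scale for k in keys]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]

def select_blocks(group_queries, keys, block_size, top_k):
    """Hypothetical index branch: score every KV block, keep Top-k per group.

    Assumptions (not from the source): a block is summarized by its mean key,
    and the group score is the sum of each query head's dot product with it.
    """
    n_blocks = math.ceil(len(keys) / block_size)
    scores = []
    for b in range(n_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scores.append(sum(dot(q, mean_key) for q in group_queries))
    ranked = sorted(range(n_blocks), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])

def sparse_group_attention(group_queries, keys, values, block_size, top_k):
    """Main branch: exact attention over only the blocks the group selected."""
    chosen = select_blocks(group_queries, keys, block_size, top_k)
    token_ids = [i for b in chosen
                 for i in range(b * block_size,
                                min((b + 1) * block_size, len(keys)))]
    sel_keys = [keys[i] for i in token_ids]
    sel_values = [values[i] for i in token_ids]
    # Every query head in the group reuses the same selected blocks.
    outputs = [attend(q, sel_keys, sel_values) for q in group_queries]
    return chosen, outputs

random.seed(0)
dim, n_tokens, block_size = 8, 64, 8
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
# One GQA group: 4 query heads sharing a single KV head.
group = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(4)]

chosen, outputs = sparse_group_attention(group, keys, values, block_size, top_k=2)
print("selected blocks:", chosen)
```

Setting `top_k` to the total number of blocks makes the sketch reduce to dense attention, which is a convenient sanity check for any block-sparse variant.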

Key takeaway

For AI engineers building or deploying LLMs with ultra-long context requirements, MiniMax Sparse Attention directly targets the quadratic cost of softmax attention. It is worth evaluating for agentic workflows or repository-scale code reasoning, given its reported $14.2\times$ prefill and $7.6\times$ decoding wall-clock speedups on H800 GPUs at 1M context, with quality on par with GQA on a 109B-parameter model. The open-source kernel and pretrained model provide a starting point for integration.

Key insights

MSA enables ultra-long LLM contexts by efficiently sparsifying attention with a two-branch, block-level selection mechanism.

Method

MSA uses an Index Branch for group-specific Top-$k$ block selection and a Main Branch that performs exact attention over the selected blocks. For GPU efficiency, it employs exp-free Top-$k$ selection and KV-outer sparse attention.
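The summary does not define "exp-free" or "KV-outer", so the sketch below rests on one plausible reading of each, and both are assumptions. For exp-free selection: softmax is monotonic, so ranking blocks by raw scores yields the same Top-$k$ set as ranking by softmax probabilities, and the exponential can be skipped during selection. For KV-outer attention: the outer loop walks KV blocks and the inner loop visits only the queries that selected each block, with a running (online) softmax state per query. The function names are hypothetical, selections are given per query rather than per GQA group for brevity, and every query is assumed to select at least one block.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_blocks(scores, k):
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def kv_outer_attention(queries, keys, values, selections, block_size):
    """Outer loop over KV blocks; selections[i] lists the blocks of query i."""
    dim = len(values[0])
    scale = 1.0 / math.sqrt(len(queries[0]))
    n_blocks = math.ceil(len(keys) / block_size)
    # Invert the selection map: which queries read each KV block?
    readers = [[] for _ in range(n_blocks)]
    for qi, blocks in enumerate(selections):
        for b in blocks:
            readers[b].append(qi)
    run_max = [-math.inf] * len(queries)   # running max logit per query
    run_sum = [0.0] * len(queries)         # running softmax denominator
    acc = [[0.0] * dim for _ in queries]   # running weighted value sum
    for b, block_readers in enumerate(readers):
        lo, hi = b * block_size, min((b + 1) * block_size, len(keys))
        for qi in block_readers:           # unselected blocks cost nothing
            logits = [dot(queries[qi], keys[j]) * scale for j in range(lo, hi)]
            new_max = max(run_max[qi], max(logits))
            rescale = math.exp(run_max[qi] - new_max)  # 0.0 on the first block
            run_sum[qi] *= rescale
            acc[qi] = [a * rescale for a in acc[qi]]
            for j, logit in zip(range(lo, hi), logits):
                w = math.exp(logit - new_max)
                run_sum[qi] += w
                for d in range(dim):
                    acc[qi][d] += w * values[j][d]
            run_max[qi] = new_max
    return [[a / s for a in row] for row, s in zip(acc, run_sum)]

def direct_attention(q, keys, values, blocks, block_size):
    """Reference: gather the selected tokens, then apply one softmax."""
    ids = [j for b in blocks
           for j in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(q))
    probs = softmax([dot(q, keys[j]) * scale for j in ids])
    return [sum(p * values[j][d] for p, j in zip(probs, ids))
            for d in range(len(values[0]))]

# Exp-free selection: raw scores and softmax scores pick the same blocks.
block_scores = [0.3, -1.2, 2.5, 0.9, 1.7, -0.4]
same_selection = (top_k_blocks(block_scores, 2)
                  == top_k_blocks(softmax(block_scores), 2))

random.seed(1)
dim, n_tokens, block_size = 4, 24, 4
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
selections = [[0, 2], [2, 5], [1, 2, 4]]  # hypothetical Top-k results

kv_out = kv_outer_attention(queries, keys, values, selections, block_size)
ref_out = [direct_attention(q, keys, values, s, block_size)
           for q, s in zip(queries, selections)]
max_err = max(abs(a - b) for ko, ro in zip(kv_out, ref_out)
              for a, b in zip(ko, ro))
print(same_selection, max_err < 1e-9)
```

The appeal of the KV-outer ordering in this toy version is that each KV block is visited once and shared by every query that selected it. Whether the actual kernel is organized this way is not stated in the summary.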

Best for: Research Scientist, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.