MiniMax Sparse Attention
Summary
MiniMax Sparse Attention (MSA) is a blockwise sparse attention mechanism built on Grouped Query Attention (GQA). It is designed to give frontier LLMs ultra-long-context capabilities while mitigating the quadratic cost of softmax attention. A lightweight Index Branch scores key-value blocks and selects a Top-$k$ subset for each GQA group, and a Main Branch then performs exact block-sparse attention over only the selected blocks. MSA is co-designed with a GPU execution path and achieves wall-clock speedups of $14.2\times$ for prefill and $7.6\times$ for decoding on H800 GPUs at 1M context. On a 109B-parameter model with native multimodal training, MSA performs on par with GQA while reducing per-token attention compute by $28.4\times$ at 1M context.
Key takeaway
For AI engineers building or deploying LLMs with ultra-long context requirements, MiniMax Sparse Attention is a way to reduce the quadratic cost of attention. It is worth evaluating, especially for agentic workflows or repository-scale code reasoning, given its reported $14.2\times$ prefill and $7.6\times$ decoding speedups on H800 GPUs at 1M context, with model quality on par with GQA on a 109B-parameter model. Consider using the open-source kernel and pretrained model as a starting point.
Key insights
MSA enables ultra-long LLM contexts by efficiently sparsifying attention with a two-branch, block-level selection mechanism.
Principles
- Simplicity and scalability are key for efficient GPU deployment.
- Co-designing algorithms with GPU execution paths translates theoretical sparsity into practical speedups.
- KL alignment loss and warmup stabilize sparse attention training.
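The summary does not spell out the form of the KL alignment loss. A minimal sketch, assuming it pulls the Index Branch's distribution over blocks toward the attention mass that dense attention places on each block, might look like the following. The function names, the block-mass target, and the direction of the KL term are all assumptions made for illustration, not the authors' formulation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def block_attention_mass(token_logits, block_size):
    """Assumed target: dense attention probability summed over each KV block."""
    probs = softmax(token_logits)
    return [sum(probs[i:i + block_size]) for i in range(0, len(probs), block_size)]

def kl_alignment_loss(token_logits, index_scores, block_size, eps=1e-12):
    """KL(target || predicted), where predicted is a softmax over index scores.

    token_logits: per-token attention logits for one query (hypothetical input).
    index_scores: per-block scores from the Index Branch (hypothetical input).
    """
    target = block_attention_mass(token_logits, block_size)
    predicted = softmax(index_scores)
    return sum(p * (math.log(p + eps) - math.log(q + eps))
               for p, q in zip(target, predicted))
```

Under this assumption the loss is zero when the index scores rank and weight blocks exactly as dense attention does, and it grows as the index distribution drifts away from that target.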
Method
MSA uses an Index Branch for group-specific Top-$k$ block selection and a Main Branch for attention. It employs exp-free Top-$k$ selection and KV-outer sparse attention for GPU efficiency.
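The two-branch flow can be illustrated with a small sketch for a single GQA group at one query position. Everything here is a simplification under stated assumptions: blocks are summarized by mean-pooled keys, the index query is passed in directly, and "exp-free" is read as ranking raw scores without a softmax (Top-$k$ depends only on score order). The helper names are hypothetical, and the sketch does not model the KV-outer kernel layout or any GPU behavior.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_pool_blocks(keys, block_size):
    """Assumed block summary: average the key vectors inside each block."""
    blocks = [keys[i:i + block_size] for i in range(0, len(keys), block_size)]
    return [[sum(col) / len(b) for col in zip(*b)] for b in blocks]

def select_top_k_blocks(index_query, block_summaries, k):
    """Index Branch sketch: rank blocks by raw dot product and keep the k best.

    No exp is computed, because the ranking is unchanged by a softmax.
    Returns the selected block indices in ascending order.
    """
    scores = [dot(index_query, s) for s in block_summaries]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def block_sparse_attention(queries, keys, values, selected, block_size):
    """Main Branch sketch: exact softmax attention over selected blocks only.

    All query heads in the group share the same keys, values, and selection.
    """
    token_ids = [t for b in selected
                 for t in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        logits = [dot(q, keys[t]) * scale for t in token_ids]
        m = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - m) for l in logits]
        total = sum(weights)
        outputs.append([sum(w * values[t][d] for w, t in zip(weights, token_ids)) / total
                        for d in range(len(values[0]))])
    return outputs
```

Because the selection is made once per group rather than once per head, every query head in the group reads the same KV blocks. In this sketch the per-token attention cost scales with `k * block_size` instead of the full context length.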
In practice
- Deploy MSA for LLMs requiring contexts up to 1M tokens.
- Utilize the provided inference kernel for H800 GPU speedups.
- Consider native sparse pretraining for optimal multimodal performance.
Topics
- Sparse Attention
- Long-Context LLMs
- Grouped Query Attention
- GPU Kernels
- Multimodal Models
- Inference Optimization
Best for: Research Scientist, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.