Serving MiniMax-M3 for efficient inference: Unlocking 1M-Token Context and Multimodality Without Regrets
Summary
Together AI has become the preferred cloud partner for MiniMax M3, a new multimodal model with a 1M-token context window and native multimodal reasoning. Once M3's weights are openly released, Together AI will host the model as a developer endpoint. To serve M3 efficiently, Together AI's Inference and Kernel teams built several systems-level optimizations that raise throughput by 81-125% across various concurrency levels. The key optimizations are a KV-Block-Major sparse attention kernel, a new paged attention integration for MiniMax Sparse Attention (MSA), an optimized decode index scoring kernel, and a Rust-based multimodal preprocessing gateway (SMG). MSA, a core architectural change, cuts attention cost by capping the number of tokens each query attends to, which yields a reported speedup of over 9x in prefilling and 15x in decoding. SMG handles all vision preprocessing on the CPU, freeing GPU resources.
Key takeaway
AI engineers deploying large multimodal models with extensive context windows should prioritize deep systems-level optimizations. Consider implementing sparse attention kernels and offloading multimodal preprocessing to a CPU-based gateway such as SMG. Together AI's 81-125% throughput improvements for MiniMax M3 show that this approach can yield substantial gains and make 1M-token context and multimodality economically viable in production. Evaluate your own inference stack for similar kernel and preprocessing opportunities.
Key insights
Efficiently serving large multimodal models with long contexts requires deep systems-level optimization across multiple architectural components.
Principles
- Sparse attention significantly reduces long-context computation.
- Offload multimodal preprocessing to CPU gateways.
- Optimize kernel execution for specific attention architectures.
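To make the first principle concrete, here is a minimal pure-Python sketch of block-sparse attention for a single query. It is not MiniMax's MSA or Together AI's kernel: the block scoring rule (query dotted with the mean key of each block) and the names `select_blocks`, `sparse_attention`, `block_size`, and `top_k_blocks` are assumptions chosen for illustration. The point it shows is only that the query attends to a capped number of tokens regardless of context length.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dense_attention(q, keys, values):
    """Standard scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def select_blocks(q, keys, block_size, top_k_blocks):
    """Hypothetical index scoring: rank KV blocks by q . mean(keys in block)."""
    scored = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scored.append((dot(q, mean_key), start // block_size))
    scored.sort(key=lambda item: (-item[0], item[1]))  # best score first
    return sorted(idx for _, idx in scored[:top_k_blocks])


def sparse_attention(q, keys, values, block_size, top_k_blocks):
    """Attend only to tokens inside the top-k scored blocks."""
    chosen = select_blocks(q, keys, block_size, top_k_blocks)
    token_ids = [
        t
        for b in chosen
        for t in range(b * block_size, min((b + 1) * block_size, len(keys)))
    ]
    out = dense_attention(q, [keys[t] for t in token_ids],
                          [values[t] for t in token_ids])
    return out, token_ids
```

With `top_k_blocks` fixed, the softmax runs over at most `top_k_blocks * block_size` tokens however long the context grows. When `top_k_blocks` covers every block, the result matches `dense_attention`.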
Method
Together AI developed a KV-Block-Major sparse attention kernel, integrated MSA with paged attention, optimized the decode index scoring kernel, and built a Rust-based multimodal preprocessing gateway (SMG).
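The combination of paged attention with sparse block selection can be sketched as follows. This is a toy illustration, not Together AI's integration: the class `PagedKVCache`, its methods, and the assumption that one sparse block equals one cache page are all hypothetical. It shows the general idea of paged attention, where each sequence keeps a block table mapping logical KV blocks to non-contiguous physical pages, and a sparse kernel reads only the pages for the blocks it selected.

```python
class PagedKVCache:
    """Toy paged KV cache: fixed-size physical pages plus per-sequence block tables."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.pages = [[] for _ in range(num_pages)]
        # Reversed so that pop() hands out page 0 first.
        self.free_pages = list(range(num_pages - 1, -1, -1))
        self.block_tables = {}  # seq_id -> list of physical page ids
        self.lengths = {}       # seq_id -> number of cached tokens

    def append(self, seq_id, kv):
        """Store one token's KV entry, allocating a new page when the last is full."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.page_size == 0:
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            table.append(self.free_pages.pop())
        self.pages[table[-1]].append(kv)
        self.lengths[seq_id] = length + 1

    def gather_blocks(self, seq_id, logical_blocks):
        """Translate selected logical blocks to physical pages and read only those."""
        table = self.block_tables[seq_id]
        gathered = []
        for block in logical_blocks:
            gathered.extend(self.pages[table[block]])
        return gathered
```

For example, if two sequences append tokens in interleaved order, their pages end up interleaved in physical memory, yet `gather_blocks(seq_id, [1])` still returns exactly the tokens of that sequence's second logical block.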
In practice
- Use a KV-Block-Major kernel for sparse attention.
- Implement a Rust gateway for multimodal preprocessing.
- Adapt paged attention for sparse attention models.
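The preprocessing offload can be sketched in the same spirit. The actual SMG is written in Rust and its internals are not described in this summary, so everything below is an assumption made for illustration: the functions `preprocess_image` and `gateway_preprocess`, the grayscale input format, and the patch size. The sketch shows only the shape of the idea: a CPU-side stage normalizes and patchifies images so the GPU worker receives ready-to-embed patches.

```python
from concurrent.futures import ThreadPoolExecutor


def preprocess_image(pixels, patch):
    """Normalize 8-bit grayscale pixels to [0, 1] and cut into patch x patch tiles."""
    height, width = len(pixels), len(pixels[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([
                pixels[r][c] / 255.0
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return patches


def gateway_preprocess(images, patch=2, workers=4):
    """Hypothetical gateway stage: preprocess a batch of images on CPU workers.

    Threads keep the sketch simple. Pure-Python work does not run in parallel
    under CPython's GIL, so a real gateway would use native code for this step.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda img: preprocess_image(img, patch), images))
```

The output order matches the input order, so each request can be paired with its preprocessed patches before being forwarded to the GPU worker.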
Topics
- MiniMax M3
- Multimodal AI
- Sparse Attention
- Inference Optimization
- Paged Attention
- Together AI
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Together AI | The AI Native Cloud - Together.ai.