Serving MiniMax-M3 for efficient inference: Unlocking 1M-Token Context and Multimodality Without Regrets

Source: Together AI | The AI Native Cloud - Together.ai · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Software Development & Engineering · Depth: Expert, medium

Summary

Together AI has become the preferred cloud partner for MiniMax M3, a new multimodal model with a 1M-token context window and native multimodal reasoning. When the open weights are released, Together AI will host M3 as a developer endpoint. To serve M3 efficiently, Together AI's Inference and Kernel teams built several optimizations that raised throughput by 81-125% across various concurrency levels. The key optimizations are a KV-Block-Major sparse attention kernel, a novel paged attention integration for MiniMax Sparse Attention (MSA), an optimized decode index scoring kernel, and a Rust-based multimodal preprocessing gateway (SMG). MSA, a core architectural change, reduces the cost of attention by capping the number of tokens each query attends to, which yields more than a 9x speedup in prefill and a 15x speedup in decode. SMG handles all vision preprocessing on the CPU, freeing GPU resources.
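To make the idea of capping the tokens each query attends to concrete, here is a minimal pure-Python sketch of top-k sparse attention for a single query. It is a generic illustration, not MiniMax's MSA or Together AI's kernel: the function name `topk_sparse_attention`, the dot-product scoring, and the top-k selection rule are all assumptions made for this example, since the summary does not describe how MSA chooses its tokens.

```python
import math

def topk_sparse_attention(query, keys, values, k):
    """Attend to only the k highest-scoring keys for one query vector.

    Hypothetical helper for illustration; not the MSA algorithm itself.
    """
    scale = 1.0 / math.sqrt(len(query))
    # Score every key against the query with a scaled dot product.
    scores = [scale * sum(q * x for q, x in zip(query, key)) for key in keys]
    # Keep the indices of the k best-scoring keys; the rest are skipped.
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Numerically stable softmax over the kept scores only.
    top = max(scores[i] for i in keep)
    weights = [math.exp(scores[i] - top) for i in keep]
    total = sum(weights)
    # Weighted sum of the kept value vectors.
    dim = len(values[0])
    out = [0.0] * dim
    for w, i in zip(weights, keep):
        for d in range(dim):
            out[d] += (w / total) * values[i][d]
    return out, sorted(keep)

keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0], [5.0, 5.0]]
out, kept = topk_sparse_attention([1.0, 0.0], keys, values, k=2)
print(kept)  # indices of the two keys that were attended to
```

With `k` fixed, the softmax and the weighted sum touch at most `k` tokens regardless of context length. This sketch still scores every key to pick the top `k`, so in a real system the selection step has to be cheap as well; that is plausibly the role of the index scoring kernel named in the summary, though the summary does not spell this out.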

Key takeaway

AI engineers deploying large multimodal models with long context windows should prioritize deep systems-level optimization. Consider implementing sparse attention kernels and offloading multimodal preprocessing to a CPU-based gateway such as SMG. Together AI's 81-125% throughput improvements for MiniMax M3 show that this approach can yield substantial gains, making 1M-token context and multimodality economically viable in production. Evaluate your own inference stack for similar kernel and preprocessing opportunities.

Key insights

Efficiently serving large multimodal models with long contexts requires deep systems-level optimization across multiple architectural components.

Method

Together AI developed a KV-Block-Major sparse attention kernel, integrated MSA with paged attention, optimized decode index scoring, and implemented a Rust-based multimodal preprocessing gateway.
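The summary does not describe the internals of the paged attention integration, so the following is only a rough sketch of the bookkeeping that combining sparse token selection with a paged KV cache generally involves. It assumes a per-sequence block table that maps logical KV blocks to physical ones, and it groups the selected token positions by physical block so each block is visited once. Reading "KV-Block-Major" as "iterate block by block" is an interpretation for this example, and `BLOCK_SIZE`, `block_table`, and `group_by_physical_block` are hypothetical names.

```python
from collections import defaultdict

BLOCK_SIZE = 4  # tokens per KV block; an illustrative value, not from the source

def group_by_physical_block(token_positions, block_table, block_size=BLOCK_SIZE):
    """Map selected logical token positions to {physical block: [offsets]}."""
    groups = defaultdict(list)
    for pos in sorted(token_positions):
        # Split the logical position into a block index and an in-block offset.
        logical_block, offset = divmod(pos, block_size)
        # The block table says where that logical block lives in the cache.
        physical_block = block_table[logical_block]
        groups[physical_block].append(offset)
    return dict(groups)

# A 12-token sequence stored in three non-contiguous physical blocks.
block_table = [7, 2, 9]
# Token positions chosen by some sparse selection step (assumed input).
selected = [0, 1, 5, 10, 11, 3]
print(group_by_physical_block(selected, block_table))
# {7: [0, 1, 3], 2: [1], 9: [2, 3]}
```

Grouping by block means a kernel can load each physical block once and read only the selected offsets inside it, rather than chasing the block table separately for every selected token.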

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Together AI | The AI Native Cloud - Together.ai.