Serving MiniMax-M3 for efficient inference: Unlocking 1M-Token Context and Multimodality Without Regrets

2026-06-10 · Source: Together AI | The AI Native Cloud - Together.ai · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Software Development & Engineering · Depth: Expert, medium

Summary

Together AI has become the preferred cloud partner for MiniMax M3, a new multimodal model with a 1M-token context window and native multimodal reasoning, and will host M3 as a developer endpoint when the open weights are released. To serve M3 efficiently, Together AI's Inference and Kernel teams made several optimizations that together improved throughput by 81-125% across concurrency levels: a KV-Block-Major sparse attention kernel, a new paged attention integration for MiniMax Sparse Attention (MSA), an optimized decode index scoring kernel, and a Rust-based multimodal preprocessing gateway (SMG). MSA, a core architectural change in M3, reduces attention cost by capping the number of tokens each query attends to, for a reported speedup of over 9x in prefilling and 15x in decoding. The SMG handles all vision preprocessing on the CPU, freeing GPU resources.
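The cap that MSA places on how many tokens each query attends to can be pictured with a toy example. The sketch below is not the MSA algorithm or Together AI's kernel, neither of which is specified here. It is a generic top-k selection in plain Python, and the function name `capped_attention` and the parameter `max_tokens` are hypothetical. For simplicity it scores every key with a full dot product before selecting, whereas a production system would need a much cheaper index scoring step for the cap to save work.

```python
import math

def capped_attention(query, keys, values, max_tokens):
    """Attend to at most `max_tokens` keys, chosen by highest dot-product score.

    Illustrative only: selection here uses full dot products, which a real
    sparse attention kernel would replace with a cheaper index scoring step.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Keep only the indices of the top-scoring keys (this is the "cap").
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:max_tokens]
    # Numerically stable softmax over the kept scores only.
    peak = max(scores[i] for i in keep)
    weights = [math.exp(scores[i] - peak) for i in keep]
    total = sum(weights)
    dim = len(values[0])
    out = [0.0] * dim
    for w, i in zip(weights, keep):
        for d in range(dim):
            out[d] += (w / total) * values[i][d]
    return out, sorted(keep)

# One query, four cached tokens, cap of two attended tokens.
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0], [5.0, 5.0]]
output, attended = capped_attention([1.0, 0.0], keys, values, max_tokens=2)
```

With the cap set to 2, only tokens 0 and 2 contribute to the output; the other two are never read in the weighted sum, which is where the savings come from as the context grows.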

Key takeaway

AI engineers deploying large multimodal models with long context windows should prioritize deep systems-level optimization. Consider implementing sparse attention kernels and offloading multimodal preprocessing to a CPU-based gateway such as SMG. Together AI reports 81-125% throughput improvements for MiniMax M3 from this approach, which helps make 1M-token context and multimodality economically viable in production. Evaluate your own inference stack for similar kernel and preprocessing opportunities.

Key insights

Efficiently serving large multimodal models with long contexts requires deep systems-level optimization across multiple architectural components.

Principles

Sparse attention significantly reduces long-context computation.
Offload multimodal preprocessing to CPU gateways.
Optimize kernel execution for specific attention architectures.
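To make the first principle concrete, the following back-of-the-envelope sketch counts query-key pairs for causal dense attention versus attention with a per-query cap. The cap of 2048 is a hypothetical value chosen for illustration, not a figure from the source. The count also ignores the cost of choosing which tokens to keep, so it should not be read as a prediction of the end-to-end speedups quoted in the summary.

```python
def dense_pairs(n):
    # Causal dense attention: token i attends to itself and the i tokens before it.
    return n * (n + 1) // 2

def capped_pairs(n, cap):
    # Capped attention: each token attends to at most `cap` tokens.
    if n <= cap:
        return dense_pairs(n)
    # The first `cap` tokens behave densely; every later token attends to exactly `cap`.
    return dense_pairs(cap) + (n - cap) * cap

context = 1_000_000   # 1M-token context, as in the article
cap = 2048            # hypothetical per-query cap, for illustration only
ratio = dense_pairs(context) / capped_pairs(context, cap)
print(f"dense: {dense_pairs(context):,} pairs")
print(f"capped: {capped_pairs(context, cap):,} pairs")
print(f"reduction: {ratio:.0f}x fewer attended pairs")
```

Dense cost grows quadratically with context length while capped cost grows linearly, so the gap widens as the context gets longer.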

Method

Together AI's method involved developing a KV-Block-Major sparse attention kernel, integrating MSA with paged attention, optimizing decode index scoring, and implementing a Rust-based multimodal preprocessing gateway.
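Paged attention stores the KV cache in fixed-size physical blocks and uses a per-sequence block table to map logical positions to those blocks. A sparse model then needs to read only the blocks it selected. The sketch below shows that general idea with a toy cache in plain Python. It is not Together AI's MSA integration or the KV-Block-Major kernel, and the class `PagedKVCache` and its methods are hypothetical names.

```python
class PagedKVCache:
    """Toy paged KV cache: fixed-size physical blocks plus a per-sequence block table."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.blocks = [[] for _ in range(num_blocks)]    # physical storage
        self.free = list(range(num_blocks - 1, -1, -1))  # stack; pop() yields the lowest id
        self.block_tables = {}                           # seq_id -> list of physical block ids

    def append(self, seq_id, kv):
        """Append one token's KV entry, allocating a new block when the last one is full."""
        table = self.block_tables.setdefault(seq_id, [])
        if not table or len(self.blocks[table[-1]]) == self.block_size:
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.blocks[table[-1]].append(kv)

    def gather(self, seq_id, logical_blocks):
        """Read only the selected logical blocks, translated through the block table."""
        table = self.block_tables[seq_id]
        return [kv for lb in logical_blocks for kv in self.blocks[table[lb]]]

cache = PagedKVCache(num_blocks=4, block_size=2)
cache.append("a", "a0")
cache.append("b", "b0")          # a second sequence interleaves its allocations
for token in ["a1", "a2", "a3", "a4"]:
    cache.append("a", token)

# A sparse step that selected logical blocks 0 and 2 of sequence "a".
selected = cache.gather("a", [0, 2])
```

Because sequence "b" allocated a block in between, the logical blocks of "a" are not contiguous in physical storage. The block table hides that, and the gather touches only the two selected blocks rather than the whole sequence.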

In practice

Use a KV-Block-Major kernel for sparse attention.
Implement a Rust gateway for multimodal preprocessing.
Adapt paged attention for sparse models.
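The preprocessing item in the list above comes down to doing image work on CPU workers so that the GPU receives model-ready inputs. The source describes a Rust gateway (SMG) whose internals are not given here, so the following is only a Python sketch of the shape of the idea. The function names, the grayscale input format, and the patch size are all assumptions, and the thread pool is a stand-in for a separate CPU service (in CPython, pure-Python work in threads is limited by the GIL).

```python
from concurrent.futures import ThreadPoolExecutor

def preprocess_image(pixels, patch):
    """Scale 0-255 grayscale pixels to [0, 1] and cut the image into flat patch vectors."""
    height, width = len(pixels), len(pixels[0])
    if height % patch or width % patch:
        raise ValueError("image dimensions must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([
                pixels[r][c] / 255.0
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return patches

def gateway_batch(images, patch=2, workers=4):
    """Preprocess a batch on CPU workers; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda img: preprocess_image(img, patch), images))

# Two tiny 2x4 "images"; each becomes two 2x2 patches of normalized values.
image = [[0, 255, 0, 0],
         [255, 0, 0, 255]]
batch = gateway_batch([image, image])
```

The GPU-side worker in such a design would consume only the returned patch vectors, so decoding, resizing, and normalization never compete with attention kernels for accelerator time.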

Topics

MiniMax M3
Multimodal AI
Sparse Attention
Inference Optimization
Paged Attention
Together AI

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Together AI | The AI Native Cloud - Together.ai.