Mamba-3
Summary
Mamba-3 is a state space model (SSM) released on March 17, 2026, by researchers from CMU, Princeton, Cartesia AI, and Together AI. Its design prioritizes inference efficiency over training speed. The model introduces a more expressive recurrence formula, complex-valued state tracking, and a Multi-Input, Multi-Output (MIMO) variant, all intended to improve accuracy without increasing decoding latency. Empirically, the Mamba-3 SISO variant achieves lower prefill+decode latency than Mamba-2, Gated DeltaNet, and Llama-3.2-1B (Transformer) across all sequence lengths at the 1.5B scale, measured on an H100-SXM 80GB GPU with a batch size of 128. The MIMO variant raises accuracy by more than 1 percentage point at the 1B scale while keeping decoding speed comparable. The team also open-sourced highly optimized kernels built with Triton, TileLang, and CuTe DSL.
Key takeaway
For MLOps engineers and AI scientists deploying large language models, Mamba-3 offers a significant inference efficiency upgrade. Evaluate Mamba-3 SISO for applications that need the lowest prefill+decode latency, especially at the 1.5B scale. If higher accuracy is critical, Mamba-3 MIMO improves accuracy at comparable decoding speed. Use the open-sourced Triton, TileLang, and CuTe DSL kernels to get the most out of Hopper GPUs such as the H100 used in the reported benchmarks.
Key insights
Mamba-3 optimizes SSMs for inference efficiency through a more expressive recurrence, complex-valued states, and a MIMO variant, and it reports better speed and accuracy than prior models such as Mamba-2 and Gated DeltaNet.
Principles
- Inference-first design can improve model quality and speed.
- Fixed-size states require richer dynamics for expressivity.
- Hybrid models combine linear layers with self-attention.
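The second principle can be illustrated with a toy example. A transition restricted to positive real factors can only scale the state and cannot flip its sign by itself, whereas a complex-valued transition can rotate it. The sketch below tracks the parity of a bit stream in a single complex number. It is an illustration under that assumption only, not Mamba-3's actual layer, and the function names are hypothetical.

```python
import cmath

def parity_via_rotation(bits):
    """Track parity of a bit stream with one complex-valued state.

    Each 1-bit multiplies the state by exp(i*pi) = -1 and each 0-bit by
    exp(0) = 1. Toy illustration only; not the Mamba-3 layer.
    """
    h = 1 + 0j  # fixed-size state: a single complex number
    for b in bits:
        h *= cmath.exp(1j * cmath.pi * b)  # input-dependent rotation
    # The real part is close to +1 for even parity and -1 for odd parity.
    return 0 if h.real > 0 else 1

def parity_reference(bits):
    """Straightforward parity, used only to check the toy above."""
    return sum(bits) % 2

if __name__ == "__main__":
    stream = [1, 0, 1, 1]
    print(parity_via_rotation(stream), parity_reference(stream))
```

The state never grows with the length of the stream, which is the point of the principle: expressivity has to come from the dynamics rather than from a larger memory.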
Method
Mamba-3 enhances SSMs via an exponential-trapezoidal discretization scheme, complex-valued state tracking, and multi-input, multi-output (MIMO) SSMs.
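To make the discretization idea concrete, here is a minimal scalar sketch of a trapezoidal-style recurrence. The exact form is an assumption made for illustration: the state update mixes the previous and current input terms with a weight `lam`, using h_t = alpha * h_{t-1} + beta * B_{t-1} * x_{t-1} + gamma * B_t * x_t with alpha = exp(dt * a), beta = (1 - lam) * dt * alpha, and gamma = lam * dt. The parameter names are hypothetical, and the paper and released kernels remain the authority on the actual parameterization.

```python
import math

def exp_trapezoidal_scan(xs, deltas, a, bs, lam):
    """Scalar recurrence mixing the previous and current input terms.

    Assumed form (illustrative, not taken from the released kernels):
        h_t = alpha * h_{t-1} + beta * b_{t-1} * x_{t-1} + gamma * b_t * x_t
    With lam = 1.0 only the current input contributes.
    """
    h = 0.0
    prev_bx = 0.0  # b_{t-1} * x_{t-1}, zero before the first step
    out = []
    for x, dt, b in zip(xs, deltas, bs):
        alpha = math.exp(dt * a)        # decay of the previous state
        beta = (1.0 - lam) * dt * alpha  # weight on the previous input term
        gamma = lam * dt                 # weight on the current input term
        bx = b * x
        h = alpha * h + beta * prev_bx + gamma * bx
        prev_bx = bx
        out.append(h)
    return out

if __name__ == "__main__":
    print(exp_trapezoidal_scan([1.0, 2.0], [0.5, 0.5], -1.0, [1.0, 1.0], 0.5))
```

Setting `lam` to 1.0 drops the previous-input term, which makes it easy to compare the two-term update against a simpler one-term update on the same inputs.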
In practice
- Use Mamba-3 SISO for fastest prefill+decode latency.
- Consider Mamba-3 MIMO for higher accuracy at similar speeds.
- Integrate open-sourced kernels for hardware-optimized performance.
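One way to build intuition for the SISO versus MIMO choice is to compare the state updates directly. In the toy sketch below, the SISO step adds a rank-1 outer product to the state, while the MIMO step adds a rank-R product. The state has the same shape in both cases, so MIMO does more arithmetic per step without enlarging the memory that must be carried between steps. The shapes, function names, and FLOP count are assumptions for illustration and do not describe the released kernels.

```python
def siso_update(H, b, x):
    """Rank-1 update: H[n][p] += b[n] * x[p]."""
    return [[H[n][p] + b[n] * x[p] for p in range(len(x))]
            for n in range(len(b))]

def mimo_update(H, B, X):
    """Rank-R update: H[n][p] += sum over r of B[n][r] * X[r][p]."""
    R = len(X)
    return [[H[n][p] + sum(B[n][r] * X[r][p] for r in range(R))
             for p in range(len(X[0]))]
            for n in range(len(B))]

def update_flops(N, P, R=1):
    """One multiply and one add per (n, p, r) triple."""
    return 2 * N * P * R

if __name__ == "__main__":
    H = [[0.0, 0.0], [0.0, 0.0]]  # state of shape N x P = 2 x 2
    print(siso_update(H, [1.0, 2.0], [3.0, 4.0]))
    print(mimo_update(H, [[1.0, 0.0], [0.0, 1.0]], [[3.0, 4.0], [5.0, 6.0]]))
    print(update_flops(2, 2), update_flops(2, 2, R=4))
```

With R = 1 the MIMO update reduces to the SISO update, so the two variants differ only in how much work is done against the same fixed-size state.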
Topics
- Mamba-3
- State Space Models
- Inference Optimization
- Kernel Development
- Language Model Latency
- Model Architecture
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Together AI | The AI Native Cloud - Together.ai.