Mamba-3

2026-06-09 · Source: Together AI | The AI Native Cloud - Together.ai · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, long

Summary

Mamba-3 is a state space model (SSM) released on March 17, 2026, by researchers from CMU, Princeton, Cartesia AI, and Together AI. Its design prioritizes inference efficiency over training speed. It introduces a more expressive recurrence, complex-valued state tracking, and a Multi-Input, Multi-Output (MIMO) variant, all aimed at improving accuracy without increasing decoding latency. Empirically, at the 1.5B scale on an H100-SXM 80GB GPU with a batch size of 128, the single-input, single-output (SISO) variant of Mamba-3 achieves lower prefill+decode latency than Mamba-2, Gated DeltaNet, and the Transformer baseline Llama-3.2-1B across all sequence lengths. The MIMO variant improves accuracy by over 1 percentage point at the 1B scale while keeping decoding speed comparable. The team also open-sourced highly optimized kernels built with Triton, TileLang, and CuTe DSL.

Key takeaway

For MLOps engineers and AI scientists deploying large language models, Mamba-3 offers a meaningful inference-efficiency upgrade. Evaluate Mamba-3 SISO for applications that need the lowest prefill+decode latency, especially at the 1.5B scale. If accuracy matters more, Mamba-3 MIMO improves accuracy at comparable decoding speed. Use the open-sourced Triton, TileLang, and CuTe DSL kernels to get the most out of Hopper GPUs.

Key insights

Mamba-3 optimizes SSMs for inference efficiency through enhanced recurrence, complex states, and MIMO, surpassing prior models in speed and accuracy.

Principles

Inference-first design can improve model quality and speed.
Fixed-size states require richer dynamics for expressivity.
Hybrid models combine linear layers with self-attention.
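
To make the fixed-size-state principle concrete, the sketch below compares the decode-time memory of a Transformer KV cache, which grows with sequence length, against a recurrent SSM state, which does not. Every dimension here (layer count, head count, head and state sizes) is a hypothetical placeholder chosen for illustration, not a Mamba-3 or Llama configuration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Memory for a Transformer KV cache: one key and one value tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


def ssm_state_bytes(n_layers, n_heads, head_dim, state_dim, batch, bytes_per_elem=2):
    """Memory for a recurrent SSM state: independent of how many tokens were seen."""
    return n_layers * n_heads * head_dim * state_dim * batch * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical model shape, for illustration only.
    shape = dict(n_layers=24, head_dim=64, batch=128)
    state = ssm_state_bytes(n_heads=32, state_dim=128, **shape)
    for seq_len in (1_024, 8_192, 65_536):
        kv = kv_cache_bytes(n_kv_heads=8, seq_len=seq_len, **shape)
        print(f"seq_len={seq_len:>6}  kv_cache={kv / 2**30:8.2f} GiB  "
              f"ssm_state={state / 2**30:8.2f} GiB")
```

Because the state never grows, all expressivity has to come from how the state is updated, which is why the recurrence itself is the focus of the method.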

Method

Mamba-3 enhances SSMs via an exponential-trapezoidal discretization scheme, complex-valued state tracking, and multi-input, multi-output (MIMO) SSMs.
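
The three ingredients can be sketched in NumPy. This summary does not state the update rules, so everything below is an assumption made for illustration: a generalized trapezoidal update that blends the previous and current inputs with a hypothetical mixing weight `lam`, a toy rotation showing why complex-valued dynamics help with state tracking, and a MIMO step that replaces the rank-1 outer-product update with a rank-R matrix product. The function names, shapes, and parameterization are hypothetical and are not the released kernels.

```python
import numpy as np


def trapezoidal_scan(x, B, C, delta, a, lam):
    """SISO scan with an assumed generalized trapezoidal update.

    x: (T,) inputs; B, C: (T, N) projections; delta: (T,) step sizes;
    a: scalar decay rate (negative); lam: (T,) mixing weights in [0, 1].
    With lam == 1 the previous-input term vanishes and only the current
    input is injected at each step.
    """
    T, N = B.shape
    h = np.zeros(N)
    prev_bx = np.zeros(N)  # B_{t-1} * x_{t-1}, zero before the first step
    y = np.zeros(T)
    for t in range(T):
        alpha = np.exp(delta[t] * a)  # exponential decay of the state
        bx = B[t] * x[t]
        h = (alpha * h
             + (1.0 - lam[t]) * delta[t] * alpha * prev_bx  # decayed previous input
             + lam[t] * delta[t] * bx)                      # current input
        y[t] = C[t] @ h
        prev_bx = bx
    return y


def parity_via_rotation(bits):
    """Toy state tracking: a unit complex state rotates by pi on each 1 bit.

    A real state that only decays cannot flip sign this way; a rotation can.
    """
    h = 1.0 + 0.0j
    for b in bits:
        h *= np.exp(1j * np.pi * b)
    return 0 if h.real > 0 else 1


def mimo_step(H, alpha, B, X, C):
    """One assumed MIMO step. H: (N, P) state; B, C: (N, R); X: (P, R).

    With R == 1 the update B @ X.T is the usual rank-1 outer product.
    """
    H = alpha * H + B @ X.T
    Y = H.T @ C  # (P, R) outputs
    return H, Y
```

In this sketch the MIMO step keeps the state at the same size (N, P) while doing R times more arithmetic per step, which is one way to read the claim that MIMO improves accuracy without slowing decoding: decoding tends to be memory-bound, so extra compute per byte of state can be cheap.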

In practice

Use Mamba-3 SISO for fastest prefill+decode latency.
Consider Mamba-3 MIMO for higher accuracy at similar speeds.
Integrate open-sourced kernels for hardware-optimized performance.
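
When comparing variants on your own workload, it helps to time prefill and decode separately, since the two phases stress hardware differently. The harness below is a minimal sketch: `prefill` and `decode_step` are hypothetical callables that you would wrap around whichever model you are evaluating, not functions from any Mamba-3 package.

```python
import time
from typing import Any, Callable, Dict


def time_prefill_decode(
    prefill: Callable[[Any], Any],
    decode_step: Callable[[Any], Any],
    prompt: Any,
    n_decode: int,
) -> Dict[str, Any]:
    """Time one prefill call followed by n_decode sequential decode steps.

    Both callables are placeholders supplied by the caller. On a GPU they
    should block until the device has finished, or the timings will only
    measure kernel launch overhead.
    """
    start = time.perf_counter()
    state = prefill(prompt)
    prefill_s = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(n_decode):
        state = decode_step(state)
    decode_s = time.perf_counter() - start

    return {
        "prefill_s": prefill_s,
        "decode_s": decode_s,
        "total_s": prefill_s + decode_s,
        "final_state": state,
    }
```

Running the same harness over several prompt lengths and batch sizes gives a latency curve that can be compared directly across SISO, MIMO, and a Transformer baseline.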

Topics

Mamba-3
State Space Models
Inference Optimization
Kernel Development
Language Model Latency
Model Architecture

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Together AI | The AI Native Cloud - Together.ai.