NVIDIA AI Releases Nemotron 3 Ultra: An Open 550B Mixture-of-Experts Hybrid Mamba-Transformer for Long-Running Agents
Summary
NVIDIA has released Nemotron 3 Ultra, an open 550B-parameter Mixture-of-Experts model (55B active per token) engineered for long-running agents. Its hybrid Mamba-Attention MoE architecture has 108 layers, 512 experts per layer, and top-22 routing, and supports a 1M-token context window with flat decode costs. On efficiency, it reports 5.9x the throughput of GLM-5.1 (8K in / 64K out, NVFP4 on GB200) and approximately 30% lower task completion costs. A medium-effort mode cuts token usage by 2.5x at a cost of about 7% accuracy. Post-training uses Multi-teacher On-Policy Distillation (MOPD) within an SFT → RLVR → MOPD → MTP Boosting pipeline. Reported results on held-out gates include PinchBench 90.0, SWE-Bench Verified 71.9, RULER at 1M context 94.7, and 78.7 non-hallucination on AA-Omniscience. Weights, data, and recipes are open under OpenMDW-1.1.
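The sparsity figures in the summary can be checked with simple arithmetic. The sketch below uses only the numbers quoted above; the constant and function names are illustrative, and the derived ratios are back-of-the-envelope values, not published specifications.

```python
# Back-of-the-envelope sparsity ratios from the figures quoted in the summary.
TOTAL_PARAMS_B = 550      # total parameters, in billions
ACTIVE_PARAMS_B = 55      # parameters active per token, in billions
EXPERTS_PER_LAYER = 512
TOP_K = 22                # experts routed per token


def fraction(part: float, whole: float) -> float:
    """Return part / whole, rejecting a zero denominator."""
    if whole == 0:
        raise ValueError("whole must be non-zero")
    return part / whole


active_param_fraction = fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
routed_expert_fraction = fraction(TOP_K, EXPERTS_PER_LAYER)

print(f"active parameters per token: {active_param_fraction:.1%}")
print(f"experts routed per token:    {routed_expert_fraction:.1%}")
```

The active-parameter share (10%) is larger than the routed-expert share (about 4.3%), which is consistent with non-expert components running for every token.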
Key takeaway
For AI engineers building long-running agents, Nemotron 3 Ultra offers a compelling open foundation. Its Mamba-Attention hybrid architecture keeps decode costs flat even at 1M-token contexts, where the per-token cost of attention-only models typically grows with context length. Evaluate the reported 5.9x throughput and roughly 30% lower task completion costs against your own agentic workflows, and consider using the open weights and recipes to fine-tune for your application.
Key insights
Nemotron 3 Ultra is an open hybrid Mamba-Transformer MoE designed for efficient, long-context agentic AI with flat decode costs.
Principles
- Hybrid Mamba-Attention maintains flat decode costs.
- Multi-teacher distillation improves student model performance.
- Open weights foster model refinement.
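To illustrate the first principle, here is a toy cost model contrasting the per-token decode work of an attention layer with that of a state-space (Mamba-style) layer. It is a minimal sketch under stated assumptions: the cost formulas are simplified, the `head_dim` and `state_dim` defaults are hypothetical, and nothing here reflects Nemotron 3 Ultra's actual layer sizes or layer mix.

```python
# Toy per-token decode cost model (illustrative units, not measured numbers).

def attention_step_cost(context_len: int, head_dim: int = 128) -> int:
    """Work to decode one token in an attention layer: the new query is
    scored against every cached key and then mixes every cached value,
    so the cost scales with the number of tokens already in context."""
    return 2 * context_len * head_dim


def ssm_step_cost(state_dim: int = 128, head_dim: int = 128) -> int:
    """Work to decode one token in a state-space layer: update a
    fixed-size recurrent state, independent of context length."""
    return 2 * state_dim * head_dim


for n in (8_000, 64_000, 1_000_000):
    print(f"context={n:>9,}  attention={attention_step_cost(n):>12,}  "
          f"ssm={ssm_step_cost():>8,}")
```

In this model the attention cost grows linearly with context while the state-space cost stays constant. A hybrid that keeps some attention layers would sit between the two curves; how Nemotron 3 Ultra balances them is not described in this summary.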
Method
Post-training uses Multi-teacher On-Policy Distillation (MOPD) via an SFT → RLVR → MOPD → MTP Boosting pipeline to distill 10+ specialized teachers into one student model.
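The general idea of multi-teacher on-policy distillation can be sketched in a few lines. This is a toy illustration, not NVIDIA's recipe: the summary does not state the loss or how teachers are selected, so the reverse-KL objective, the per-domain teacher lookup, and every name below (`TEACHERS`, `student_logits`, `mopd_step`) are assumptions made for the example.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher): penalizes the student for placing
    probability mass where the teacher places little."""
    return sum(s * (math.log(s + eps) - math.log(t + eps))
               for s, t in zip(student_probs, teacher_probs))


# Hypothetical specialized teachers: each maps a prompt to logits
# over a 3-token vocabulary.
TEACHERS = {
    "code": lambda prompt: [2.0, 0.5, -1.0],
    "math": lambda prompt: [-1.0, 2.0, 0.5],
}


def student_logits(prompt, params):
    """Toy student: returns the same logits for every prompt."""
    return params


def mopd_step(prompt, domain, params, rng):
    """One distillation step: sample from the student, score with the
    teacher assigned to this domain, return (token, loss)."""
    s_probs = softmax(student_logits(prompt, params))
    # On-policy: the token is sampled from the student, not the teacher.
    token = rng.choices(range(len(s_probs)), weights=s_probs)[0]
    t_probs = softmax(TEACHERS[domain](prompt))
    return token, reverse_kl(s_probs, t_probs)


rng = random.Random(0)
token, loss = mopd_step("write a loop", "code", [0.0, 0.0, 0.0], rng)
print(f"sampled token={token}, reverse KL to code teacher={loss:.4f}")
```

In a real training loop the loss would be backpropagated through the student and accumulated over full sampled sequences; the sketch only shows where the student's own samples and the domain-specific teacher enter the computation.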
In practice
- Deploy for long-running AI agents needing 1M-token context.
- Utilize open weights for custom fine-tuning.
- Explore medium-effort mode for cost optimization.
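When weighing the medium-effort mode, a rough token-budget estimate can help. The sketch below applies the reported 2.5x token reduction to a hypothetical workload; the helper names and the per-million-token price are placeholders chosen for illustration, not published pricing.

```python
def medium_effort_tokens(full_effort_tokens: float, reduction: float = 2.5) -> float:
    """Estimate output tokens in medium-effort mode, given the
    reported reduction factor relative to full effort."""
    if reduction <= 0:
        raise ValueError("reduction must be positive")
    return full_effort_tokens / reduction


def output_cost(tokens: float, price_per_million: float) -> float:
    """Cost of generating `tokens` at a given price per million tokens."""
    return tokens / 1_000_000 * price_per_million


# Hypothetical task that emits 64K tokens at full effort,
# priced at a placeholder 1.00 per million output tokens.
full = 64_000
medium = medium_effort_tokens(full)
print(f"full effort:   {full:,} tokens, cost {output_cost(full, 1.00):.4f}")
print(f"medium effort: {medium:,.0f} tokens, cost {output_cost(medium, 1.00):.4f}")
```

The saving has to be weighed against the accuracy change reported for the mode, which depends on how sensitive a given agent task is to errors.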
Topics
- Nemotron 3 Ultra
- Mixture-of-Experts
- Mamba-Attention Architecture
- Long-Context AI Agents
- OpenMDW-1.1 License
- Model Distillation
Best for: AI Architect, NLP Engineer, CTO, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.