NVIDIA AI Releases Nemotron 3 Ultra: An Open 550B Mixture-of-Experts Hybrid Mamba-Transformer for Long-Running Agents

2026-06-04 · Source: Machine Learning ML & Generative AI News · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

NVIDIA has released Nemotron 3 Ultra, an open 550B-parameter Mixture-of-Experts model (55B active per token) engineered for long-running agents. Its hybrid Mamba-Attention MoE architecture has 108 layers, 512 experts per layer, and top-22 routing, and supports a 1M-token context window with flat decode costs. Reported efficiency is 5.9x the throughput of GLM-5.1 (8K in / 64K out, NVFP4 on GB200) and approximately 30% lower task completion cost. A medium-effort mode reduces token usage by 2.5x at a cost of about 7% accuracy. Post-training uses Multi-teacher On-Policy Distillation (MOPD) within an SFT → RLVR → MOPD → MTP Boosting pipeline. Results on held-out gates include PinchBench 90.0, SWE-Bench Verified 71.9, RULER at 1M context 94.7, and 78.7 for non-hallucination on AA-Omniscience. Weights, data, and recipes are open under OpenMDW-1.1.
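
To make the routing numbers concrete, here is a minimal sketch of top-k expert selection for a single token, using the 512-expert, top-22 figures from the summary. The function name `route_token`, the random logits, and the choice to softmax over only the selected experts are assumptions for illustration; the summary does not describe how the actual router normalizes its weights.

```python
import math
import random

NUM_EXPERTS = 512  # experts per layer, from the summary
TOP_K = 22         # experts selected per token, from the summary


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts.

    Weights are a softmax over the selected logits only. This is one common
    convention and an assumption here, not a documented Nemotron detail.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    peak = max(router_logits[i] for i in chosen)  # subtract max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


# Random logits stand in for a real router's output (illustrative only).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selection = route_token(logits)

expert_fraction = TOP_K / NUM_EXPERTS  # share of experts used per token per layer
param_fraction = 55 / 550              # active / total parameters, from the summary
print(f"experts used per layer: {expert_fraction:.1%}, active parameters: {param_fraction:.0%}")
```

Only about 4.3% of the experts in a layer fire for a given token, while 10% of all parameters are active. The gap is presumably accounted for by parameters outside the routed experts that run for every token, though the summary does not break this down.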

Key takeaway

For AI engineers building long-running agents, Nemotron 3 Ultra is a strong open foundation to evaluate. Its hybrid Mamba-Attention architecture is designed to keep decode costs flat even at 1M-token contexts, which addresses a known limitation of pure-attention models, whose KV cache grows with context length. Validate the reported 5.9x throughput and roughly 30% lower task completion cost against your own agentic workloads, since the throughput figure was measured in one configuration (8K in / 64K out, NVFP4 on GB200). Consider using the open weights and recipes to fine-tune for your specific application.

Key insights

Nemotron 3 Ultra is an open hybrid Mamba-Transformer MoE designed for efficient, long-context agentic AI with flat decode costs.

Principles

Hybrid Mamba-Attention keeps decode costs flat as context grows.
Multi-teacher distillation consolidates specialized teachers into a single student model.
Open weights, data, and recipes enable downstream fine-tuning.
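
The first principle can be illustrated with back-of-the-envelope memory arithmetic: an attention layer stores keys and values for every past token, while a state-space (Mamba-style) layer carries a fixed-size state. The sketch below assumes hypothetical shapes throughout. The 12/96 layer split, head counts, and state size are illustrative placeholders, not Nemotron 3 Ultra's published configuration.

```python
def kv_cache_bytes(context_len, attn_layers, kv_heads, head_dim, bytes_per_elem=2):
    # Keys and values (factor 2) are stored for every token in every attention layer.
    return 2 * context_len * attn_layers * kv_heads * head_dim * bytes_per_elem


def ssm_state_bytes(ssm_layers, state_elems_per_layer, bytes_per_elem=2):
    # A state-space layer keeps a fixed-size recurrent state, independent of context length.
    return ssm_layers * state_elems_per_layer * bytes_per_elem


# Hypothetical shapes chosen for illustration only.
ATTN_LAYERS, SSM_LAYERS = 12, 96
KV_HEADS, HEAD_DIM = 8, 128
STATE_ELEMS = 1_000_000

for context_len in (8_000, 64_000, 1_000_000):
    kv = kv_cache_bytes(context_len, ATTN_LAYERS, KV_HEADS, HEAD_DIM)
    ssm = ssm_state_bytes(SSM_LAYERS, STATE_ELEMS)
    print(f"{context_len:>9} tokens: KV cache {kv / 1e9:7.2f} GB, SSM state {ssm / 1e9:5.2f} GB")
```

Under these assumptions the state-space memory stays constant while the KV cache scales linearly with context. A hybrid still pays the linear term for whatever attention layers it retains, so the fewer there are, the flatter the decode cost.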

Method

Post-training uses Multi-teacher On-Policy Distillation (MOPD) via an SFT → RLVR → MOPD → MTP Boosting pipeline to distill 10+ specialized teachers into one student model.
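
The summary does not specify the MOPD objective. As a rough illustration of the general idea behind on-policy distillation with several teachers, the sketch below scores a student-generated rollout against the teacher assigned to its domain using a per-token reverse KL divergence. The function names, the domain-to-teacher lookup, the toy three-token vocabulary, and the choice of reverse KL are all assumptions, not NVIDIA's published recipe.

```python
import math


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) for one token position over a shared vocabulary."""
    return sum(
        s * (math.log(s + eps) - math.log(t + eps))
        for s, t in zip(student_probs, teacher_probs)
        if s > 0.0
    )


def multi_teacher_reverse_kl(domain, student_steps, teachers):
    """Average reverse KL over the positions of a student-sampled rollout.

    student_steps: list of (prefix, student_probs) recorded while the student generates,
                   which is what makes the signal on-policy.
    teachers: dict mapping a domain name to a callable prefix -> teacher_probs.
    """
    teacher = teachers[domain]  # pick the specialist for this prompt's domain
    per_token = [reverse_kl(probs, teacher(prefix)) for prefix, probs in student_steps]
    return sum(per_token) / len(per_token)


# Two hypothetical teachers that ignore the prefix and return fixed distributions.
teachers = {
    "code": lambda prefix: [0.7, 0.2, 0.1],
    "math": lambda prefix: [0.1, 0.2, 0.7],
}
# A two-step rollout sampled from the student: (prefix tokens, student distribution).
rollout = [((), [0.6, 0.3, 0.1]), ((0,), [0.7, 0.2, 0.1])]

code_loss = multi_teacher_reverse_kl("code", rollout, teachers)
math_loss = multi_teacher_reverse_kl("math", rollout, teachers)
print(f"loss vs code teacher: {code_loss:.4f}, loss vs math teacher: {math_loss:.4f}")
```

In this toy setup the student already resembles the "code" teacher, so its loss against that teacher is much smaller than against the "math" teacher. A real pipeline would backpropagate such a loss through the student and would need a rule for assigning prompts to the 10+ teachers.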

In practice

Deploy for long-running AI agents that need up to 1M tokens of context.
Use the open weights for custom fine-tuning.
Try medium-effort mode where a 2.5x token reduction justifies about 7% lower accuracy.

Topics

Nemotron 3 Ultra
Mixture-of-Experts
Mamba-Attention Architecture
Long-Context AI Agents
OpenMDW-1.1 License
Model Distillation

Best for: AI Architect, NLP Engineer, CTO, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.