Nemotron 3 Super: 1M Tokens, Small KV Cache
Summary
NVIDIA has released Nemotron 3 Super, a 120-billion-parameter Mixture-of-Experts (MoE) model with a hybrid Mamba-2 and full-attention architecture. Its 88-layer stack interleaves Mamba-2, LatentMoE, and attention layers to balance global recall with efficient sequence processing. The key innovation is LatentMoE, which compresses tokens into a lower-dimensional latent space before routing. This reduces communication overhead and allows 512 experts per layer with top-22 activation. Nemotron 3 Super was pre-trained on 25 trillion tokens in NVFP4 format, followed by a dedicated long-context extension phase that gives it a native 1M-token context length with a KV cache of only 7 GB at that length. Post-training involved supervised fine-tuning (SFT) on 7 million samples and reinforcement learning (RL) across 21 environments.
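To put the 7 GB figure in perspective, here is a rough back-of-the-envelope sketch in Python. The per-token budget follows directly from the numbers in the summary (assuming decimal gigabytes and exactly 1,000,000 tokens). The layer counts, KV-head count, head dimension, and 2-byte element size are hypothetical values chosen only to show how the cache scales with the number of attention layers; they are not Nemotron 3 Super's published configuration.

```python
def kv_cache_bytes(tokens, attn_layers, kv_heads, head_dim, bytes_per_elem=2):
    """Standard KV cache size: one key and one value tensor per attention layer."""
    return 2 * attn_layers * kv_heads * head_dim * bytes_per_elem * tokens


TOKENS = 1_000_000

# From the summary: 7 GB at 1M tokens (decimal GB assumed).
budget_per_token = 7e9 / TOKENS  # bytes of KV cache per token

# Hypothetical configurations, NOT the model's real settings.
# Only attention layers hold a KV cache; Mamba-2 layers keep a
# fixed-size state, so they are not counted here.
hybrid = kv_cache_bytes(TOKENS, attn_layers=8, kv_heads=2, head_dim=128)
all_attention = kv_cache_bytes(TOKENS, attn_layers=88, kv_heads=2, head_dim=128)

print(f"budget per token: {budget_per_token:.0f} bytes")
print(f"hypothetical hybrid stack: {hybrid / 1e9:.1f} GB")
print(f"hypothetical all-attention stack: {all_attention / 1e9:.1f} GB")
```

Under these assumed settings, the cache grows in direct proportion to the number of attention layers, which is why replacing most of them with Mamba-2 layers keeps a 1M-token cache in the single-digit gigabyte range.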
Key takeaway
For AI architects and engineers evaluating large language models for long-context applications, Nemotron 3 Super is worth a close look. Its hybrid Mamba-2 and attention architecture, combined with LatentMoE, enables a 1M-token context with a 7 GB KV cache. Consider it for workloads that need very long context windows, but compare its quantized performance against smaller, similarly quantized models such as Qwen3.5 27B to confirm that the larger model justifies its resource cost.
Key insights
Nemotron 3 Super combines Mamba-2, LatentMoE, and attention layers for efficient long-context processing.
Principles
- Hybrid architectures balance attention's precise recall with the linear sequence scaling of state-space models (SSMs).
- LatentMoE reduces communication overhead in MoE models.
- Dedicated long-context training improves practical usability.
Method
Nemotron 3 Super employs an 88-layer hybrid stack of Mamba-2, LatentMoE, and attention layers. In the LatentMoE layers, tokens are compressed into a lower-dimensional latent space, routed to experts there, then projected back. Pre-training uses NVFP4 and is followed by supervised fine-tuning (SFT) and reinforcement learning (RL).
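The compress, route, project-back flow described above can be illustrated with a toy sketch in pure Python. Everything here is an assumption made for illustration: the class name `ToyLatentMoE`, the tiny dimensions (8 hidden, 4 latent, 6 experts, top-2), the softmax over the selected router logits, and the single-matrix tanh experts. The real model uses 512 experts with top-22 activation, and its exact layer design is not given in this summary.

```python
import math
import random


def matvec(vec, mat):
    """Row vector (length n) times matrix (n x m) -> list of length m."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def rand_matrix(rows, cols, rng):
    """Small random weights; stands in for trained parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class ToyLatentMoE:
    """Hypothetical latent-space MoE layer: compress, route, mix, project back."""

    def __init__(self, hidden=8, latent=4, num_experts=6, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.down = rand_matrix(hidden, latent, rng)         # hidden -> latent
        self.up = rand_matrix(latent, hidden, rng)           # latent -> hidden
        self.router = rand_matrix(latent, num_experts, rng)  # latent -> expert logits
        self.experts = [rand_matrix(latent, latent, rng) for _ in range(num_experts)]

    def route(self, z):
        """Pick the top-k experts for latent vector z and normalize their gates."""
        logits = matvec(z, self.router)
        ranked = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)
        chosen = ranked[: self.top_k]
        peak = max(logits[e] for e in chosen)
        weights = [math.exp(logits[e] - peak) for e in chosen]
        total = sum(weights)
        return chosen, [w / total for w in weights]

    def forward(self, x):
        z = matvec(x, self.down)  # compress the token before routing
        chosen, gates = self.route(z)
        mixed = [0.0] * len(z)
        for expert_id, gate in zip(chosen, gates):
            # Each toy expert sees only the latent vector, not the full hidden state.
            out = [math.tanh(v) for v in matvec(z, self.experts[expert_id])]
            mixed = [m + gate * o for m, o in zip(mixed, out)]
        return matvec(mixed, self.up)  # project back to the hidden size


layer = ToyLatentMoE()
token = [0.5] * 8
output = layer.forward(token)
chosen, gates = layer.route(matvec(token, layer.down))
```

In this sketch the vector handed to each expert has the latent size (4) instead of the hidden size (8). In an expert-parallel deployment that smaller vector is what would travel between devices, which is one plausible reading of how routing in latent space reduces communication overhead.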
In practice
- Consider for tasks requiring very long context windows.
- Explore its open training recipe for architectural insights.
- Benchmark against Qwen3.5 27B and Qwen3.5 122B, with all models in NVFP4.
Topics
- Nemotron 3 Super
- Mixture-of-Experts
- Hybrid Mamba Architecture
- Long Context LLMs
- NVFP4 Quantization
Best for: Research Scientist, AI Architect, AI Engineer, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by The Kaitchup – AI on a Budget.