Nemotron 3 Super: 1M Tokens, Small KV Cache

2026-03-13 · Source: The Kaitchup – AI on a Budget · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Advanced, medium

Summary

NVIDIA has released Nemotron 3 Super, a 120-billion-parameter Mixture-of-Experts (MoE) model with a hybrid Mamba-2 and full-attention architecture. Its 88-layer stack interleaves Mamba-2, LatentMoE, and attention layers to balance global recall with efficient sequence processing. The key innovation is LatentMoE, which compresses tokens into a lower-dimensional latent space before routing; this reduces communication overhead and allows 512 experts per layer with top-22 activation. The model was pre-trained on 25 trillion tokens in the NVFP4 format, and a dedicated long-context extension phase gives it a native 1M-token context length with a KV cache of only 7 GB at 1M tokens. Post-training involved supervised fine-tuning on 7 million samples and reinforcement learning across 21 environments.
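To put the 7 GB figure in perspective, the arithmetic below is a rough sketch rather than a description of the model's internals. It relies on one general fact: Mamba-2 layers keep a fixed-size state, so only attention layers add per-token KV cache. The layer counts, head counts, head size, and 16-bit cache precision are hypothetical values chosen for illustration; they are not Nemotron 3 Super's published configuration.

```python
def kv_cache_bytes(tokens, attn_layers, kv_heads, head_dim, bytes_per_value):
    """KV cache size for standard attention: one K and one V tensor per attention layer."""
    return 2 * attn_layers * kv_heads * head_dim * bytes_per_value * tokens


TOKENS = 1_000_000

# Figure reported in the summary: 7 GB of KV cache at 1M tokens (decimal GB assumed).
reported_bytes = 7e9
reported_per_token = reported_bytes / TOKENS  # 7000.0 bytes per token

# HYPOTHETICAL hybrid stack: only a handful of the layers are attention layers.
hybrid = kv_cache_bytes(TOKENS, attn_layers=8, kv_heads=2, head_dim=128, bytes_per_value=2)

# HYPOTHETICAL all-attention stack of the same depth, for contrast.
dense = kv_cache_bytes(TOKENS, attn_layers=88, kv_heads=8, head_dim=128, bytes_per_value=2)

print(f"reported: {reported_per_token:.0f} bytes per token")
print(f"hybrid sketch: {hybrid / 1e9:.1f} GB at 1M tokens")
print(f"all-attention sketch: {dense / 1e9:.1f} GB at 1M tokens")
```

Under these assumed numbers the hybrid stack lands in the single-digit-GB range, while the all-attention stack needs hundreds of GB. That gap is why the share of attention layers matters so much at long context lengths.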

Key takeaway

For AI architects and engineers evaluating large language models for long-context applications, Nemotron 3 Super pairs a native 1M-token context with a 7 GB KV cache at that length, thanks to its hybrid Mamba-2/attention architecture and LatentMoE design. Consider it for workloads that need very long context windows, but compare its quantized performance against smaller, similarly quantized models such as Qwen3.5 27B before committing hardware to it.

Key insights

Nemotron 3 Super combines Mamba-2, LatentMoE, and attention layers for efficient long-context processing.

Principles

Hybrid architectures balance attention's precise recall with the efficient sequence-length scaling of state-space models (SSMs).
LatentMoE reduces communication overhead in MoE models.
Dedicated long-context training improves practical usability.

Method

Nemotron 3 Super employs an 88-layer hybrid stack of Mamba-2, LatentMoE, and attention layers. Tokens are compressed into a latent space for MoE routing, then projected back. Pre-training uses NVFP4, followed by SFT and RL.
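The compress, route, and project-back flow described above can be sketched in a few lines of plain Python. This is a toy illustration, not NVIDIA's implementation: the class name `LatentMoESketch`, the tiny dimensions, the single-matrix experts, and the softmax over the selected scores are all assumptions made for readability.

```python
import math
import random


def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def rand_matrix(rows, cols, rng):
    """Random Gaussian matrix scaled by 1/sqrt(cols)."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class LatentMoESketch:
    """Toy latent-space MoE layer (hypothetical structure, illustration only)."""

    def __init__(self, d_model, d_latent, n_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.down = rand_matrix(d_latent, d_model, rng)     # compress to latent space
        self.up = rand_matrix(d_model, d_latent, rng)       # project back to model width
        self.router = rand_matrix(n_experts, d_latent, rng)
        # Each expert is a single latent-to-latent linear map, for brevity.
        self.experts = [rand_matrix(d_latent, d_latent, rng) for _ in range(n_experts)]

    def route(self, z):
        """Pick the top-k experts for latent vector z and normalize their scores."""
        scores = matvec(self.router, z)
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        chosen = ranked[: self.top_k]
        return chosen, softmax([scores[i] for i in chosen])

    def forward(self, x):
        z = matvec(self.down, x)  # only d_latent values are dispatched to experts
        chosen, weights = self.route(z)
        mixed = [0.0] * len(z)
        for idx, weight in zip(chosen, weights):
            expert_out = matvec(self.experts[idx], z)
            mixed = [m + weight * o for m, o in zip(mixed, expert_out)]
        return matvec(self.up, mixed), chosen, weights


layer = LatentMoESketch(d_model=16, d_latent=4, n_experts=8, top_k=2)
token = [0.1 * i for i in range(16)]
output, chosen, weights = layer.forward(token)
print(len(output), chosen)
```

In this sketch each token sent to an expert carries 4 values instead of 16, which is the kind of saving that keeps communication manageable as the expert count grows. The real layer uses 512 experts with top-22 activation, and its router and expert structure may differ from this toy version.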

In practice

Consider for tasks requiring very long context windows.
Explore its open training recipe for architectural insights.
Benchmark against Qwen3.5 27B and 122B in NVFP4.
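Before benchmarking, a back-of-the-envelope estimate of weight memory helps frame the comparison. The sketch below assumes that NVFP4 stores 4-bit values with one 8-bit scale per 16-value block (about 4.5 bits per parameter) and that every parameter is quantized. Real checkpoints usually keep some layers in higher precision, so treat the results as approximate lower bounds. The function name `nvfp4_weight_gb` is hypothetical, and the parameter counts are taken from the model names.

```python
def nvfp4_weight_gb(params_billion, value_bits=4, scale_bits=8, block_size=16):
    """Approximate weight memory in GB, assuming every parameter is stored in NVFP4."""
    bits_per_param = value_bits + scale_bits / block_size  # 4.5 bits with the defaults
    return params_billion * bits_per_param / 8


# Parameter counts in billions, read off the model names.
models = {
    "Nemotron 3 Super (120B)": 120,
    "Qwen3.5 122B": 122,
    "Qwen3.5 27B": 27,
}

estimates = {name: nvfp4_weight_gb(size) for name, size in models.items()}

for name, gb in estimates.items():
    print(f"{name}: ~{gb:.1f} GB of weights")
```

These estimates cover weights only; KV cache and activations come on top. An MoE model must hold all of its experts in memory even though only a few are active per token, so the 27B model's much smaller footprint is the reason the comparison is worth running.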

Topics

Nemotron 3 Super
Mixture-of-Experts
Hybrid Mamba Architecture
Long Context LLMs
NVFP4 Quantization

Best for: Research Scientist, AI Architect, AI Engineer, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The Kaitchup – AI on a Budget.