Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation
Summary
Nemotron-Cascade 2 is an open 30B Mixture-of-Experts (MoE) model with 3B activated parameters, designed to deliver best-in-class reasoning and strong agentic capabilities. Despite its compact size, it achieves mathematical and coding reasoning performance comparable to frontier open models. It is the second open-weight LLM, following DeepSeekV3.2-Speciale-671B-A37B, to reach Gold Medal-level performance in the 2025 International Mathematical Olympiad (IMO), International Olympiad in Informatics (IOI), and ICPC World Finals. It does so with roughly 20x fewer total parameters than that model (30B versus 671B), which demonstrates high intelligence density. Two advancements drive these results. First, after Supervised Fine-Tuning (SFT) on a curated dataset, Cascade RL is expanded to cover a broader range of reasoning and agentic domains. Second, multi-domain on-policy distillation from strong intermediate teacher models is introduced during the Cascade RL process to recover benchmark regressions and sustain performance.
Key takeaway
For AI scientists and NLP engineers developing highly capable yet compact language models, Nemotron-Cascade 2 demonstrates that advanced reasoning and agentic capabilities are achievable with significantly fewer parameters. You should investigate integrating expanded Cascade RL and multi-domain on-policy distillation into your training pipelines to enhance performance and efficiency, especially for competitive benchmarks like the IMO or IOI.
Key insights
Nemotron-Cascade 2 achieves frontier-level reasoning with roughly 20x fewer total parameters than DeepSeekV3.2-Speciale-671B-A37B, via expanded Cascade RL and multi-domain on-policy distillation.
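The "20x" figure can be checked with quick arithmetic. The sketch below assumes the parameter counts can be read directly from the model names in the summary (30B total and 3B activated for Nemotron-Cascade 2; "671B-A37B" read as 671B total and 37B activated for the DeepSeek model). The variable and function names are illustrative only.

```python
# Parameter counts in billions, read from the model names in the summary.
# Assumption: "A37B" denotes 37B activated parameters per token.
NEMOTRON_TOTAL_B = 30.0
NEMOTRON_ACTIVE_B = 3.0
DEEPSEEK_TOTAL_B = 671.0
DEEPSEEK_ACTIVE_B = 37.0


def ratio(larger: float, smaller: float) -> float:
    """How many times larger `larger` is than `smaller`."""
    return larger / smaller


total_ratio = ratio(DEEPSEEK_TOTAL_B, NEMOTRON_TOTAL_B)
active_ratio = ratio(DEEPSEEK_ACTIVE_B, NEMOTRON_ACTIVE_B)
active_fraction = NEMOTRON_ACTIVE_B / NEMOTRON_TOTAL_B

print(f"total-parameter ratio:  {total_ratio:.1f}x")
print(f"active-parameter ratio: {active_ratio:.1f}x")
print(f"share of Nemotron-Cascade 2 parameters active per token: {active_fraction:.0%}")
```

On total parameters the gap is a little over 22x, which the summary rounds to 20x. On activated parameters the gap is smaller, at about 12x.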
Principles
- Compact models can achieve high intelligence density.
- Cascade RL can be expanded for broader domain coverage.
Method
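The method starts with Supervised Fine-Tuning (SFT) on a curated dataset, then applies expanded Cascade RL across diverse reasoning and agentic domains. During the Cascade RL process, multi-domain on-policy distillation from strong intermediate teacher models recovers benchmark regressions and sustains performance.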
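The control flow described above can be sketched as a toy simulation. This is not the authors' training code. Every name here (`Checkpoint`, `rl_stage`, `find_regressions`, `distill`, `cascade`), the score numbers, the regression tolerance, and the `recovery` fraction are hypothetical, chosen only to show the idea of running RL domain by domain and distilling from the best intermediate checkpoint when a benchmark regresses.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    name: str
    scores: dict  # toy benchmark score per domain


def rl_stage(ckpt, domain, gain, side_effects):
    """Toy stand-in for one RL stage: raise `domain`, shift other domains."""
    scores = dict(ckpt.scores)
    scores[domain] += gain
    for other, delta in side_effects.items():
        scores[other] += delta
    return Checkpoint(f"{ckpt.name}+rl:{domain}", scores)


def find_regressions(ckpt, best, tolerance):
    """Domains where `ckpt` trails the best earlier checkpoint by more than `tolerance`."""
    return [d for d, teacher in best.items()
            if ckpt.scores[d] < teacher.scores[d] - tolerance]


def distill(student, teachers, domains, recovery=0.8):
    """Toy distillation: close a fraction of the gap to each domain's teacher."""
    scores = dict(student.scores)
    for d in domains:
        gap = teachers[d].scores[d] - scores[d]
        scores[d] += recovery * gap
    return Checkpoint(f"{student.name}+opd", scores)


def cascade(sft, stages, tolerance=1.0):
    """Run RL stages in sequence, distilling whenever a domain regresses."""
    current = sft
    best = {d: sft for d in sft.scores}  # best checkpoint seen per domain
    for domain, gain, side_effects in stages:
        current = rl_stage(current, domain, gain, side_effects)
        regressed = find_regressions(current, best, tolerance)
        if regressed:
            current = distill(current, best, regressed)
        for d, score in current.scores.items():
            if score > best[d].scores[d]:
                best[d] = current
    return current


sft = Checkpoint("sft", {"math": 60.0, "code": 55.0, "agent": 40.0})
stages = [
    ("math", 10.0, {"code": -4.0}),
    ("code", 8.0, {"math": -1.0}),
    ("agent", 15.0, {"math": -5.0, "code": -3.0}),
]
final = cascade(sft, stages)
print(final.name)
for d, score in final.scores.items():
    print(f"{d}: {score:.1f}")
```

In this toy run, the first and third RL stages push another domain below its earlier best, so a distillation step follows each of them. The second stage stays within tolerance and needs none.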
In practice
- Utilize MoE architectures for parameter efficiency.
- Employ distillation to maintain performance during RL training.
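To make the distillation advice concrete, here is a minimal sketch of one common way to write an on-policy distillation objective: a per-token reverse KL divergence, KL(student || teacher), averaged over a sequence sampled from the student. The summary does not state which loss Nemotron-Cascade 2 uses, so treat this formulation, the function names, and the toy logits as assumptions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one next-token distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) for a single token position."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(si) * (si - ti) for si, ti in zip(s, t))


def on_policy_distill_loss(student_steps, teacher_steps):
    """Mean per-token reverse KL along a sequence sampled from the student.

    Both arguments hold one logit vector per token position. The teacher
    logits are assumed to be computed on the student's own sampled tokens,
    which is what makes the procedure on-policy.
    """
    per_token = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)


# Toy logits over a 3-token vocabulary for a 2-token sequence.
student_steps = [[2.0, 0.5, 0.1], [0.3, 0.3, 0.3]]
# At the second position the teacher logits differ only by a constant shift,
# so the two distributions match and that position contributes no loss.
teacher_steps = [[0.5, 2.0, 0.1], [1.3, 1.3, 1.3]]

loss = on_policy_distill_loss(student_steps, teacher_steps)
print(f"mean reverse KL: {loss:.4f}")
```

Reverse KL penalizes the student for placing probability on tokens the teacher considers unlikely. That is one reason it is often paired with sampling from the student itself.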
Topics
- Nemotron-Cascade 2
- Mixture-of-Experts
- Cascade RL
- On-Policy Distillation
- Large Language Models
Best for: NLP Engineer, AI Scientist, Research Scientist, AI Researcher, AI Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.