NVIDIA AI Releases Star Elastic: One Checkpoint that Contains 30B, 23B, and 12B Reasoning Models with Zero-Shot Slicing

· Source: Machine Learning ML & Generative AI News · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Advanced, quick

Summary

NVIDIA AI has released Star Elastic, a post-training method applied to Nemotron Nano v3 that allows 23B and 12B submodels to be extracted zero-shot from a single 30B parent checkpoint. The unified checkpoint is available in BF16, FP8, and NVFP4 formats. A learnable router, trained with Gumbel-Softmax, maps a target parameter budget to a nested configuration across elastic axes that include attention heads and FFN channels. A key finding is that using the 23B submodel for the "thinking phase" and the 30B model for the "final answer" yields a +16% accuracy improvement and 1.9x lower latency than standard budget control on benchmarks such as AIME-2025 and MMLU-Pro. The approach also cuts training token cost: it needs 360x fewer tokens than pretraining each variant from scratch and 7x fewer than sequential compression, while matching or exceeding independently trained baselines. The 12B NVFP4 variant broadens hardware access: it runs on an RTX 5080 and reaches 7,426 tokens/s on an RTX Pro 6000, which is 3.4x the throughput of the 30B BF16 baseline.
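To make the hardware figures concrete, here is a back-of-envelope sketch of weight storage per variant and format. It assumes the nominal sizes (30B, 23B, 12B) are exact parameter counts and that every weight is stored at the named precision; real checkpoints also carry quantization scale factors and may keep some layers at higher precision, and inference needs extra memory for the KV cache and activations. The helper name `weight_gb` is made up for illustration.

```python
# Nominal bits per parameter for each released format (scale factors ignored).
BITS_PER_PARAM = {"BF16": 16, "FP8": 8, "NVFP4": 4}

def weight_gb(params_billion, fmt):
    """Approximate weight storage in GB (10^9 bytes) for a given format."""
    return params_billion * BITS_PER_PARAM[fmt] / 8

for size in (30, 23, 12):
    for fmt in BITS_PER_PARAM:
        print(f"{size}B {fmt}: ~{weight_gb(size, fmt):.1f} GB")

# Baseline throughput implied by the summary's figures (7,426 tokens/s at 3.4x).
# This is derived arithmetic, not a number reported in the source.
implied_baseline_tps = 7426 / 3.4
print(f"implied 30B BF16 baseline: ~{implied_baseline_tps:.0f} tokens/s")
```

Under these assumptions the 12B NVFP4 weights come to roughly 6 GB versus roughly 60 GB for the 30B BF16 parent, which is consistent with the smaller variant fitting on a 16 GB card such as the RTX 5080.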

Key takeaway

For AI engineers optimizing large language model inference, Star Elastic offers a way to improve accuracy while reducing latency and cost. Assigning submodels by task phase (for example, the smaller 23B submodel for reasoning and the full 30B model for the final answer) produced the reported accuracy and latency gains, and the quantized 12B variant brings deployment within reach of GPUs such as the RTX 5080. Consider adopting the single-checkpoint, multi-model approach to improve GPU utilization and reduce operating cost, especially for complex reasoning tasks.
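The phase-based assignment can be sketched as a small routing policy. This is an illustration only: `PhasePolicy`, `plan_decoding`, and the `</think>` end-of-thinking marker are hypothetical names, not part of any NVIDIA API, and the sketch only labels which submodel would decode each token rather than running a model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PhasePolicy:
    """Maps a generation phase to a submodel size (hypothetical names)."""
    thinking: str = "23B"
    answer: str = "30B"

    def submodel_for(self, phase: str) -> str:
        if phase == "thinking":
            return self.thinking
        if phase == "answer":
            return self.answer
        raise ValueError(f"unknown phase: {phase!r}")

def plan_decoding(tokens, end_of_thinking="</think>", policy=PhasePolicy()):
    """Label each token with the submodel that would have decoded it."""
    plan = []
    phase = "thinking"
    for token in tokens:
        plan.append((token, policy.submodel_for(phase)))
        # Everything after the end-of-thinking marker belongs to the answer.
        if token == end_of_thinking:
            phase = "answer"
    return plan

plan = plan_decoding(["step", "1", "</think>", "42"])
print(plan)
```

Because both sizes live in one checkpoint, a serving stack would only need to change which nested configuration is active at the phase boundary; how that switch is implemented in practice is not described in the summary.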

Key insights

Star Elastic enables dynamic model sizing within a single checkpoint for optimized inference performance and resource utilization.

Method

A learnable router, trained via Gumbel-Softmax, maps target parameter budgets to optimal nested configurations across elastic axes like attention heads and FFN channels, based on pre-computed importance rankings.
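A minimal sketch of the two ideas in this paragraph, using toy numbers and pure Python. It is not NVIDIA's implementation: `nested_slice` and `gumbel_softmax` are hypothetical helper names, the importance scores and router logits are invented, and in a real system the logits would be produced by a router trained end to end on the target budget.

```python
import math
import random

def nested_slice(importance, keep):
    """Return indices of the `keep` most important units, in rank order.

    Every budget takes a prefix of the same ranking, so the units kept for
    a smaller budget are always a subset of those kept for a larger one.
    """
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return ranked[:keep]

def gumbel_softmax(logits, tau=1.0, rng=random):
    """Draw one relaxed (soft) one-hot sample over the candidate choices."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)); clamp U away from 0 and 1.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

# Toy importance scores for 8 FFN channels (invented numbers).
ffn_importance = [0.9, 0.1, 0.5, 0.7, 0.2, 0.8, 0.3, 0.4]
small = nested_slice(ffn_importance, keep=3)
large = nested_slice(ffn_importance, keep=6)

# Router logits over three candidate FFN widths for one budget (invented).
candidate_widths = [3, 6, 8]
probs = gumbel_softmax([0.2, 1.5, 0.1], tau=0.5, rng=random.Random(0))
chosen_width = candidate_widths[probs.index(max(probs))]
```

The prefix property is what makes zero-shot slicing possible: the smaller submodel's channels are already present in the larger one, so no separate weights need to be stored. The Gumbel-Softmax relaxation keeps the discrete choice of width differentiable during training; lowering `tau` pushes the sample closer to a hard one-hot selection.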

Best for: MLOps Engineer, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.