Creating the NVIDIA Nemotron 3 Ultra NVFP4 Checkpoint with NVIDIA Model Optimizer
Summary
NVIDIA has released the Nemotron 3 Ultra NVFP4 checkpoint, a quantized version of the 550B-parameter model produced with NVIDIA Model Optimizer. The checkpoint uses NVFP4, a 4-bit floating-point format introduced with the NVIDIA Blackwell architecture, and achieves up to 5.9x higher inference throughput than the GLM-5.1 754B FP4 model on decode-heavy tasks while maintaining BF16 accuracy. Quantization reduced the model's size from 1,121 GB in BF16 to 352.3 GB, a 3.2x reduction. The same checkpoint runs on both NVIDIA Hopper (W4A16: 4-bit weights, 16-bit activations) and Blackwell (W4A4: 4-bit weights and activations) hardware.

During development, the team compared scaling methods such as max and MSE scaling and ultimately favored "four-over-six scaling" because it minimizes reconstruction error: it cut median reconstruction MSE by 16.4% for MoE expert layers. The optimal effective bits-per-element (BPE) was found to be 5.03. Quantization of the 550B model was parallelized with NVIDIA Megatron-LM, which reduced calibration time from approximately 85 minutes to 9 minutes.
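The headline ratios in the summary can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above. The last value is an assumption-laden estimate: it treats "GB" as 10^9 bytes and the nominal 550B parameter count as exact, and it averages over the whole checkpoint file, so it is not the same quantity as the 5.03 effective bits-per-element figure and need not match it.

```python
# Figures quoted in the summary.
bf16_size_gb = 1121.0
nvfp4_size_gb = 352.3
calib_before_min = 85.0
calib_after_min = 9.0

# Assumption: nominal parameter count, taken as exactly 550 billion.
params = 550e9

size_reduction = bf16_size_gb / nvfp4_size_gb
calib_speedup = calib_before_min / calib_after_min

# Assumption: "GB" means 10**9 bytes. This is a whole-file average that also
# counts scales and any layers kept in higher precision.
avg_bits_per_param = nvfp4_size_gb * 1e9 * 8 / params

print(f"size reduction:          {size_reduction:.2f}x")
print(f"calibration speedup:     {calib_speedup:.1f}x")
print(f"avg stored bits / param: {avg_bits_per_param:.2f}")
```

The size ratio rounds to the quoted 3.2x, and the calibration-time change corresponds to a speedup of a little over 9x.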
Key takeaway
AI engineers optimizing large language model inference should consider the NVFP4 quantization strategy used for Nemotron 3 Ultra. With NVIDIA Model Optimizer, this approach delivered up to 5.9x higher throughput than the GLM-5.1 754B FP4 model on decode-heavy tasks and a 3.2x smaller footprint than BF16, without sacrificing BF16 accuracy. Explore adaptive scaling methods such as "four-over-six" and use `auto_quantize` to find the bits-per-element budget that suits your model, so that a single checkpoint deploys efficiently on both NVIDIA Hopper and Blackwell hardware.
Key insights
High-quality 4-bit quantization for large language models requires sophisticated, adaptive scaling and mixed-precision strategies.
Principles
- Mixed-precision quantization optimizes performance.
- Outlier-sensitive scaling degrades model accuracy.
- Adaptive scaling improves 4-bit quantization quality.
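To make the contrast between max scaling and adaptive scaling concrete, here is a minimal sketch of the idea usually described as four-over-six scaling. It rests on assumptions not stated in this summary: that NVFP4 values use the FP4 E2M1 grid with one shared scale per block of 16 elements, and that four-over-six means trying both 6 and 4 as the value the block maximum maps to, then keeping whichever gives the lower error. The helper names (`fake_quantize`, `four_over_six`) are hypothetical, the block scale is kept in full precision rather than FP8, and this is not Model Optimizer's implementation.

```python
import random

# Non-negative magnitudes representable in FP4 E2M1 (sign handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # assumed NVFP4 block size: one scale per 16 elements


def fake_quantize(block, target_max):
    """Scale the block so its largest magnitude maps to target_max, round each
    value to the nearest E2M1 magnitude, then scale back.

    Simplification: the block scale stays in full precision here.
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    scale = amax / target_max
    out = []
    for x in block:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append((mag if x >= 0 else -mag) * scale)
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def four_over_six(block):
    """Try mapping the block maximum to 6 and to 4; keep the lower-error result."""
    candidates = [(t, fake_quantize(block, t)) for t in (6.0, 4.0)]
    return min(candidates, key=lambda c: mse(block, c[1]))


random.seed(0)
blocks = [[random.gauss(0.0, 1.0) for _ in range(BLOCK_SIZE)] for _ in range(1000)]
max_err = sum(mse(b, fake_quantize(b, 6.0)) for b in blocks) / len(blocks)
fos_err = sum(mse(b, four_over_six(b)[1]) for b in blocks) / len(blocks)
print(f"max scaling (map to 6) mean MSE: {max_err:.5f}")
print(f"four-over-six mean MSE:          {fos_err:.5f}")
```

Because the adaptive variant picks the better of two candidates per block, its error in this sketch can never exceed that of always mapping the maximum to 6. Mapping to 4 can help because the jump from 4 to 6 is the widest gap in the grid.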
Method
First convert the pretrained checkpoint to Megatron-LM format, then run `quantize.sh` with an NVFP4 config. NVIDIA Model Optimizer's `auto_quantize` can search for the best format for each layer, given a target bit budget and each layer's sensitivity to quantization.
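The idea of choosing per-layer formats under a bit budget can be illustrated with a toy greedy search. This is only a sketch and does not reproduce how `auto_quantize` works internally, which this summary does not describe. The `Layer` class, the `search_formats` function, the parameter counts, and the sensitivity scores are all hypothetical. The per-format bit costs are also assumptions: 4.5 for NVFP4 (4-bit values plus one 8-bit scale per 16 elements), 8 for FP8, and 16 for BF16.

```python
from dataclasses import dataclass

# Assumed effective bits per element for each candidate format.
FORMAT_BITS = {"nvfp4": 4.5, "fp8": 8.0, "bf16": 16.0}
ORDER = ["nvfp4", "fp8", "bf16"]  # cheapest to most precise


@dataclass
class Layer:
    name: str
    num_params: int
    sensitivity: dict  # hypothetical loss increase per format (bf16 -> 0.0)


def effective_bits(layers, choice):
    """Parameter-weighted average bits per element for a format assignment."""
    total = sum(layer.num_params for layer in layers)
    used = sum(layer.num_params * FORMAT_BITS[choice[layer.name]] for layer in layers)
    return used / total


def search_formats(layers, target_bits):
    """Toy greedy search: start every layer at NVFP4, then repeatedly apply the
    single-step upgrade with the best sensitivity reduction per extra bit that
    still keeps the model within target_bits."""
    choice = {layer.name: "nvfp4" for layer in layers}
    while True:
        best = None
        for layer in layers:
            cur = choice[layer.name]
            idx = ORDER.index(cur)
            if idx + 1 == len(ORDER):
                continue  # already at the highest precision
            nxt = ORDER[idx + 1]
            trial = dict(choice)
            trial[layer.name] = nxt
            if effective_bits(layers, trial) > target_bits:
                continue  # upgrade would exceed the budget
            gain = layer.sensitivity[cur] - layer.sensitivity[nxt]
            cost = layer.num_params * (FORMAT_BITS[nxt] - FORMAT_BITS[cur])
            if gain > 0 and (best is None or gain / cost > best[0]):
                best = (gain / cost, layer.name, nxt)
        if best is None:
            return choice
        choice[best[1]] = best[2]


# Hypothetical three-layer model with made-up sizes and sensitivity scores.
layers = [
    Layer("experts", 900, {"nvfp4": 0.10, "fp8": 0.05, "bf16": 0.0}),
    Layer("attention", 80, {"nvfp4": 0.40, "fp8": 0.05, "bf16": 0.0}),
    Layer("lm_head", 20, {"nvfp4": 0.30, "fp8": 0.10, "bf16": 0.0}),
]
choice = search_formats(layers, target_bits=5.03)
print(choice, round(effective_bits(layers, choice), 2))
```

In this toy setup the large expert block stays in NVFP4 while the small, sensitive layers are kept at higher precision. This shows how an average above 4.5 bits per element can come about.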
In practice
- Apply Model Optimizer for NVFP4 quantization.
- Implement "four-over-six scaling" for MoE layers.
- Use `auto_quantize` to find the optimal bits-per-element (BPE).
Topics
- NVFP4 Quantization
- NVIDIA Model Optimizer
- Nemotron 3 Ultra
- Large Language Models
- Mixed-Precision Inference
- Four-over-six Scaling
Best for: Machine Learning Engineer, AI Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.