Creating the NVIDIA Nemotron 3 Ultra NVFP4 Checkpoint with NVIDIA Model Optimizer

2026-06-26 · Source: NVIDIA Technical Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, long

Summary

NVIDIA has released the Nemotron 3 Ultra NVFP4 checkpoint, a quantized version of the 550B model produced with NVIDIA Model Optimizer. The checkpoint uses NVFP4, a 4-bit floating-point format introduced with the NVIDIA Blackwell architecture, and achieves up to 5.9x higher inference throughput than the GLM-5.1 754B FP4 model on decode-heavy tasks while maintaining BF16 accuracy. Quantization reduced the model's size from 1,121 GB in BF16 to 352.3 GB, a 3.2x reduction. The same checkpoint runs on both NVIDIA Hopper (W4A16: 4-bit weights, 16-bit activations) and Blackwell (W4A4: 4-bit weights and activations) hardware. The team compared scale-selection methods such as max and MSE and ultimately favored "four-over-six scaling" because it minimizes reconstruction error: it cut median reconstruction MSE by 16.4% for MoE expert layers. The optimal effective bits-per-element (BPE) was determined to be 5.03. Quantization of the 550B model was parallelized with NVIDIA Megatron-LM, reducing calibration time from approximately 85 minutes to 9 minutes.
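The size figures above can be cross-checked with a few lines of arithmetic. The sketch below assumes the 1,121 GB baseline stores every element in 16 bits and that both sizes use the same GB unit; the variable names are illustrative. Under those assumptions, the average bits per element implied by the two checkpoint sizes comes out close to the 5.03 BPE quoted in the summary, although the source does not state that the figure was derived this way.

```python
# Sizes quoted in the summary (GB).
bf16_gb = 1121.0
nvfp4_gb = 352.3

# Compression ratio of the quantized checkpoint relative to BF16.
ratio = bf16_gb / nvfp4_gb

# Average bits per element implied by the sizes, assuming the BF16
# baseline spends exactly 16 bits on every element (an assumption).
effective_bits = 16.0 * nvfp4_gb / bf16_gb

print(f"compression: {ratio:.2f}x, implied bits per element: {effective_bits:.2f}")
```

A pure NVFP4 tensor costs about 4.5 bits per element (4-bit values plus one 8-bit scale per 16-element block), so an average near 5 is consistent with a mixed-precision checkpoint in which some layers stay in a wider format.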

Key takeaway

AI engineers optimizing large language model inference should consider the quantization strategy behind the Nemotron 3 Ultra NVFP4 checkpoint. Built with NVIDIA Model Optimizer, the checkpoint delivers up to 5.9x higher throughput than the GLM-5.1 754B FP4 model on decode-heavy tasks and is 3.2x smaller than its BF16 original, without sacrificing BF16 accuracy. Explore adaptive scaling methods such as "four-over-six" and use `auto_quantize` to find the optimal bits-per-element for your own models, so that one checkpoint can be deployed on both Hopper and Blackwell hardware.

Key insights

High-quality 4-bit quantization for large language models requires sophisticated, adaptive scaling and mixed-precision strategies.

Principles

Mixed-precision quantization optimizes performance.
Outlier-sensitive scaling degrades model accuracy.
Adaptive scaling improves 4-bit quantization quality.

Method

The process involves converting a pretrained checkpoint to Megatron-LM format, then using `quantize.sh` with an NVFP4 config. NVIDIA Model Optimizer's `auto_quantize` can search for optimal per-layer formats based on a target bit budget and layer sensitivity.
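To make the idea of a bit-budget search concrete, here is a toy greedy sketch in plain Python. It is not Model Optimizer's `auto_quantize` implementation or API: the `Layer` class, the sensitivity numbers, the candidate formats, and the greedy rule are all hypothetical and chosen only for illustration. The one grounded constant is that NVFP4 costs roughly 4.5 bits per element once the per-block scale is counted.

```python
from dataclasses import dataclass

# Effective bits per element for each candidate format (illustrative set).
# NVFP4: 4-bit values plus one 8-bit scale per 16-element block.
BITS = {"nvfp4": 4.5, "fp8": 8.0, "bf16": 16.0}

@dataclass
class Layer:
    name: str
    numel: int
    sensitivity: dict  # format -> estimated loss increase (hypothetical scores)

def effective_bits(layers, assignment):
    """Element-weighted average bits per element for a format assignment."""
    total = sum(layer.numel for layer in layers)
    return sum(layer.numel * BITS[assignment[layer.name]] for layer in layers) / total

def greedy_search(layers, target_bits):
    """Start all-NVFP4, then apply the upgrades with the best loss reduction
    per extra bit, skipping any upgrade that would exceed the budget."""
    assignment = {layer.name: "nvfp4" for layer in layers}
    moves = []
    for layer in layers:
        for fmt in ("fp8", "bf16"):
            gain = layer.sensitivity["nvfp4"] - layer.sensitivity[fmt]
            cost = layer.numel * (BITS[fmt] - BITS["nvfp4"])
            moves.append((gain / cost, layer.name, fmt))
    for _, name, fmt in sorted(moves, reverse=True):
        if BITS[fmt] <= BITS[assignment[name]]:
            continue  # never downgrade a layer that was already upgraded
        trial = {**assignment, name: fmt}
        if effective_bits(layers, trial) <= target_bits:
            assignment = trial
    return assignment

# Hypothetical model: one large, robust expert block and two small, sensitive layers.
layers = [
    Layer("experts", 900, {"nvfp4": 0.01, "fp8": 0.005, "bf16": 0.0}),
    Layer("attention", 50, {"nvfp4": 0.20, "fp8": 0.05, "bf16": 0.0}),
    Layer("lm_head", 50, {"nvfp4": 0.10, "fp8": 0.04, "bf16": 0.0}),
]
assignment = greedy_search(layers, target_bits=5.03)
print(assignment, effective_bits(layers, assignment))
```

In this toy setup the large expert block stays in NVFP4 while the small, sensitive layers receive the extra bits, which is the general shape of a mixed-precision result under a tight budget.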

In practice

Apply Model Optimizer for NVFP4 quantization.
Implement "four-over-six scaling" for MoE layers.
Use `auto_quantize` to find the optimal bits-per-element (BPE).
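The summary does not spell out how "four-over-six scaling" works. The NumPy sketch below follows the common description of the technique: for each block, try a scale that maps the block's largest magnitude to 6 (the top of the FP4 E2M1 range) and one that maps it to 4, then keep whichever reconstruction has lower MSE. This is an assumption about the method, not Model Optimizer's code. The function names are illustrative, and the sketch keeps block scales in full precision, whereas real NVFP4 stores them in FP8.

```python
import numpy as np

# Non-negative magnitudes representable in FP4 (E2M1).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(block, target_max):
    """Fake-quantize a block so that its largest magnitude maps to target_max."""
    amax = np.max(np.abs(block))
    if amax == 0.0:
        return block.copy()
    scale = amax / target_max  # kept in full precision here (simplification)
    scaled = np.abs(block) / scale
    # Round each scaled magnitude to the nearest representable FP4 value.
    idx = np.argmin(np.abs(scaled[:, None] - FP4_GRID[None, :]), axis=1)
    return np.sign(block) * FP4_GRID[idx] * scale

def four_over_six(block):
    """Return the lower-MSE reconstruction and the target maximum that produced it."""
    targets = (6.0, 4.0)
    candidates = [quantize_block(block, t) for t in targets]
    errors = [float(np.mean((block - c) ** 2)) for c in candidates]
    best = int(np.argmin(errors))
    return candidates[best], targets[best], errors

# A block whose values cluster just below the maximum: mapping the max to 6
# leaves a coarse gap between 4 and 6, so mapping it to 4 reconstructs better.
dense_block = np.array([6.0] + [5.0] * 15)
_, dense_choice, dense_errors = four_over_six(dense_block)

# A block with one outlier and small values elsewhere: mapping the max to 6 is exact.
outlier_block = np.array([6.0] + [1.0] * 15)
_, outlier_choice, outlier_errors = four_over_six(outlier_block)

print(dense_choice, outlier_choice)
```

Because the choice is made per block, a layer can mix both scalings, which is why the approach can lower reconstruction error without changing the storage format.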

Topics

NVFP4 Quantization
NVIDIA Model Optimizer
Nemotron 3 Ultra
Large Language Models
Mixed-Precision Inference
Four-over-six Scaling

Best for: Machine Learning Engineer, AI Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.