Train Models Faster with JAX and MaxText Using NVFP4 on NVIDIA Blackwell
Summary
NVIDIA's NVFP4 training recipe, integrated into TransformerEngine for JAX, enables high-throughput 4-bit mixed-precision pretraining on NVIDIA Blackwell GPUs. This sub-byte format, supported natively on the NVIDIA GB300 Grace Blackwell Ultra Superchip, delivers up to 7x GEMM throughput compared to native FP8 on NVIDIA Hopper. In benchmarks on Llama 3 8B and Llama 3.1 405B, NVFP4 runs 1.31–1.73x faster than FP8 baselines on GB200 and GB300 with negligible accuracy loss: the Llama 3 8B loss curve tracks the FP8 baseline with a mean gap of only +0.026 nats over 10,000 steps. The MaxText NVFP4 recipe is available in the JAX-Toolbox GitHub repository.
Key takeaway
For AI engineers and scientists pretraining large language models, NVFP4 on NVIDIA Blackwell delivers 1.31–1.73x higher training throughput than FP8 for models such as Llama 3 8B and Llama 3.1 405B on GB200/GB300, with negligible accuracy loss. Integrating the MaxText NVFP4 recipe into a JAX-based workflow shortens training step times and lowers compute cost, so more model development fits within an existing budget.
Key insights
NVFP4 enables faster 4-bit LLM pretraining on NVIDIA Blackwell GPUs while closely tracking FP8 accuracy.
Principles
- Sub-byte precision boosts LLM training throughput.
- Hardware-native low-precision formats yield substantial gains.
- Restricting quantization to MLP layers maintains accuracy.
Method
The NVFP4 recipe combines 16-element micro-block scaling, E4M3 block scale factors, a Random Hadamard Transform for the weight-gradient (WGRAD) GEMM, 2D weight scaling, and stochastic rounding. Only the MLP GEMMs are quantized to NVFP4.
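To make the block-scaling and stochastic-rounding ideas concrete, here is a minimal pure-Python sketch of "fake quantization" to a 4-bit E2M1 grid with one scale per 16 elements. It is an illustration under stated assumptions, not the TransformerEngine implementation: the function names are hypothetical, the E4M3 scale is only approximated by rounding to 3 mantissa bits, scaling is 1D, and the Random Hadamard Transform and any additional scaling levels are omitted.

```python
import math
import random

# Magnitudes representable by a 4-bit E2M1 float (sign is handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = FP4_GRID[-1]
BLOCK = 16  # micro-block size named in the recipe


def round_mantissa_3bit(x):
    """Approximate E4M3 storage of a scale by keeping 3 explicit mantissa bits.

    Simplification (assumption): E4M3's exponent range and subnormals are ignored.
    """
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)  # x == m * 2**e with 0.5 <= |m| < 1
    return math.ldexp(round(m * 16) / 16, e)


def stochastic_round_to_grid(mag, rng):
    """Round a non-negative magnitude to one of its two neighbouring grid points.

    The upper neighbour is chosen with probability proportional to proximity,
    so the result is unbiased in expectation for in-range values.
    """
    mag = min(mag, FP4_MAX)  # clip values that exceed the grid
    for lo, hi in zip(FP4_GRID, FP4_GRID[1:]):
        if lo <= mag <= hi:
            p_up = (mag - lo) / (hi - lo)
            return hi if rng.random() < p_up else lo
    return FP4_MAX


def quantize_block(block, rng):
    """Return (scale, codes) for one block; codes are signed grid values."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0.0, [0.0] * len(block)
    # Map the block's largest magnitude onto the largest FP4 value.
    scale = round_mantissa_3bit(amax / FP4_MAX)
    codes = [
        math.copysign(stochastic_round_to_grid(abs(v) / scale, rng), v)
        for v in block
    ]
    return scale, codes


def fake_quantize(values, rng=None):
    """Quantize and immediately dequantize a flat list, 16 elements at a time."""
    rng = rng or random.Random(0)
    out = []
    for i in range(0, len(values), BLOCK):
        scale, codes = quantize_block(values[i:i + BLOCK], rng)
        out.extend(scale * c for c in codes)
    return out


if __name__ == "__main__":
    data = [math.sin(0.37 * i) for i in range(32)]
    approx = fake_quantize(data)
    print(max(abs(a - b) for a, b in zip(data, approx)))
```

Because each block gets its own scale, one outlier only degrades the resolution of the 16 values that share its block, and stochastic rounding keeps the rounding error zero-mean rather than systematically biased.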
In practice
- Set "quantization=te_nvfp4" in MaxText when training on Blackwell.
- Use GB200/GB300 for a 1.31–1.73x LLM pretraining speedup over FP8.
- Try "quantization=te_nvfp4_no_rht", the variant without the Random Hadamard Transform, for minimal overhead.
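The settings above can be compared side by side with a small launcher helper. The sketch below only builds command strings; the two quantization values come from this summary, while the entry point `MaxText.train`, the config path `MaxText/configs/base.yml`, and the helper name `maxtext_command` are assumptions that should be checked against your MaxText checkout.

```python
import shlex

# Quantization values named in this summary.
RECIPES = ("te_nvfp4", "te_nvfp4_no_rht")


def maxtext_command(recipe, run_name,
                    entry="MaxText.train",              # assumed entry point
                    config="MaxText/configs/base.yml"):  # assumed config path
    """Build a shell-safe MaxText launch command with a quantization override."""
    if recipe not in RECIPES:
        raise ValueError(f"unknown recipe {recipe!r}; expected one of {RECIPES}")
    args = [
        "python3", "-m", entry, config,
        f"run_name={run_name}",
        f"quantization={recipe}",
    ]
    return shlex.join(args)


if __name__ == "__main__":
    # Print one command per recipe so the two runs can be launched and compared.
    for recipe in RECIPES:
        print(maxtext_command(recipe, run_name=f"llama_{recipe}"))
```

Running both variants with otherwise identical settings isolates the cost and benefit of the Random Hadamard Transform for a given model and cluster.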
Topics
- NVFP4
- LLM Pretraining
- NVIDIA Blackwell
- JAX
- MaxText
- Low-Precision Training
- TransformerEngine
Best for: Machine Learning Engineer, AI Scientist, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.