Train Models Faster with JAX and MaxText Using NVFP4 on NVIDIA Blackwell

Source: NVIDIA Technical Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, long

Summary

NVIDIA's NVFP4 training recipe, integrated into TransformerEngine for JAX pre-training, enables high-throughput 4-bit mixed-precision pre-training on NVIDIA Blackwell GPUs. This sub-byte format, supported natively on the NVIDIA GB300 Grace Blackwell Ultra Superchip, delivers up to 7x the GEMM throughput of native FP8 on NVIDIA Hopper. In benchmarks on Llama 3 8B and Llama 3.1 405B, NVFP4 trains 1.31–1.73x faster than FP8 baselines on GB200 and GB300 while closely matching FP8 accuracy: the Llama 3 8B loss curve tracks the FP8 baseline with a mean gap of only +0.026 nats over 10,000 steps. The MaxText NVFP4 recipe is available in the JAX-Toolbox GitHub repository.
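To put the +0.026 nat gap in more familiar terms, the short sketch below converts a loss gap into a relative perplexity change. It assumes the reported loss is a per-token cross-entropy in nats, so that perplexity is exp(loss); the function name is hypothetical and the calculation is illustrative, not part of the published benchmark.

```python
import math


def perplexity_ratio(loss_gap_nats: float) -> float:
    """Ratio of perplexities implied by a cross-entropy gap given in nats.

    Assumes perplexity = exp(loss), so a gap of d nats multiplies
    perplexity by exp(d).
    """
    return math.exp(loss_gap_nats)


ratio = perplexity_ratio(0.026)
print(f"{(ratio - 1) * 100:.2f}% higher perplexity")  # prints "2.63% higher perplexity"
```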

Key takeaway

For AI engineers and scientists pre-training large language models, NVFP4 on Blackwell offers a concrete speedup: 1.31–1.73x higher training throughput than FP8 for models such as Llama 3 8B and Llama 3.1 405B on GB200/GB300, without sacrificing accuracy. Integrating the MaxText NVFP4 recipe into a JAX-based workflow shortens training step times and lowers compute cost, so more model development fits within an existing budget.
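As a rough guide to what those speedups mean for wall-clock time, the sketch below turns a throughput speedup into a fractional reduction in step time. It assumes the same work per step in both precisions and that cost scales linearly with time; the function name is hypothetical.

```python
def step_time_saving(speedup: float) -> float:
    """Fraction of step time saved when throughput improves by `speedup`.

    A speedup of s means each step takes 1/s of the baseline time,
    so the saving is 1 - 1/s.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


for s in (1.31, 1.73):
    print(f"{s:.2f}x -> {step_time_saving(s):.1%} less time per step")
# prints:
# 1.31x -> 23.7% less time per step
# 1.73x -> 42.2% less time per step
```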

Key insights

NVFP4 enables significantly faster, accurate 4-bit LLM pre-training on NVIDIA Blackwell GPUs.

Method

The NVFP4 recipe combines 16-element micro-block scaling with E4M3 block scale factors, a Random Hadamard Transform on the weight-gradient (WGRAD) GEMM, 2D weight scaling, and stochastic rounding. Only the MLP GEMMs are quantized to NVFP4.
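To make two of these ingredients concrete, here is a minimal plain-Python sketch of 16-element block scaling with optional stochastic rounding. It is an illustration under stated assumptions, not the TransformerEngine implementation: the FP4 element grid is taken to be E2M1 (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6), the per-block scale is kept as an ordinary Python float rather than rounded to E4M3, the Random Hadamard Transform and 2D weight scaling are left out, and all function names are hypothetical.

```python
import random

# Representable magnitudes of an FP4 E2M1 element.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # elements sharing one scale factor


def round_to_grid(mag, rng=None):
    """Round a non-negative magnitude onto FP4_GRID.

    With rng=None this is round-to-nearest (ties round up, a simplification).
    With an rng it is stochastic rounding: round up with probability equal to
    the fractional position between the two neighbours, which is unbiased.
    """
    mag = min(mag, FP4_GRID[-1])  # clamp to the largest representable value
    for lo, hi in zip(FP4_GRID, FP4_GRID[1:]):
        if lo <= mag <= hi:
            frac = (mag - lo) / (hi - lo)
            if rng is None:
                return hi if frac >= 0.5 else lo
            return hi if rng.random() < frac else lo
    return FP4_GRID[-1]


def quantize_block(block, rng=None):
    """Return (scale, codes) for one block of up to BLOCK floats.

    The scale maps the block's largest magnitude onto the top of the grid.
    Simplification: the scale stays a Python float (no E4M3 rounding).
    """
    amax = max(abs(v) for v in block)
    scale = amax / FP4_GRID[-1] if amax > 0 else 1.0
    codes = [(1 if v >= 0 else -1) * round_to_grid(abs(v) / scale, rng)
             for v in block]
    return scale, codes


def quantize_dequantize(values, rng=None):
    """Quantize a flat list block by block and map it back to floats."""
    out = []
    for i in range(0, len(values), BLOCK):
        scale, codes = quantize_block(values[i:i + BLOCK], rng)
        out.extend(c * scale for c in codes)
    return out


# Usage: compare round-to-nearest and stochastic rounding on random data.
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(64)]
nearest = quantize_dequantize(x)
stochastic = quantize_dequantize(x, rng)
print("max abs error (nearest):   ", max(abs(a - b) for a, b in zip(x, nearest)))
print("max abs error (stochastic):", max(abs(a - b) for a, b in zip(x, stochastic)))
```

Because each block carries its own scale, one outlier only coarsens the 15 values that share its block rather than the whole tensor. Stochastic rounding gives a larger error per element than round-to-nearest, but the error averages to zero.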

Best for: Machine Learning Engineer, AI Scientist, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.