Train Models Faster with JAX and MaxText Using NVFP4 on NVIDIA Blackwell

2026-06-08 · Source: NVIDIA Technical Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, long

Summary

NVIDIA's NVFP4 training recipe, integrated into TransformerEngine for JAX, enables high-throughput 4-bit mixed-precision pretraining on NVIDIA Blackwell GPUs. This sub-byte format, supported natively on the NVIDIA GB300 Grace Blackwell Ultra Superchip, delivers up to 7x the GEMM throughput of native FP8 on NVIDIA Hopper. In benchmarks on Llama 3 8B and Llama 3.1 405B, NVFP4 runs 1.31–1.73x faster than FP8 baselines on GB200 and GB300 with no measurable accuracy loss: the Llama 3 8B loss curve tracks the FP8 baseline with a mean gap of only +0.026 nats over 10,000 steps. The MaxText NVFP4 recipe is available in the JAX-Toolbox GitHub repository.

Key takeaway

For AI engineers and scientists pretraining large language models, NVFP4 on Blackwell delivers 1.31–1.73x higher training throughput than FP8 for Llama 3 8B and Llama 3.1 405B on GB200/GB300, without sacrificing accuracy. Integrating the MaxText NVFP4 recipe into a JAX-based workflow shortens training step times and lowers compute cost, so more model development fits within an existing budget.
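A throughput speedup of s shortens each training step by a fraction 1 - 1/s. The sketch below works that arithmetic for the reported range, assuming the 1.31–1.73x figures translate directly into step time; the function name is illustrative and not part of MaxText.

```python
def step_time_reduction(speedup: float) -> float:
    """Fraction of step time saved when throughput improves by `speedup`."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


for s in (1.31, 1.73):
    print(f"{s:.2f}x speedup -> {step_time_reduction(s):.1%} shorter step time")
# 1.31x speedup -> 23.7% shorter step time
# 1.73x speedup -> 42.2% shorter step time
```

Under that assumption, the same token budget needs roughly 24% to 42% less wall-clock time than the FP8 baseline.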

Key insights

NVFP4 enables 4-bit LLM pretraining on NVIDIA Blackwell GPUs that runs 1.31–1.73x faster than FP8 while tracking the FP8 loss curve.

Principles

Sub-byte precision boosts LLM training throughput.
Hardware-native low-precision formats yield substantial gains.
Restricting quantization to the MLP layers maintains accuracy.

Method

The NVFP4 recipe quantizes only the MLP GEMMs to NVFP4. It combines 16-element micro-block scaling, E4M3 block scale factors, a Random Hadamard Transform on the weight-gradient (WGRAD) GEMM, 2D weight scaling, and stochastic rounding.
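To make the block-scaling and stochastic-rounding ingredients concrete, here is a minimal JAX sketch that simulates them in float32. It is not TransformerEngine's implementation: it assumes the 4-bit element type is E2M1 (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6), omits the 2D weight scaling and any tensor-level scale, and the function name fake_quantize_nvfp4 is hypothetical.

```python
import jax
import jax.numpy as jnp

# Magnitudes representable by a 4-bit E2M1 float (assumed element format;
# the sign is handled separately).
E2M1_GRID = jnp.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=jnp.float32)
BLOCK = 16  # elements that share one scale factor


def fake_quantize_nvfp4(x, key, stochastic=True):
    """Round x onto the E2M1 grid with per-block E4M3 scales, then dequantize."""
    assert x.size % BLOCK == 0, "sketch requires a multiple of 16 elements"
    blocks = x.astype(jnp.float32).reshape(-1, BLOCK)
    amax = jnp.max(jnp.abs(blocks), axis=1, keepdims=True)
    # Map each block's largest magnitude to 6.0, the top of the grid,
    # and round the scale itself to FP8 E4M3.
    scale = (amax / E2M1_GRID[-1]).astype(jnp.float8_e4m3fn).astype(jnp.float32)
    scale = jnp.where(scale == 0.0, 1.0, scale)  # guard all-zero or underflowed blocks
    # Clip in case rounding the scale down pushed a value past 6.0.
    mag = jnp.clip(jnp.abs(blocks) / scale, 0.0, E2M1_GRID[-1])
    # Bracket each magnitude between two neighboring grid points.
    hi_idx = jnp.clip(jnp.searchsorted(E2M1_GRID, mag, side="left"), 1, E2M1_GRID.size - 1)
    lo, hi = E2M1_GRID[hi_idx - 1], E2M1_GRID[hi_idx]
    frac = (mag - lo) / (hi - lo)
    if stochastic:
        # Round up with probability equal to the fractional position,
        # which makes the rounding unbiased in expectation.
        round_up = jax.random.uniform(key, mag.shape) < frac
    else:
        round_up = frac >= 0.5  # round to nearest
    q = jnp.where(round_up, hi, lo)
    return (jnp.sign(blocks) * q * scale).reshape(x.shape)


x = jax.random.normal(jax.random.PRNGKey(0), (4, 64))
xq = fake_quantize_nvfp4(x, jax.random.PRNGKey(1))
print("mean abs quantization error:", float(jnp.mean(jnp.abs(xq - x))))
```

Because each group of 16 values carries its own scale, a single outlier only degrades the resolution of its own block rather than the whole tensor.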

In practice

Set "quantization=te_nvfp4" in MaxText when training on Blackwell.
Use GB200/GB300 for a 1.31–1.73x LLM pretraining speedup over FP8.
Try "te_nvfp4_no_rht", the variant without the Random Hadamard Transform, for minimal overhead.
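The Random Hadamard Transform that the "te_nvfp4_no_rht" option leaves out can be illustrated with a small JAX sketch. It assumes a 16x16 Sylvester Hadamard matrix with random sign flips, applied to 16-element tiles along the contraction axis of both GEMM operands; the helper names are hypothetical, and the actual TransformerEngine kernels may differ in tile size and layout.

```python
import jax
import jax.numpy as jnp


def hadamard(n):
    """Orthonormal n x n Hadamard matrix (Sylvester construction, n a power of two)."""
    base = jnp.array([[1.0, 1.0], [1.0, -1.0]], dtype=jnp.float32)
    h = jnp.ones((1, 1), dtype=jnp.float32)
    while h.shape[0] < n:
        h = jnp.kron(h, base)
    return h / (n ** 0.5)


def random_hadamard(key, n=16):
    """diag(random signs) @ H, which is still an orthogonal matrix."""
    signs = jnp.where(jax.random.bernoulli(key, 0.5, (n,)), 1.0, -1.0)
    return signs[:, None] * hadamard(n)


def rotate_last_axis(x, r):
    """Apply r to every contiguous tile of the last axis."""
    n = r.shape[0]
    tiles = x.reshape(*x.shape[:-1], -1, n)
    return (tiles @ r).reshape(x.shape)


r = random_hadamard(jax.random.PRNGKey(0))
a = jax.random.normal(jax.random.PRNGKey(1), (8, 64))
b = jax.random.normal(jax.random.PRNGKey(2), (4, 64))

# Rotating both operands along the contraction axis leaves the product unchanged.
a_rot, b_rot = rotate_last_axis(a, r), rotate_last_axis(b, r)
print(bool(jnp.allclose(a_rot @ b_rot.T, a @ b.T, atol=1e-3)))

# A lone outlier of 8.0 is spread evenly across its 16-element tile.
spike = jnp.zeros(16).at[0].set(8.0)
spread = rotate_last_axis(spike, r)
print(float(jnp.max(jnp.abs(spread))))  # 2.0
```

Because the rotation is orthogonal it cancels in the matrix product, while spreading outliers across a tile makes the values easier to represent in 4 bits. The extra transform costs compute, which is the overhead the no-RHT variant avoids.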

Topics

NVFP4
LLM Pretraining
NVIDIA Blackwell
JAX
MaxText
Low-Precision Training
TransformerEngine

Code references

NVIDIA/JAX-Toolbox

Best for: Machine Learning Engineer, AI Scientist, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.