Using NVFP4 Low-Precision Model Training for Higher Throughput Without Losing Accuracy
Summary
NVIDIA research demonstrates that low-precision training formats, including FP8 with per-tensor current scaling (FP8-CS), Microscaling FP8 (MXFP8), and NVFP4, can significantly improve training throughput and memory efficiency for large transformer models without compromising model quality. Experiments on Llama 3 8B and an NVIDIA Research-8B model, trained on multi-hundred-billion token datasets using NVIDIA B200 GPUs and NeMo Megatron Bridge, show these formats achieving up to ~1.6x the throughput of BF16. NVFP4 shows slightly higher training loss, but all low-precision methods keep downstream task accuracy comparable to BF16, with MXFP8 performing marginally better due to its finer-grained scaling. Keeping selected layers in BF16 is crucial for NVFP4 stability; specifically, the final four transformer layers stay in BF16.
Key takeaway
For AI engineers scaling large transformer models, low-precision training with FP8-CS, MXFP8, or NVFP4 can significantly accelerate training and reduce memory footprint. NVIDIA NeMo Megatron Bridge provides production-ready recipes that reach up to 1.6x the throughput of BF16 with substantial memory savings while keeping accuracy comparable to BF16. When using NVFP4, keep selected layers in BF16 to ensure training stability.
Key insights
Low-precision training boosts throughput and saves memory for large models while preserving accuracy.
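To make the memory claim concrete, the arithmetic below estimates the average storage cost per tensor value in each format. It is a rough sketch, not a measurement from the article: the block sizes and scale widths (an 8-bit scale per 32 elements for MXFP8, an 8-bit scale per 16 elements for NVFP4) come from the published format definitions rather than from this summary, and the per-tensor scale is ignored as negligible. The function name `bits_per_element` is made up for illustration.

```python
def bits_per_element(element_bits, scale_bits=0, block_size=None):
    """Average storage per value: element bits plus the amortized block scale."""
    if block_size is None:
        return float(element_bits)
    return element_bits + scale_bits / block_size


# Assumed format parameters (see the note above); per-tensor scales are ignored.
formats = {
    "BF16": bits_per_element(16),
    "FP8-CS": bits_per_element(8),        # one scale for the whole tensor
    "MXFP8": bits_per_element(8, 8, 32),  # 8-bit scale shared by 32 elements
    "NVFP4": bits_per_element(4, 8, 16),  # 8-bit scale shared by 16 elements
}

for name, bits in formats.items():
    print(f"{name}: {bits:.2f} bits/value, {16 / bits:.2f}x smaller than BF16")
```

These ratios apply only to the quantized tensors themselves. End-to-end training memory also includes optimizer state and any tensors kept in higher precision, so the overall saving is typically smaller.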
Principles
- Reduced precision increases GPU operations per cycle.
- Finer-grained scaling improves low-precision performance.
- Selective BF16 layers stabilize aggressive quantization.
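The effect of finer-grained scaling can be illustrated with a small simulation. This is a simplified sketch, not NVIDIA's implementation: it fake-quantizes a list of values onto the 4-bit E2M1 grid, once with a single scale for the whole tensor and once with one scale per 16-element block, and compares the mean squared error. Scales are kept in full precision here, whereas real block formats also quantize the scales. All function names and the synthetic tensor with one outlier block are invented for illustration.

```python
import random

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_MAX = 6.0


def quantize_block(values):
    """Scale a block so its largest magnitude maps to E2M1_MAX, round each
    value to the nearest grid point, then scale back (fake quantization)."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return list(values)
    scale = amax / E2M1_MAX
    out = []
    for v in values:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append(mag * scale if v >= 0 else -mag * scale)
    return out


def fake_quantize(values, block_size):
    """Quantize `values` in consecutive blocks, each with its own scale."""
    out = []
    for start in range(0, len(values), block_size):
        out.extend(quantize_block(values[start:start + block_size]))
    return out


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def make_tensor(n_blocks=64, block=16, seed=0):
    """Synthetic data: the first block holds outliers ~100x larger than the rest."""
    rng = random.Random(seed)
    data = []
    for i in range(n_blocks):
        spread = 100.0 if i == 0 else 1.0
        data.extend(rng.gauss(0.0, spread) for _ in range(block))
    return data


tensor = make_tensor()
per_tensor_mse = mean_squared_error(tensor, fake_quantize(tensor, len(tensor)))
per_block_mse = mean_squared_error(tensor, fake_quantize(tensor, 16))
print(f"per-tensor scale MSE:       {per_tensor_mse:.4f}")
print(f"16-element block scale MSE: {per_block_mse:.4f}")
```

With a single scale, the outlier block sets the range for every value, so the small values collapse toward zero. With per-block scales, only the outlier block pays for its own range, which is the intuition behind the principle above.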
Method
Compare BF16 against FP8-CS, MXFP8, and NVFP4 on Llama 3 8B and Research-8B models trained with NeMo Megatron Bridge on NVIDIA B200 GPUs, evaluating convergence and downstream accuracy over 1 trillion tokens.
In practice
- Use NeMo Megatron Bridge for low-precision recipes.
- Configure NVFP4 with AdamW (ε = 1e-8), a learning rate decaying from 6e-4 to 6e-6, and a global batch size (GBS) of 768.
- Retain the final four transformer layers in BF16 for NVFP4.
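The settings above can be written down as a small, framework-independent sketch. This is not the NeMo Megatron Bridge API: the dictionary keys and the helpers `learning_rate` and `layer_precisions` are hypothetical names chosen for illustration. The summary gives only the start and end learning rates, so the cosine decay shape is an assumption, and the BF16 tail of four layers follows the "final four transformer layers" stated in the summary.

```python
import math

# Values quoted in the list above; the key names are illustrative only.
NVFP4_HPARAMS = {
    "optimizer": "adamw",
    "adam_eps": 1e-8,
    "lr_max": 6e-4,
    "lr_min": 6e-6,
    "global_batch_size": 768,
}


def learning_rate(step, total_steps, lr_max, lr_min):
    """Cosine decay from lr_max to lr_min (the decay shape is an assumption)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


def layer_precisions(num_layers, bf16_tail=4, low_precision="nvfp4"):
    """Label each transformer layer, keeping the last `bf16_tail` layers in BF16."""
    cutoff = max(num_layers - bf16_tail, 0)
    return ["bf16" if i >= cutoff else low_precision for i in range(num_layers)]


plan = layer_precisions(num_layers=32)  # Llama 3 8B has 32 transformer layers
print(plan[-5:])
print(learning_rate(0, 1000, NVFP4_HPARAMS["lr_max"], NVFP4_HPARAMS["lr_min"]))
```

In an actual run, the per-layer precision choice would be expressed through the training framework's own recipe or configuration options rather than a list like `plan`.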
Topics
- Low-Precision Training
- Transformer Models
- NVIDIA NeMo Megatron Bridge
- FP8
- NVFP4
Best for: AI Engineer, Deep Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.