Using NVFP4 Low-Precision Model Training for Higher Throughput Without Losing Accuracy

2026-02-23 · Source: NVIDIA Technical Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, medium

Summary

NVIDIA research demonstrates that low-precision training formats, including 8-bit floating point per-tensor current scaling (FP8-CS), Mixed Precision FP8 (MXFP8), and NVFP4, can significantly enhance training throughput and memory efficiency for large transformer models without compromising model quality. Experiments conducted on Llama 3 8B and an NVIDIA Research-8B model, trained on multi-hundred-billion token datasets using NVIDIA B200 GPUs and NeMo Megatron Bridge, show these formats achieve up to ~1.6x higher throughput compared to BF16. While NVFP4 exhibits slightly higher training loss, all low-precision methods maintain downstream task accuracy comparable to BF16, with MXFP8 performing marginally better due to finer-grained scaling. Selective BF16 layers are crucial for NVFP4 stability, specifically keeping the final four transformer layers in BF16.

Key takeaway

For AI Engineers scaling large transformer models, adopting low-precision training with formats like FP8, MXFP8, or NVFP4 can significantly accelerate training and reduce memory footprint. You should explore NVIDIA NeMo Megatron Bridge's production-ready recipes to achieve up to 1.6x throughput gains and substantial memory savings, ensuring your models maintain BF16-comparable accuracy. Consider using selective BF16 layers, especially for NVFP4, to ensure training stability.

Key insights

Low-precision training boosts throughput and saves memory for large models while preserving accuracy.

Principles

Reduced precision increases GPU operations per cycle.
Finer-grained scaling improves low-precision performance.
Selective BF16 layers stabilize aggressive quantization.

Method

Compare BF16 against FP8-CS, MXFP8, and NVFP4 on Llama 3 8B and Research-8B models, evaluating convergence and downstream accuracy on 1 trillion tokens using NeMo Megatron Bridge on NVIDIA B200 GPUs.

In practice

Use NeMo Megatron Bridge for low-precision recipes.
Configure NVFP4 with AdamW ε=1e-8, LR=6e-4 → 6e-6, GBS=768.
Retain final transformer layers in BF16 for NVFP4.

Topics

Low-Precision Training
Transformer Models
NVIDIA NeMo Megatron Bridge
FP8
NVFP4

Code references

Best for: AI Engineer, Deep Learning Engineer, MLOps Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.