NVIDIA Introduces a 4-Bit Pretraining Methodology Using NVFP4, Validated on a 12B Hybrid Mamba-Transformer at 10T Token Horizon
Summary
NVIDIA has introduced a 4-bit pretraining methodology built on its NVFP4 format and validated it on Nemotron-Nano-12B-v2-Base, a 12B hybrid Mamba-Transformer model trained on 10 trillion tokens. This marks the first public demonstration of 4-bit pretraining at multi-trillion-token scale; earlier 4-bit formats typically failed at longer token horizons. NVFP4 uses 16-element blocks, an E4M3 scale per block, and an FP32 scale per tensor, and it reaches downstream accuracy closely comparable to an FP8 baseline. Key results include MMLU-Pro 5-shot scores of 62.58% (NVFP4) versus 62.62% (FP8), and validation loss within 1% of FP8 during the stable training phase.
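To make the format description concrete, here is a minimal pure-Python sketch of two-level block scaling with 16-element blocks, a block scale rounded to an E4M3-like grid, and one per-tensor scale kept as an ordinary float. The function names, the scale-selection rule (mapping each block's largest magnitude to the FP4 maximum of 6), and the simplified E4M3 rounding are assumptions made for illustration, not NVIDIA's implementation.

```python
import math

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # FP4 (E2M1) magnitudes
FP4_MAX = 6.0      # largest E2M1 magnitude
E4M3_MAX = 448.0   # largest finite E4M3 value
BLOCK = 16         # number of elements sharing one block scale


def round_to_grid(x):
    """Round x to the nearest E2M1 value (ties go to the smaller magnitude)."""
    mag = min(E2M1_GRID, key=lambda g: abs(g - abs(x)))
    return -mag if x < 0 else mag


def round_to_e4m3(x):
    """Simplified rounding of a non-negative scale to an E4M3-like grid."""
    if x <= 0.0:
        return 0.0
    x = min(x, E4M3_MAX)
    _, e = math.frexp(x)     # x lies in [2**(e-1), 2**e)
    e = max(e, -5)           # below 2**-6, fall back to subnormal spacing
    step = 2.0 ** (e - 4)    # 3 mantissa bits give 8 steps per binade
    return round(x / step) * step


def quantize_nvfp4(values):
    """Return (fp4_codes, block_scales, tensor_scale) for a flat list."""
    assert len(values) % BLOCK == 0, "length must be a multiple of 16"
    amax = max(abs(v) for v in values)
    # Assumed rule: choose the tensor scale so the largest block scale
    # lands at the top of the E4M3 range.
    tensor_scale = amax / (FP4_MAX * E4M3_MAX) if amax > 0 else 1.0
    codes, block_scales = [], []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        block_amax = max(abs(v) for v in block)
        scale = round_to_e4m3(block_amax / FP4_MAX / tensor_scale)
        block_scales.append(scale)
        denom = scale * tensor_scale
        codes.extend(round_to_grid(v / denom) if denom > 0 else 0.0
                     for v in block)
    return codes, block_scales, tensor_scale


def dequantize_nvfp4(codes, block_scales, tensor_scale):
    """Rebuild approximate values from codes and the two scale levels."""
    return [c * block_scales[i // BLOCK] * tensor_scale
            for i, c in enumerate(codes)]


if __name__ == "__main__":
    data = [((i * 37) % 64 - 32) / 8.0 for i in range(32)]
    codes, scales, t_scale = quantize_nvfp4(data)
    restored = dequantize_nvfp4(codes, scales, t_scale)
    worst = max(abs(a - b) for a, b in zip(data, restored))
    print("block scales:", scales)
    print("worst abs error:", worst)
```

Each stored element is one of only eight magnitudes plus a sign, so the block and tensor scales carry most of the dynamic range. Small blocks limit how far a single outlier can degrade the resolution of its neighbours.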
Key takeaway
For AI engineers and research scientists developing large language models, NVIDIA's NVFP4 methodology shows that 4-bit pretraining is feasible at multi-trillion-token scale without significant accuracy loss relative to FP8. Consider evaluating the full recipe, including selective BF16 layers and Random Hadamard Transforms, in your own pretraining pipelines; it may reduce memory footprint and speed up training, particularly for hybrid Mamba-Transformer architectures.
Key insights
NVIDIA reports the first public multi-trillion-token 4-bit pretraining run, with a hybrid Mamba-Transformer reaching accuracy closely comparable to FP8.
Principles
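- 4-bit pretraining is viable at a 10-trillion-token horizon.
- Hybrid Mamba-Transformer architectures can be pretrained with 4-bit quantization.
- Stability at 4 bits depends on specific techniques: selective BF16 layers, Random Hadamard Transforms, 2D weight scaling, and stochastic rounding.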
Method
The pretraining recipe combines four components: keeping ~16% of linear layers in BF16, applying 16x16 Random Hadamard Transforms to Wgrad inputs, scaling weights in 2D 16x16 blocks, and using stochastic rounding on gradients. Ablations confirm that all components are essential.
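As an illustration of the stochastic rounding component, the sketch below rounds a value to one of its two neighbouring FP4 (E2M1) magnitudes, choosing the upper one with probability proportional to how close the value sits to it. The function name and the use of Python's `random.Random` are assumptions made for this example; the actual recipe applies the idea to scaled gradient tensors on hardware.

```python
import bisect
import random

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # FP4 (E2M1) magnitudes


def stochastic_round_fp4(x, rng):
    """Round x to a neighbouring E2M1 value so that E[result] == x.

    Values beyond the largest magnitude (6.0) are clipped first.
    """
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E2M1_GRID[-1])
    hi_idx = bisect.bisect_left(E2M1_GRID, mag)  # first grid value >= mag
    if E2M1_GRID[hi_idx] == mag:
        return sign * mag                        # already representable
    lo, hi = E2M1_GRID[hi_idx - 1], E2M1_GRID[hi_idx]
    p_up = (mag - lo) / (hi - lo)                # closer to hi -> more likely
    return sign * (hi if rng.random() < p_up else lo)


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round_fp4(2.25, rng) for _ in range(20000)]
    print("mean of rounded samples:", sum(samples) / len(samples))
```

Round-to-nearest would map 2.25 to 2.0 every time, a consistent error of -0.25. The stochastic version returns 2.0 or 3.0 in proportions that average out to the input, which keeps small gradient contributions from being systematically lost.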
In practice
- Use NVFP4 for large-scale model pretraining.
- Keep the most sensitive linear layers in BF16 (~16% in the reported recipe).
- Apply 16x16 Random Hadamard Transforms to Wgrad inputs.
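The Hadamard step in the list above can be sketched as follows. The weight-gradient product contracts its two inputs over a shared dimension, so applying the same orthogonal 16x16 transform to both along that dimension leaves the exact product unchanged while spreading outliers across the tile before quantization. The helper names, the Sylvester construction, and the random sign vector are assumptions made for illustration, not the article's stated implementation.

```python
import math
import random


def hadamard(n):
    """Build an n x n Hadamard matrix (Sylvester construction, n a power of 2)."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def random_hadamard(n, rng):
    """Hadamard matrix with random column signs, scaled to be orthogonal."""
    signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    norm = 1.0 / math.sqrt(n)
    return [[v * s * norm for v, s in zip(row, signs)] for row in hadamard(n)]


def apply_transform(matrix, vec):
    """Matrix-vector product for one 16-element tile."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


if __name__ == "__main__":
    rng = random.Random(0)
    H = random_hadamard(16, rng)

    # Two operands that would be contracted against each other in Wgrad.
    a = [rng.gauss(0.0, 1.0) for _ in range(16)]
    b = [rng.gauss(0.0, 1.0) for _ in range(16)]
    print("dot before:", dot(a, b))
    print("dot after: ", dot(apply_transform(H, a), apply_transform(H, b)))

    # A single outlier is spread evenly over the whole tile.
    spike = [0.0] * 16
    spike[3] = 16.0
    print("max |x| before:", max(abs(v) for v in spike))
    print("max |x| after: ", max(abs(v) for v in apply_transform(H, spike)))
```

Because the transform is orthogonal, it cancels out in the exact product; the benefit appears only once both transformed operands are quantized, since a flatter distribution wastes less of the 4-bit range on a single large value.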
Topics
- NVFP4
- 4-bit Pretraining
- Mamba-Transformer
- Large Language Models
- Mixed-Precision Training
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.