NVIDIA Introduces a 4-Bit Pretraining Methodology Using NVFP4, Validated on a 12B Hybrid Mamba-Transformer at 10T Token Horizon

2026-05-18 · Source: Machine Learning ML & Generative AI News · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, quick

Summary

NVIDIA has introduced a 4-bit pretraining methodology built on its NVFP4 format and validated it on Nemotron-Nano-12B-v2-Base, a 12B hybrid Mamba-Transformer model trained on 10 trillion tokens. This is the first public demonstration of 4-bit pretraining at multi-trillion-token scale; earlier 4-bit formats typically failed at longer token horizons. NVFP4 uses 16-element blocks, E4M3 block scales, and an FP32 per-tensor scale, and reaches downstream accuracy close to an FP8 baseline. Key results include MMLU-Pro 5-shot scores of 62.58% (NVFP4) versus 62.62% (FP8) and validation loss within 1% of FP8 during the stable training phase.
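The two-level scaling described above can be illustrated with a small simulation. This is a minimal sketch, not NVIDIA's implementation: it assumes the 4-bit elements use the E2M1 value grid, picks the per-tensor scale so the largest block scale lands at the E4M3 maximum of 448, resolves rounding ties toward the smaller magnitude, and handles only the normal range of E4M3. The function names are hypothetical.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (assumed element format).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = 6.0
E4M3_MAX = 448.0
BLOCK = 16  # block size stated in the summary


def round_e4m3(x):
    """Round a positive value to 4 significant bits, clamped to 448.

    Simplification: subnormals and special encodings are ignored.
    """
    if x <= 0.0:
        return 0.0
    x = min(x, E4M3_MAX)
    m, e = math.frexp(x)        # x = m * 2**e with 0.5 <= m < 1
    m = round(m * 16) / 16      # 1 implicit + 3 explicit mantissa bits
    return min(math.ldexp(m, e), E4M3_MAX)


def nearest_fp4(x):
    """Snap x to the nearest E2M1 value; out-of-range inputs saturate at 6."""
    mag = min(E2M1_GRID, key=lambda g: abs(g - abs(x)))
    return math.copysign(mag, x)


def fake_quantize_nvfp4(values):
    """Quantize then dequantize a flat list whose length is a multiple of 16."""
    assert len(values) % BLOCK == 0
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return [0.0] * len(values)
    # Assumed choice: one full-precision scale per tensor.
    tensor_scale = amax / (FP4_MAX * E4M3_MAX)
    out = []
    for i in range(0, len(values), BLOCK):
        block = values[i:i + BLOCK]
        block_amax = max(abs(v) for v in block)
        # One E4M3-rounded scale per 16-element block.
        block_scale = round_e4m3(block_amax / FP4_MAX / tensor_scale)
        step = block_scale * tensor_scale
        if step == 0.0:
            out.extend([0.0] * BLOCK)
            continue
        out.extend(nearest_fp4(v / step) * step for v in block)
    return out


values = [0.01 * i for i in range(-16, 16)]
dequantized = fake_quantize_nvfp4(values)
print(max(abs(q - v) for q, v in zip(dequantized, values)))
```

Each block ends up with at most eight distinct magnitudes, so the block scale determines how finely that block's values are resolved. That is why a small block size limits the damage a single outlier can do.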

Key takeaway

For AI engineers and research scientists developing large language models, NVIDIA's NVFP4 methodology shows that 4-bit pretraining is feasible at multi-trillion-token scale without significant accuracy loss relative to FP8. Consider evaluating the full recipe, including selective BF16 and Random Hadamard Transforms, in your pretraining pipelines; it may reduce memory footprint and speed up training, especially for hybrid Mamba-Transformer architectures.
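As a rough guide to the storage side of that claim, the arithmetic below counts the bits per value implied by the format: 4-bit elements, one 8-bit E4M3 scale per 16-element block, and one 32-bit scale per tensor. It is a back-of-the-envelope sketch with a hypothetical function name. It covers only tensors actually stored in NVFP4, and says nothing about layers kept in BF16, optimizer state, or end-to-end training memory.

```python
def nvfp4_bits_per_value(num_values, block=16, elem_bits=4,
                         block_scale_bits=8, tensor_scale_bits=32):
    """Average storage cost per value, including scale overhead."""
    num_blocks = -(-num_values // block)  # ceiling division
    total_bits = (num_values * elem_bits
                  + num_blocks * block_scale_bits
                  + tensor_scale_bits)
    return total_bits / num_values


# Example: a 4096 x 4096 weight matrix (size chosen for illustration).
bits = nvfp4_bits_per_value(4096 * 4096)
print(f"{bits:.4f} bits per value, {bits / 8:.4f} of FP8 storage")
```

For large tensors the result is very close to 4.5 bits per value, since the per-tensor scale is amortized over millions of elements.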

Key insights

NVIDIA achieved the first public multi-trillion-token 4-bit pretraining run, closely matching FP8 accuracy on a 12B hybrid Mamba-Transformer.

Principles

4-bit pretraining is viable at multi-trillion-token scale.
Hybrid Mamba-Transformer architectures can be pretrained with 4-bit quantization.
Stability at 4 bits depends on the full recipe, not on the number format alone.

Method

The pretraining recipe combines four components: keeping ~16% of linear layers in BF16, applying 16x16 Random Hadamard Transforms to the inputs of the weight-gradient (Wgrad) computation, scaling weights in 2D 16x16 blocks, and using stochastic rounding on gradients. Ablations confirm that all four components are essential.
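Stochastic rounding, one of the components listed above, can be sketched as follows. This is an illustrative version under stated assumptions, not the recipe's kernel: it rounds onto the E2M1 value grid, assumes the input has already been divided by its block scale, and uses a hypothetical function name. A value between two grid points rounds up with probability proportional to its distance from the lower point, so the result is unbiased in expectation.

```python
import bisect
import random

# Magnitudes representable by a 4-bit E2M1 element (assumed element format).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def stochastic_round_fp4(x, rng):
    """Round x to a neighbouring E2M1 value, unbiased in expectation."""
    mag = min(abs(x), E2M1_GRID[-1])             # saturate at 6
    hi_idx = bisect.bisect_left(E2M1_GRID, mag)  # first grid point >= mag
    hi = E2M1_GRID[hi_idx]
    if hi == mag:
        result = mag                             # already representable
    else:
        lo = E2M1_GRID[hi_idx - 1]
        p_up = (mag - lo) / (hi - lo)            # closer to hi => more likely up
        result = hi if rng.random() < p_up else lo
    return -result if x < 0 else result


rng = random.Random(0)
samples = [stochastic_round_fp4(2.4, rng) for _ in range(20000)]
print(sum(samples) / len(samples))  # close to 2.4 on average
```

Round-to-nearest would map 2.4 to 2.0 every time, which introduces a systematic bias that can accumulate over many gradient updates. Stochastic rounding removes that bias at the cost of extra variance.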

In practice

Use NVFP4 for large-scale model pretraining.
Keep the most sensitive linear layers in BF16 (about 16% of them in this recipe).
Apply 16x16 Random Hadamard Transforms to Wgrad inputs.
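To show why a Random Hadamard Transform helps before 4-bit quantization, here is a minimal sketch. It assumes a Sylvester-construction 16x16 Hadamard matrix combined with a random sign vector; how the transform is wired into the weight-gradient GEMM is not specified in this summary, and the function names are hypothetical. The transform is orthogonal, so applying it to both operands along the shared dimension leaves their dot product unchanged while spreading a single outlier across all 16 positions.

```python
import random


def hadamard(n):
    """Sylvester construction; n must be a power of two."""
    H = [[1]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H


def random_hadamard_transform(x, signs, H):
    """Flip signs, then multiply by the normalized Hadamard matrix."""
    n = len(x)
    scale = n ** -0.5
    flipped = [xi * si for xi, si in zip(x, signs)]
    return [scale * sum(flipped[k] * H[k][j] for k in range(n))
            for j in range(n)]


rng = random.Random(0)
H16 = hadamard(16)
signs = [rng.choice((-1, 1)) for _ in range(16)]

# A block dominated by one outlier: the transform spreads it evenly.
outlier = [0.0] * 15 + [8.0]
spread = random_hadamard_transform(outlier, signs, H16)
print(max(abs(v) for v in outlier), max(abs(v) for v in spread))

# Dot products along the transformed dimension are preserved.
a = [rng.uniform(-1, 1) for _ in range(16)]
b = [rng.uniform(-1, 1) for _ in range(16)]
ta = random_hadamard_transform(a, signs, H16)
tb = random_hadamard_transform(b, signs, H16)
print(sum(p * q for p, q in zip(a, b)), sum(p * q for p, q in zip(ta, tb)))
```

With a smaller block maximum, the block scale shrinks and the remaining values fall on a finer effective grid, which is the usual motivation for pairing such transforms with narrow formats.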

Topics

NVFP4
4-bit Pretraining
Mamba-Transformer
Large Language Models
Mixed-Precision Training

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.