Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe
Summary
A new study finds that current FP4 hardware paths and recipes for LLM pretraining suffer from "Shrinkage Bias." This covers NVIDIA Blackwell/Rubin-class and AMD MI350-series GPUs, which use E2M1 data elements. The bias stems from the geometric asymmetry of E2M1's representable bins: rounding errors are systematically negative, accumulate across layers, and are amplified by the Random Hadamard Transform (RHT), leading to training instability. In contrast, uniform grids such as E1M2/INT4 avoid this grid-geometry error and convert RHT's improved bucket utilization into higher quantization quality. The researchers propose UFP4, a uniform 4-bit training recipe that applies RHT to all three training GEMMs and restricts stochastic rounding to dY. In long-run pretraining on Dense 1.5B, MoE 7.9B, and MoE 124B models, UFP4 consistently achieved lower BF16-relative loss degradation than E2M1-based baselines. These results suggest that future accelerators should prioritize E1M2/INT4-style uniform 4-bit grids.
Key takeaway
If you are an AI hardware engineer designing next-generation accelerators, prioritize support for E1M2/INT4-style uniform 4-bit grids as first-class training primitives alongside E2M1. According to the study, E2M1's inherent "Shrinkage Bias" leads to training instability and loss degradation in LLM pretraining, as demonstrated on models up to MoE 124B. Supporting uniform grid formats in your designs would allow more stable, higher-quality quantization in future FP4 training.
Key insights
E2M1 FP4 training suffers from "Shrinkage Bias" caused by the geometric asymmetry of its grid, while uniform E1M2/INT4 grids offer more stable, higher-quality quantization.
Principles
- Non-uniform FP4 formats like E2M1 cause systematic negative rounding errors.
- Uniform grids (E1M2/INT4) bypass grid-geometry errors for better quantization.
- RHT amplifies shrinkage bias in E2M1 but improves bucket utilization in uniform grids.
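To make the grid-geometry point concrete, the following is a minimal NumPy sketch that compares the spacing of the E2M1 magnitudes with an evenly spaced 4-bit grid and measures the signed error of round-to-nearest. It is an illustration, not the study's code: the helper names (`quantize_nearest`, `mean_magnitude_error`), the per-tensor absmax scaling, the Gaussian test input, and the exact E1M2 magnitudes (one exponent bit with subnormals, giving steps of 0.25) are all assumptions made here.

```python
import numpy as np

# Non-negative magnitudes of E2M1 (the sign is a separate bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
# Assumption: E1M2 with subnormals gives evenly spaced magnitudes, the same
# shape as a sign-magnitude INT4 grid (absolute scale is absorbed by scaling).
UNIFORM_GRID = np.arange(8, dtype=np.float64) * 0.25


def quantize_nearest(x, grid):
    """Round to the nearest grid magnitude after per-tensor absmax scaling.

    Ties go to the smaller magnitude (argmin picks the first match); real
    hardware may use a different tie rule.
    """
    x = np.asarray(x, dtype=np.float64)
    amax = np.max(np.abs(x))
    scale = amax / grid[-1] if amax > 0 else 1.0
    mag = np.abs(x) / scale
    idx = np.argmin(np.abs(mag[..., None] - grid), axis=-1)
    return np.sign(x) * grid[idx] * scale


def mean_magnitude_error(x, grid):
    """Signed mean of |q(x)| - |x|; a negative value means magnitudes shrank."""
    q = quantize_nearest(x, grid)
    return float(np.mean(np.abs(q) - np.abs(x)))


# Hypothetical test input: a Gaussian tensor standing in for activations.
rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
print("E2M1 bin widths:   ", np.diff(E2M1_GRID))
print("uniform bin widths:", np.diff(UNIFORM_GRID))
print("E2M1 signed error:   ", mean_magnitude_error(x, E2M1_GRID))
print("uniform signed error:", mean_magnitude_error(x, UNIFORM_GRID))
```

The bin widths show the asymmetry directly: E2M1 spacing grows from 0.5 near zero to 2 at the top of the range, while the uniform grid keeps a constant step. The signed-error numbers depend on the input distribution and scaling chosen here, so treat them as a way to probe the effect rather than a reproduction of the study's measurements.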
Method
The UFP4 recipe applies Random Hadamard Transform (RHT) to all three training GEMMs and restricts stochastic rounding solely to dY for uniform 4-bit training.
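The two building blocks named in the method can be sketched in a few lines of NumPy. This is a hedged illustration rather than the UFP4 implementation: the function names, the Sylvester construction of the Hadamard matrix, and the integer grid used for rounding are assumptions made here. The sketch shows two properties that make the techniques usable in a GEMM: the RHT is orthogonal, so rotating both operands along the contraction axis leaves the exact product unchanged, and stochastic rounding is unbiased in expectation.

```python
import numpy as np


def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.ones((1, 1))
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h


def random_hadamard(n, rng):
    """Orthogonal RHT matrix: random row sign flips times a normalized Hadamard."""
    signs = rng.choice([-1.0, 1.0], size=n)
    return signs[:, None] * hadamard(n) / np.sqrt(n)


def stochastic_round(x, rng):
    """Round to the integer grid, rounding up with probability frac(x)."""
    x = np.asarray(x, dtype=np.float64)
    low = np.floor(x)
    return low + (rng.random(x.shape) < (x - low))


rng = np.random.default_rng(0)
r = random_hadamard(16, rng)
a = rng.standard_normal((4, 16))
b = rng.standard_normal((16, 8))

# Rotating both operands along the contraction axis preserves the product.
print(np.allclose((a @ r) @ (r.T @ b), a @ b))

# Averaged over many draws, stochastic rounding recovers the input value.
samples = stochastic_round(np.full(200_000, 2.3), rng)
print(samples.mean())
```

In a quantized GEMM the rotation is applied before quantization, so the product is no longer exact; the rotation only changes how the values are distributed over the grid's buckets.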
In practice
- Consider E1M2/INT4 for 4-bit LLM pretraining.
- Implement RHT across all training GEMMs.
- Restrict stochastic rounding to dY in 4-bit training.
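One way these three recommendations could fit together for a single linear layer is sketched below with simulated ("fake") quantization in NumPy. Everything beyond the three bullet points is an assumption for illustration: the names `rht`, `fake_quant`, `quant_gemm`, and `linear_forward_backward`, the symmetric integer grid with magnitudes 0 to 7, per-tensor absmax scaling, and the choice to rotate along the contraction axis. The study's actual scaling granularity and kernel layout are not described in this summary.

```python
import numpy as np

LEVELS = 7  # assumed largest magnitude on a sign-magnitude uniform 4-bit grid


def rht(n, rng):
    """Random Hadamard matrix of size n (power of two), normalized to be orthogonal."""
    h = np.ones((1, 1))
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    if h.shape[0] != n:
        raise ValueError("contraction dimension must be a power of two")
    return rng.choice([-1.0, 1.0], size=n)[:, None] * h / np.sqrt(n)


def fake_quant(x, rng=None):
    """Quantize to the uniform grid and dequantize.

    With rng=None the rounding is round-to-nearest; passing a generator
    switches to stochastic rounding.
    """
    amax = np.max(np.abs(x))
    scale = amax / LEVELS if amax > 0 else 1.0
    y = x / scale
    if rng is None:
        q = np.rint(y)
    else:
        low = np.floor(y)
        q = low + (rng.random(y.shape) < (y - low))
    return np.clip(q, -LEVELS, LEVELS) * scale


def quant_gemm(a, b, rng, stochastic_a=False, stochastic_b=False):
    """Approximate a @ b: rotate both operands with a shared RHT, then quantize."""
    r = rht(a.shape[1], rng)
    qa = fake_quant(a @ r, rng if stochastic_a else None)
    qb = fake_quant(r.T @ b, rng if stochastic_b else None)
    return qa @ qb


def linear_forward_backward(x, w, dy, rng):
    """All three training GEMMs use RHT; only dY is rounded stochastically."""
    y = quant_gemm(x, w, rng)                          # forward: Y  = X W
    dx = quant_gemm(dy, w.T, rng, stochastic_a=True)   # backward: dX = dY W^T
    dw = quant_gemm(x.T, dy, rng, stochastic_b=True)   # backward: dW = X^T dY
    return y, dx, dw


rng = np.random.default_rng(0)
x = rng.standard_normal((64, 32))    # (batch, in_features)
w = rng.standard_normal((32, 16))    # (in_features, out_features)
dy = rng.standard_normal((64, 16))   # gradient with respect to Y
y, dx, dw = linear_forward_backward(x, w, dy, rng)
print(y.shape, dx.shape, dw.shape)
```

Because each GEMM contracts over a different axis (input features, output features, and batch), this sketch needs all three sizes to be powers of two; a production kernel would more likely apply the transform in fixed-size blocks.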
Topics
- LLM Pretraining
- FP4 Quantization
- Shrinkage Bias
- E2M1 Format
- E1M2/INT4 Grids
- Random Hadamard Transform
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.