When Flat Minima Fail: Characterizing INT4 Quantization Collapse After FP32 Convergence
Summary
A new study finds that a common post-training quantization (PTQ) assumption, that a model converged in FP32 is ready to quantize, fails specifically for INT4. Applying a calibration-free per-group INT4 probe to 154 Pythia-160m training checkpoints, the researchers identify three phases: an initial phase of joint improvement, a meta-stable plateau in which FP32 perplexity stagnates while the INT4 gap stays bounded, and an explosive divergence phase in which the INT4 gap grows from 11% to 517% while FP32 perplexity barely changes. The divergence begins when FP32 perplexity converges, which suggests that post-convergence weight updates, not learning rate decay alone, are the cause. INT8 quantization is unaffected, which points to the coarseness of the 16-level INT4 grid as the mechanism; weight outlier accumulation is ruled out. An Oscillatory Lock-In learning rate schedule reduced the INT4 gap by 2.2 percentage points on average relative to cosine continuation.
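To make the gap figures concrete, here is a minimal sketch of one plausible way to compute such a gap. It assumes the INT4 gap is the relative increase in perplexity over the FP32 baseline, expressed in percent; the summary does not state the exact definition, and the function name and the baseline perplexity of 20.0 are hypothetical.

```python
def quantization_gap_percent(ppl_fp32: float, ppl_quant: float) -> float:
    """Relative perplexity increase of the quantized model, in percent.

    Assumed definition (not confirmed by the source):
        gap = 100 * (ppl_quant - ppl_fp32) / ppl_fp32
    """
    if ppl_fp32 <= 0:
        raise ValueError("FP32 perplexity must be positive")
    return 100.0 * (ppl_quant - ppl_fp32) / ppl_fp32


# Hypothetical baseline, chosen only to illustrate the scale of the numbers.
baseline = 20.0
small_gap = quantization_gap_percent(baseline, baseline * 1.11)  # about 11
large_gap = quantization_gap_percent(baseline, baseline * 6.17)  # about 517
```

Under this assumed definition, a 517% gap means the quantized model's perplexity is roughly 6.17 times the FP32 perplexity, even if the FP32 value itself has not moved.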
Key takeaway
AI engineers who apply INT4 post-training quantization to large language models should be aware that convergence in FP32 does not guarantee INT4 robustness. INT4 performance can undergo "explosive divergence" even as FP32 perplexity stabilizes. Consider Oscillatory Lock-In learning rate schedules during fine-tuning or post-convergence training to mitigate the INT4 gap, or opt for INT8 quantization if robustness is paramount.
Key insights
INT4 quantization robustness can collapse catastrophically after FP32 convergence, a phenomenon not seen with INT8.
Principles
- The PTQ assumption that a converged model is quantization-ready fails for INT4.
- INT4 divergence begins when FP32 perplexity converges.
- Schedule amplitude affects quantization robustness.
Method
A calibration-free per-group INT4 probe was applied to Pythia-160m checkpoints to characterize divergence phases. A controlled fork experiment compared learning rate schedules.
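As a rough illustration of what a calibration-free per-group probe can look like, the sketch below round-trips a weight matrix through a symmetric integer grid with one scale per group. It is not the study's implementation: the group size of 128, the absmax scaling rule, the function names, and the random Gaussian weights are all assumptions made for the example.

```python
import numpy as np


def fake_quantize_per_group(weights, bits=4, group_size=128):
    """Quantize and dequantize weights on a symmetric per-group integer grid.

    Assumes weights.size is divisible by group_size (a simplification).
    """
    qmax = 2 ** (bits - 1) - 1   # 7 for INT4, 127 for INT8
    qmin = -(2 ** (bits - 1))    # -8 for INT4, -128 for INT8
    groups = weights.reshape(-1, group_size)
    # One scale per group, taken from the group's largest magnitude.
    # Only the weights are used, so no calibration data is required.
    absmax = np.abs(groups).max(axis=1, keepdims=True)
    scale = np.where(absmax > 0, absmax / qmax, 1.0)
    # Round to nearest grid point. INT4 has 16 codes; absmax scaling
    # actually lands on -7..7 here.
    codes = np.clip(np.round(groups / scale), qmin, qmax)
    return (codes * scale).reshape(weights.shape)


def relative_error(weights, bits, group_size=128):
    """Frobenius-norm reconstruction error relative to the original weights."""
    restored = fake_quantize_per_group(weights, bits, group_size)
    return float(np.linalg.norm(weights - restored) / np.linalg.norm(weights))


# Stand-in for a checkpoint's weight matrix (random values, not real weights).
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(256, 512)).astype(np.float32)

err_int4 = relative_error(w, bits=4)
err_int8 = relative_error(w, bits=8)
print(f"INT4 relative error: {err_int4:.4f}")
print(f"INT8 relative error: {err_int8:.4f}")
```

In an actual probe, the dequantized weights would be loaded back into each checkpoint and perplexity re-measured. The weight-level error here only shows why the coarser INT4 grid leaves much less headroom than INT8.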
In practice
- Avoid naive INT4 PTQ post-convergence.
- Consider INT8 for better robustness.
- Explore Oscillatory Lock-In schedules.
Topics
- Post-training Quantization
- INT4 Quantization Collapse
- FP32 Convergence
- Learning Rate Schedules
- Oscillatory Lock-In
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.