Correcting Stochastic Update Bias in Preconditioned Language Model Optimizers
Summary
The article introduces a bias-correction framework for preconditioned optimizers used in language model training. It identifies two finite-sample biases: gradient–preconditioner coupling bias, which arises when the gradient and the preconditioner are estimated from the same minibatch, and inverse-preconditioner bias, which arises because inversion is nonlinear, so the inverse of an unbiased preconditioner estimate is still biased. The proposed single-batch framework addresses both. Cross-fitted preconditioning uses independent microbatch groups for gradient and preconditioner estimation, and variance-corrected inversion uses microbatch variability to subtract the leading delta-method bias term. The framework applies to diagonal moment (AdamW), diagonal curvature (Sophia), and matrix preconditioning (Shampoo) methods. In Qwen2.5-0.5B pretraining experiments, held-out loss fell by 0.1489, 0.0701, and 0.1103 nats for AdamW, Sophia, and Shampoo, respectively. On mixed-quality pretraining and downstream instruction tuning, the framework's effects were consistently neutral to positive, which supports it as a practical mechanism for improving optimizer performance.
Key takeaway
AI engineers training large language models should consider adding bias correction to their preconditioned optimizers. The framework corrects gradient–preconditioner coupling bias and inverse-preconditioner bias, and it reduced held-out pretraining loss by up to 0.15 nats on Qwen2.5-0.5B. Evaluate the single-batch correction on your own models and datasets, especially in noisy pretraining regimes, where it may improve stability and sample efficiency.
Key insights
Stochastic preconditioned optimizers suffer from two finite-sample biases, and correcting them can improve language model training.
Principles
- Gradient–preconditioner coupling bias arises from same-minibatch estimation.
- Inverse-preconditioner bias stems from nonlinear inversion of estimates.
- Bias correction improves convergence constants and limiting suboptimality.
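The two biases named above can be made concrete with a scalar illustration. This is a standard second-order delta-method expansion under the assumption that \hat v is an unbiased estimate of a positive quantity v; it is a sketch of the idea, not the article's exact formulation, and the symbols \hat v, \hat g, and \hat P are chosen here for illustration.

```latex
% Inverse-preconditioner bias: f(x) = 1/x has f''(x) = 2/x^3, so
\mathbb{E}\!\left[\frac{1}{\hat v}\right]
  \approx \frac{1}{v} + \frac{\operatorname{Var}(\hat v)}{v^{3}},
\qquad
\widehat{v^{-1}}_{\mathrm{corr}}
  = \frac{1}{\hat v} - \frac{\widehat{\operatorname{Var}}(\hat v)}{\hat v^{3}} .

% Coupling bias (scalar case): the expected update does not factorize
% when \hat g and \hat P come from the same minibatch.
\mathbb{E}\!\left[\hat P^{-1}\hat g\right]
  = \mathbb{E}\!\left[\hat P^{-1}\right]\mathbb{E}\!\left[\hat g\right]
  + \operatorname{Cov}\!\left(\hat P^{-1},\, \hat g\right)
```

When \hat g and \hat P are computed from independent microbatch groups, the covariance term is zero, which is the motivation for cross-fitting.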
Method
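The single-batch bias-correction framework has two components. Cross-fitted preconditioning estimates the gradient and the preconditioner from independent microbatch groups, which removes the coupling between them. Variance-corrected inversion uses the variability across microbatches to estimate and subtract the leading delta-method bias of the inverse.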
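A minimal NumPy sketch of cross-fitting for a diagonal preconditioner follows. It assumes per-microbatch gradients are already available as an array, uses the mean of squared microbatch gradients as a stand-in preconditioner, and swaps the roles of the two groups so that every microbatch contributes to the gradient. The function name `cross_fitted_direction` and all parameters are hypothetical; the article's actual construction for AdamW, Sophia, or Shampoo may differ.

```python
import numpy as np

def cross_fitted_direction(micro_grads, eps=1e-8):
    """Two-group cross-fitted update direction (illustrative only).

    micro_grads: array of shape (m, d), one gradient per microbatch, m >= 2.
    """
    micro_grads = np.asarray(micro_grads, dtype=float)
    m = micro_grads.shape[0]
    if m < 2:
        raise ValueError("need at least two microbatches to cross-fit")
    half = m // 2
    group_a, group_b = micro_grads[:half], micro_grads[half:]

    def direction(grad_group, precond_group):
        # Gradient estimate from one group only.
        g = grad_group.mean(axis=0)
        # Diagonal second-moment estimate from the *other* group,
        # so the numerator and denominator share no samples.
        v = (precond_group ** 2).mean(axis=0)
        return g / (np.sqrt(v) + eps)

    # Swap roles and average so both groups inform the gradient.
    return 0.5 * (direction(group_a, group_b) + direction(group_b, group_a))
```

For example, with four microbatches the first two supply the gradient while the last two supply the denominator, then the roles are reversed and the two directions are averaged.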
In practice
- Apply cross-fitting by splitting batches into microbatch groups.
- Use microbatch variability to estimate and subtract inverse bias.
- Consider a leave-one-out (LOO) construction for AdamW to maintain the scale of the denominator.
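The variance-correction step in the list above can be sketched for a diagonal preconditioner as follows. The sketch assumes each microbatch yields its own positive estimate of the preconditioner diagonal, estimates the variance of their mean from the spread across microbatches, and subtracts the leading delta-method term Var/v^3 from the naive inverse. The name `variance_corrected_inverse` and the `floor` parameter are hypothetical, and this is not presented as the article's implementation.

```python
import numpy as np

def variance_corrected_inverse(micro_estimates, floor=1e-12):
    """Return (naive, corrected) elementwise inverses (illustrative only).

    micro_estimates: array of shape (m, d) with m >= 2 per-microbatch
    estimates of a positive diagonal preconditioner.
    """
    micro_estimates = np.asarray(micro_estimates, dtype=float)
    m = micro_estimates.shape[0]
    if m < 2:
        raise ValueError("need at least two microbatches to estimate variance")
    # Pooled estimate of the preconditioner diagonal.
    v_hat = micro_estimates.mean(axis=0)
    # Variance of the pooled mean: sample variance across microbatches / m.
    var_of_mean = micro_estimates.var(axis=0, ddof=1) / m
    # Naive plug-in inverse, guarded against division by zero.
    naive = 1.0 / np.maximum(v_hat, floor)
    # Subtract the leading bias term Var(v_hat) / v_hat**3.
    corrected = naive - var_of_mean * naive ** 3
    return naive, corrected
```

When the microbatch estimates are very noisy, the corrected value can become small or negative, so a practical version would likely need a safeguard such as clipping; that choice is left open here.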
Topics
- Preconditioned Optimizers
- Language Model Training
- Bias Correction
- AdamW
- Sophia
- Shampoo
Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.