Correcting Stochastic Update Bias in Preconditioned Language Model Optimizers

Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences, Software Development & Engineering · Depth: Expert, extended

Summary

The article introduces a bias-correction framework for preconditioned optimizers used in language model training. It identifies two finite-sample biases. Gradient–preconditioner coupling bias arises when the gradient and the preconditioner are estimated from the same minibatch. Inverse-preconditioner bias arises because inversion is nonlinear, so the inverse of an unbiased preconditioner estimate is still a biased estimate of the inverse. The proposed single-batch framework addresses both: cross-fitted preconditioning estimates the gradient and the preconditioner from independent microbatch groups, and variance-corrected inversion uses variability across microbatches to subtract the leading delta-method bias term. The framework applies to diagonal moment (AdamW), diagonal curvature (Sophia), and matrix (Shampoo) preconditioning. In Qwen2.5-0.5B pretraining experiments, it reduced held-out loss by 0.1489, 0.0701, and 0.1103 nats for AdamW, Sophia, and Shampoo, respectively. Its effects on mixed-quality pretraining and downstream instruction tuning were consistently neutral to positive, which the article presents as evidence that the correction is a practical way to improve optimizer performance.
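The inverse-preconditioner bias described above can be illustrated with a scalar delta-method expansion. This is a minimal sketch, not the article's derivation: the symbols $v$ (true preconditioner entry), $\hat v$ (its unbiased estimate), and $\widehat{\operatorname{Var}}(\hat v)$ (a variance estimate from microbatch variability) are notation assumed here for illustration, and the plain inverse $1/v$ stands in for whatever inverse map a given optimizer actually uses.

```latex
% Second-order Taylor expansion of f(x) = 1/x around v, with E[\hat v] = v:
\frac{1}{\hat v} \approx \frac{1}{v} - \frac{\hat v - v}{v^{2}} + \frac{(\hat v - v)^{2}}{v^{3}}
% Taking expectations, the linear term vanishes and a positive bias remains:
\mathbb{E}\!\left[\frac{1}{\hat v}\right] \approx \frac{1}{v} + \frac{\operatorname{Var}(\hat v)}{v^{3}}
% A plug-in correction subtracts the estimated leading bias term:
\widehat{\left(\frac{1}{v}\right)}_{\text{corr}} = \frac{1}{\hat v} - \frac{\widehat{\operatorname{Var}}(\hat v)}{\hat v^{3}}
```

The same expansion applies to other inverse maps. For an inverse square root, $f(x) = x^{-1/2}$, the leading term $\tfrac{1}{2} f''(v)\operatorname{Var}(\hat v)$ becomes $\tfrac{3}{8}\operatorname{Var}(\hat v)\, v^{-5/2}$.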

Key takeaway

AI engineers training large language models should consider adding bias correction to their preconditioned optimizers. By addressing gradient–preconditioner coupling bias and inverse-preconditioner bias, the framework reduced held-out pretraining loss by up to 0.1489 nats (AdamW) on Qwen2.5-0.5B. Evaluate the single-batch correction on your own models and datasets, especially in noisy pretraining regimes, where it may improve stability and sample efficiency.

Key insights

Stochastic preconditioned optimizers suffer from two finite-sample biases that can be corrected for improved language model training.

Method

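The single-batch bias-correction framework has two parts. Cross-fitted preconditioning estimates the gradient and the preconditioner from independent microbatch groups, which removes the coupling between them. Variance-corrected inversion uses variability across microbatches to estimate and subtract the leading delta-method bias of the inverted preconditioner.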
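The two parts of the method can be sketched for a diagonal second-moment preconditioner. This is a minimal illustration under stated assumptions, not the article's implementation: the function names `corrected_inverse` and `cross_fitted_update`, the even/odd microbatch split, the use of a plain inverse rather than an inverse square root, and the zero floor on the corrected inverse are all hypothetical choices made here. It also omits moving averages, weight decay, and learning rates.

```python
import numpy as np


def corrected_inverse(v_samples, eps=1e-12):
    """Variance-corrected inverse of a diagonal preconditioner (sketch).

    v_samples: array of shape (m, d), one preconditioner estimate per
    microbatch. Returns an array of shape (d,).
    """
    m = v_samples.shape[0]
    v_hat = v_samples.mean(axis=0)
    # Estimated variance of the mean v_hat, from microbatch variability.
    var_v_hat = v_samples.var(axis=0, ddof=1) / m
    plain = 1.0 / (v_hat + eps)
    # Subtract the leading delta-method bias term Var(v_hat) / v_hat**3.
    corrected = plain - var_v_hat * plain**3
    # Assumption: floor at zero so the preconditioner never flips sign.
    return np.maximum(corrected, 0.0)


def cross_fitted_update(micro_grads):
    """Cross-fitted preconditioned direction (sketch).

    micro_grads: array of shape (K, d) of per-microbatch gradients, K >= 4.
    Even-indexed microbatches estimate the gradient; odd-indexed ones
    estimate the preconditioner, so the two estimates share no data.
    """
    if micro_grads.shape[0] < 4:
        raise ValueError("need at least 4 microbatches to split into two groups")
    group_a = micro_grads[0::2]  # gradient group
    group_b = micro_grads[1::2]  # preconditioner group
    g = group_a.mean(axis=0)
    v_samples = group_b**2  # per-microbatch second-moment estimates
    return g * corrected_inverse(v_samples)
```

A caller would stack the per-microbatch gradients of one training batch into a `(K, d)` array, call `cross_fitted_update`, and scale the result by a learning rate. The split halves the data available to each estimate, which is the price paid for independence in this sketch.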

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.