Correcting Stochastic Update Bias in Preconditioned Language Model Optimizers

2026-05-21 · Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences, Software Development & Engineering · Depth: Expert, extended

Summary

The article introduces a bias-correction framework for preconditioned optimizers used in language model training. It identifies two finite-sample biases. Gradient–preconditioner coupling bias arises when the gradient and the preconditioner are estimated from the same minibatch. Inverse-preconditioner bias arises because inversion is nonlinear, so the inverse of an unbiased preconditioner estimate is still biased. The proposed single-batch framework addresses both: cross-fitted preconditioning uses independent microbatch groups for gradient and preconditioner estimation, and variance-corrected inversion uses microbatch variability to subtract the leading delta-method bias term. The framework applies to diagonal moment (AdamW), diagonal curvature (Sophia), and matrix preconditioning (Shampoo) methods. In Qwen2.5-0.5B pretraining experiments, it reduced held-out loss by 0.1489, 0.0701, and 0.1103 nats for AdamW, Sophia, and Shampoo, respectively. Its effects on mixed-quality pretraining and downstream instruction tuning were consistently neutral to positive, which the article presents as evidence that it is a practical mechanism for improving optimizer performance.

Key takeaway

AI engineers training large language models with preconditioned optimizers should consider adding bias correction. The framework targets gradient–preconditioner coupling bias and inverse-preconditioner bias, and it reduced held-out pretraining loss by up to 0.15 nats on Qwen2.5-0.5B. Evaluate the single-batch correction on your own models and datasets, especially in noisy pretraining regimes, where it may improve stability and sample efficiency.

Key insights

Stochastic preconditioned optimizers suffer from two finite-sample biases that can be corrected for improved language model training.
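
A small Monte Carlo experiment can make both biases visible on a toy scalar problem. The sketch below is illustrative only: the Gaussian gradient model, the square-root-free second-moment preconditioner, and the name `simulate_bias` are assumptions made for this example, not the article's experimental setup.

```python
import numpy as np

def simulate_bias(mu=1.0, sigma=1.0, k=16, trials=200_000, seed=0):
    """Toy illustration of inverse bias and coupling bias (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    # Each row is one "minibatch" made of k scalar microbatch gradients.
    g = rng.normal(mu, sigma, size=(trials, k))
    h_true = mu**2 + sigma**2            # true second moment E[g^2]

    g_hat = g.mean(axis=1)               # gradient estimate
    h_hat = (g**2).mean(axis=1)          # unbiased second-moment estimate

    # Inverse bias: E[1/h_hat] differs from 1/h_true even though h_hat is unbiased.
    inverse_bias = (1.0 / h_hat).mean() - 1.0 / h_true

    # Coupling: same-sample estimates of g and 1/h are correlated.
    coupled_cov = (g_hat / h_hat).mean() - g_hat.mean() * (1.0 / h_hat).mean()

    # Cross-fitting: gradient from one half, preconditioner from the other half.
    half = k // 2
    g_a = g[:, :half].mean(axis=1)
    h_b = (g[:, half:] ** 2).mean(axis=1)
    crossfit_cov = (g_a / h_b).mean() - g_a.mean() * (1.0 / h_b).mean()

    return {
        "inverse_bias": inverse_bias,
        "coupled_cov": coupled_cov,
        "crossfit_cov": crossfit_cov,
    }

if __name__ == "__main__":
    for name, value in simulate_bias().items():
        print(f"{name}: {value:+.4f}")
```

In this toy model the inverse bias is positive, the same-sample covariance is negative, and the cross-fitted covariance is close to zero up to Monte Carlo noise.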

Principles

Gradient–preconditioner coupling bias arises from same-minibatch estimation.
Inverse-preconditioner bias stems from nonlinear inversion of estimates.
Bias correction improves convergence constants and limiting suboptimality.
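
The two biases can be written down with a standard second-order expansion. The block below is a generic delta-method sketch for a scalar preconditioner, with \hat{g} and \hat{h} denoting the gradient and preconditioner estimates; the symbols are chosen here for illustration, and the paper's exact estimators may differ.

```latex
% Delta method for a smooth map f applied to an unbiased estimate \hat{h} of h > 0
\mathbb{E}\big[f(\hat{h})\big] \approx f(h) + \tfrac{1}{2} f''(h)\,\mathrm{Var}(\hat{h})

% Plain inversion: f(h) = 1/h, so f''(h) = 2/h^{3}
\mathbb{E}\big[\hat{h}^{-1}\big] \approx \frac{1}{h} + \frac{\mathrm{Var}(\hat{h})}{h^{3}}

% Inverse square root: f(h) = h^{-1/2}, so f''(h) = \tfrac{3}{4}\,h^{-5/2}
\mathbb{E}\big[\hat{h}^{-1/2}\big] \approx \frac{1}{\sqrt{h}} + \frac{3}{8}\,\frac{\mathrm{Var}(\hat{h})}{h^{5/2}}

% Coupling: the covariance term vanishes when \hat{g} and \hat{h} are independent
\mathbb{E}\big[\hat{g}\,\hat{h}^{-1}\big]
  = \mathbb{E}[\hat{g}]\,\mathbb{E}\big[\hat{h}^{-1}\big]
  + \mathrm{Cov}\big(\hat{g},\,\hat{h}^{-1}\big)
```

Both leading inverse-bias terms are positive, so a naive inverse overestimates the step scale on average, and the excess shrinks as the variance of the preconditioner estimate falls.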

Method

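The single-batch bias-correction framework has two components. Cross-fitted preconditioning estimates the gradient and the preconditioner from independent microbatch groups, which removes the coupling bias. Variance-corrected inversion uses the variability across microbatches to estimate and subtract the leading delta-method bias term of the inverse.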
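
The two components can be sketched for a diagonal preconditioner as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `corrected_direction`, the square-root-free squared-gradient preconditioner, the half-and-half split, and the `floor` safeguard are all hypothetical choices made for illustration.

```python
import numpy as np

def corrected_direction(micro_grads, eps=1e-8, floor=0.5):
    """Cross-fitted, variance-corrected diagonal update direction (illustrative).

    micro_grads: array of shape (K, D), one gradient per microbatch, K >= 4.
    floor: hypothetical safeguard; the corrected inverse is kept at or above
           floor * naive inverse so the correction cannot flip the sign.
    """
    micro_grads = np.asarray(micro_grads, dtype=float)
    k = micro_grads.shape[0]
    if k < 4:
        raise ValueError("need at least 4 microbatches (2 per group)")
    half = k // 2

    # Cross-fitting: disjoint microbatch groups for gradient and preconditioner.
    grad_group = micro_grads[:half]
    precond_group = micro_grads[half:]

    g_hat = grad_group.mean(axis=0)

    # Per-microbatch preconditioner estimates and their mean.
    h_i = precond_group ** 2
    h_hat = h_i.mean(axis=0) + eps

    # Variance of the mean, estimated from microbatch variability.
    var_h_hat = h_i.var(axis=0, ddof=1) / h_i.shape[0]

    # Subtract the leading delta-method bias term Var(h_hat) / h^3.
    inv_naive = 1.0 / h_hat
    inv_corrected = inv_naive - var_h_hat / h_hat ** 3
    inv_corrected = np.clip(inv_corrected, floor * inv_naive, inv_naive)

    return g_hat * inv_corrected
```

When the preconditioner group shows no variability, the correction term is zero and the result reduces to the plain cross-fitted ratio. The clipping step is an added assumption: with few microbatches the variance estimate is itself noisy, and an unclipped correction can drive the inverse negative.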

In practice

Apply cross-fitting by splitting batches into microbatch groups.
Use microbatch variability to estimate and subtract inverse bias.
Consider a leave-one-out (LOO) construction for AdamW to maintain denominator scale.
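
One plausible reading of the leave-one-out idea is sketched below: each microbatch gradient is divided by a denominator built from the other K-1 microbatches, and the resulting directions are averaged. The source summary does not spell out the construction, so the function name `loo_direction`, the Adam-style square-root denominator, and the averaging step are assumptions.

```python
import numpy as np

def loo_direction(micro_grads, eps=1e-8):
    """Leave-one-out preconditioned direction (illustrative assumption).

    micro_grads: array of shape (K, D), one gradient per microbatch, K >= 2.
    """
    micro_grads = np.asarray(micro_grads, dtype=float)
    k = micro_grads.shape[0]
    if k < 2:
        raise ValueError("need at least 2 microbatches")

    sq = micro_grads ** 2
    total_sq = sq.sum(axis=0)

    # Row i holds the second-moment estimate that excludes microbatch i.
    v_loo = (total_sq[None, :] - sq) / (k - 1)

    # Pair gradient i with a denominator that never saw microbatch i.
    directions = micro_grads / (np.sqrt(v_loo) + eps)
    return directions.mean(axis=0)
```

In this sketch each denominator averages K-1 microbatches rather than the K/2 available under a half-and-half split, so every gradient is still paired with a denominator that is independent of it.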

Topics

Preconditioned Optimizers
Language Model Training
Bias Correction
AdamW
Sophia
Shampoo

Code references

fastino-ai/preconditioner-bias-correction

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.