Double Preconditioning (DoPr): Optimization for Test-Time Performance, not Validation Loss
Summary
Double Preconditioning (DoPr), introduced on June 4, 2026, is an optimization approach that targets test-time performance in deep learning settings affected by "test-time feedback" (TTF). TTF is the growing mismatch between training/validation loss and downstream metrics (e.g., task success, generation quality) that arises when a model rolls out its own predictions, as in autoregressive language modeling, flow-based generative modeling, and robot policy learning. DoPr combines activation-wise preconditioning (AP), which encourages uniform feature learning by debiasing gradients from activation statistics, with a standard gradient-wise preconditioner (GP) such as Adam or Muon, which stabilizes and accelerates training. In experiments on continuous control (Humanoid-v5), image-based robot policy learning (Robomimic), and LLM fine-tuning (Llama-3.2-3B, Llama-3.1-8B), DoPr consistently improves downstream performance, often without improving validation loss, which points to a design space for optimizers beyond loss convergence.
Key takeaway
Machine learning engineers building models for autoregressive generation or sequential decision-making should consider adopting Double Preconditioning (DoPr) to improve real-world task performance. In test-time feedback (TTF) settings, validation loss may not reflect downstream success. DoPr is a plug-in change to the optimizer that improves feature learning and mitigates error accumulation, even when it does not reduce the loss. Its gains show up in the downstream metrics that matter, such as task success rate or generation quality.
Key insights
Test-time feedback (TTF) causes validation loss to misalign with downstream performance; DoPr mitigates this by improving feature learning.
Principles
- TTF induces distribution shift: because the model consumes its own predictions, the directions in which it errs under the feedback dynamics become crucial.
- Poor feature learning worsens TTF layer by layer, independently of validation loss.
- AP corrects gradient bias from non-isotropic inputs, promoting uniform feature learning.
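The last principle can be illustrated with a worked equation. This is a simplified linear least-squares case chosen for illustration, not the paper's own derivation. The symbols W (layer weight), W* (target weight), x (layer input), Σ (input second-moment matrix), and η (learning rate) are assumptions introduced here.

```latex
% Linear layer y = W x with target y^* = W^* x and squared loss
\mathcal{L}(W) = \tfrac{1}{2}\,\mathbb{E}_x \lVert W x - W^* x \rVert^2,
\qquad \Sigma = \mathbb{E}_x\!\left[ x x^\top \right]

% The raw gradient is the weight error skewed by the input statistics
\nabla_W \mathcal{L} = (W - W^*)\,\Sigma

% Plain gradient descent: the error along eigendirection i of \Sigma
% (eigenvalue \lambda_i) shrinks by (1 - \eta\lambda_i) per step,
% so low-variance input directions are learned slowly
W_{t+1} - W^* = (W_t - W^*)\,(I - \eta\,\Sigma)

% Right-multiplying by \Sigma^{-1} removes the skew; every direction
% now shrinks at the same rate (1 - \eta)
\nabla_W \mathcal{L}\;\Sigma^{-1} = W - W^*,
\qquad
W_{t+1} - W^* = (1 - \eta)\,(W_t - W^*)
```

When Σ is isotropic (a multiple of the identity), the two updates differ only by a constant factor in the step size, which is why the bias is attributed to non-isotropic inputs.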
Method
DoPr applies an activation-covariance preconditioner (AP) to the layer-wise gradient, then passes this AP-gradient to a gradient preconditioner (GP) like Adam or Muon, followed by a standard weight update.
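The pipeline described above can be sketched in NumPy for a single linear layer. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not give the exact form of AP, so the sketch assumes right-multiplication by the inverse of a damped batch activation covariance, and it uses a textbook Adam step as the GP. The names `ap_gradient`, `adam_update`, `dopr_step`, `init_state`, and the `damping` parameter are hypothetical.

```python
import numpy as np


def ap_gradient(grad, acts, damping=1e-3):
    """Activation preconditioning (assumed form): grad @ inv(cov + damping * I).

    grad: (d_out, d_in) gradient of the loss w.r.t. the layer weight.
    acts: (n, d_in) batch of inputs to that layer.
    """
    n, d_in = acts.shape
    cov = acts.T @ acts / n + damping * np.eye(d_in)
    # cov is symmetric, so grad @ inv(cov) equals solve(cov, grad.T).T
    return np.linalg.solve(cov, grad.T).T


def init_state(weight):
    """Adam moment buffers for one weight matrix."""
    return {"t": 0, "m": np.zeros_like(weight), "v": np.zeros_like(weight)}


def adam_update(weight, grad, state, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    """Textbook Adam step, standing in for the gradient preconditioner (GP)."""
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["v"] = b2 * state["v"] + (1 - b2) * grad**2
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return weight - lr * m_hat / (np.sqrt(v_hat) + eps)


def dopr_step(weight, acts, out_grads, state, damping=1e-3, lr=1e-2):
    """One DoPr-style step for a linear layer y = acts @ weight.T.

    out_grads: (n, d_out) gradient of the per-example loss w.r.t. the outputs.
    """
    n = acts.shape[0]
    grad = out_grads.T @ acts / n                 # raw layer-wise gradient
    grad_ap = ap_gradient(grad, acts, damping)    # step 1: AP
    return adam_update(weight, grad_ap, state, lr=lr)  # step 2: GP + update
```

In this sketch the only change relative to a plain Adam step is the single `ap_gradient` call before the moment updates, which matches the drop-in character described in the text. Swapping `adam_update` for another GP would leave `ap_gradient` untouched.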
In practice
- Use DoPr as a drop-in optimizer modification for TTF-prone tasks.
- Apply existing GP hyperparameter scaling rules directly to DoPr variants.
- Consider batch-wise statistics for AP to avoid additional memory overhead.
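To make the memory point in the last bullet concrete, the sketch below contrasts batch-wise activation statistics with a running estimate. The running (exponential moving average) variant is an assumption introduced here purely for contrast; the summary does not say which alternative the authors compare against. `batch_covariance`, `RunningCovariance`, `damping`, and `decay` are hypothetical names.

```python
import numpy as np


def batch_covariance(acts, damping=1e-3):
    """Damped activation covariance from the current batch only.

    The result is a temporary: nothing is kept between optimizer steps.
    """
    n, d = acts.shape
    return acts.T @ acts / n + damping * np.eye(d)


class RunningCovariance:
    """Moving-average covariance; holds a persistent d x d buffer per layer."""

    def __init__(self, dim, decay=0.99):
        self.decay = decay
        self.cov = np.zeros((dim, dim))

    def update(self, acts):
        batch = acts.T @ acts / acts.shape[0]
        self.cov = self.decay * self.cov + (1 - self.decay) * batch
        return self.cov


d_in = 512
running = RunningCovariance(d_in)
persistent_floats_running = running.cov.size  # d_in * d_in values stored per layer
persistent_floats_batch = 0                   # batch-wise keeps no optimizer state
```

The batch-wise version still builds a transient `d_in x d_in` matrix during the step, so the saving is in persistent optimizer state rather than in peak compute.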
Topics
- Deep Learning Optimizers
- Test-Time Feedback
- Activation Preconditioning
- Gradient Preconditioning
- Feature Learning
- Autoregressive Models
- Robot Policy Learning
Best for: Research Scientist, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.