Double Preconditioning (DoPr): Optimization for Test-Time Performance, not Validation Loss

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

Double Preconditioning (DoPr), introduced on June 4, 2026, is an optimization approach aimed at improving test-time performance in deep learning applications affected by "test-time feedback" (TTF). TTF is the growing mismatch between training/validation loss and downstream metrics (e.g., task success, generation quality) that appears when a model rolls out its own predictions, as in autoregressive language modeling, flow-based generative modeling, and robot policy learning. DoPr combines two preconditioners: activation-wise preconditioning (AP), which encourages uniform feature learning by debiasing gradients from activation statistics, and standard gradient-wise preconditioning (GP) such as Adam or Muon, which stabilizes and accelerates training. Experiments across continuous control (Humanoid-v5), image-based robot policy learning (Robomimic), and LLM fine-tuning (Llama-3.2-3B, Llama-3.1-8B) show that DoPr consistently boosts downstream performance, often without improving validation loss. The result points to a design space for optimizers that goes beyond loss convergence.

Key takeaway

If you are a machine learning engineer building models for autoregressive generation or sequential decision-making, consider adopting Double Preconditioning (DoPr) to improve real-world task performance. In test-time feedback (TTF) settings, validation loss may not accurately reflect downstream success. DoPr is a plug-in optimizer modification that improves feature learning and mitigates error accumulation, even when it does not reduce training loss. Evaluate it on the downstream metrics you care about, such as task success rate or generation quality, not on loss alone.

Key insights

Test-time feedback (TTF) causes validation loss to misalign with downstream performance; DoPr mitigates this by improving feature learning.

Principles

TTF induces distribution shift, so the direction of a model's errors under the feedback dynamics becomes crucial.
Poor feature learning worsens TTF layer by layer, regardless of validation loss.
AP corrects gradient bias from non-isotropic inputs, promoting uniform feature learning.
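
The last principle can be made concrete with a small worked case. The derivation below is an illustrative sketch for a single linear layer trained with squared loss, not the paper's own analysis; the symbols W*, Σ, η, and E_t are introduced here for the example. It shows that the plain gradient is weighted by the input second-moment matrix Σ, so directions with little input variance are learned slowly, and that multiplying by the inverse of Σ removes that weighting.

```latex
\begin{aligned}
L(W) &= \tfrac{1}{2}\,\mathbb{E}\,\lVert W x - W^{*} x \rVert^{2},
  \qquad \Sigma = \mathbb{E}[x x^{\top}] \\
\nabla_W L &= (W - W^{*})\,\Sigma \\
\text{plain gradient descent:}\quad
  E_{t+1} &= E_t\,(I - \eta\,\Sigma), \qquad E_t = W_t - W^{*} \\
\text{gradient right-multiplied by } \Sigma^{-1}:\quad
  E_{t+1} &= (1 - \eta)\,E_t
\end{aligned}
```

Under plain gradient descent the error component along an eigenvector of Σ with eigenvalue λ shrinks by a factor (1 - ηλ) per step, so it depends on the input statistics. After the Σ⁻¹ correction every direction shrinks at the same rate (1 - η), which is one way to read "uniform feature learning" in this simplified setting.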

Method

DoPr applies an activation-covariance preconditioner (AP) to the layer-wise gradient, then passes this AP-gradient to a gradient preconditioner (GP) like Adam or Muon, followed by a standard weight update.
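
The pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not give the exact form of AP, so the sketch assumes it right-multiplies a linear layer's gradient by the inverse of a damped batch activation covariance. The names `ap_gradient`, `dopr_step`, and the `damping` parameter are hypothetical, and the small `Adam` class is a textbook Adam update standing in for the GP stage.

```python
import numpy as np

def ap_gradient(grad, acts, damping=1e-3):
    """Assumed AP step: grad @ inv(cov), with cov the damped batch
    second-moment matrix of the layer's input activations.

    grad: (d_out, d_in) gradient of a linear layer's weight
    acts: (batch, d_in) inputs to that layer
    """
    d_in = acts.shape[1]
    cov = acts.T @ acts / acts.shape[0] + damping * np.eye(d_in)
    # cov is symmetric, so solving cov @ Z = grad.T gives Z.T = grad @ inv(cov)
    return np.linalg.solve(cov, grad.T).T

class Adam:
    """Textbook Adam, used here as the gradient preconditioner (GP)."""
    def __init__(self, shape, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = np.zeros(shape)
        self.v = np.zeros(shape)
        self.t = 0

    def step(self, weight, grad):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad**2
        m_hat = self.m / (1 - self.beta1**self.t)  # bias correction
        v_hat = self.v / (1 - self.beta2**self.t)
        return weight - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

def dopr_step(weight, acts, grad, gp, damping=1e-3):
    """AP first, then GP, then the usual weight update (done inside gp.step)."""
    return gp.step(weight, ap_gradient(grad, acts, damping))
```

In this sketch the GP stage never sees the raw gradient, only the AP-corrected one, which matches the ordering in the method description. Any optimizer exposing a comparable `step(weight, grad)` interface could be passed as `gp`; that interface is a convention of this example, not of any particular library.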

In practice

Use DoPr as a drop-in optimizer modification for TTF-prone tasks.
Apply existing GP hyperparameter scaling rules directly to DoPr variants.
Consider batch-wise statistics for AP to avoid additional memory overhead.
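
The memory point in the last item can be illustrated with a short comparison. The sketch below assumes the alternative to batch-wise statistics is a running (exponential moving average) estimate kept across steps; that alternative, and the names `batch_cov`, `RunningCov`, and `persistent_bytes`, are assumptions made for the example and are not taken from the paper.

```python
import numpy as np

def batch_cov(acts):
    """Second-moment matrix from the current mini-batch only.
    Nothing is stored between optimizer steps."""
    return acts.T @ acts / acts.shape[0]

class RunningCov:
    """Hypothetical alternative: an exponential moving average that keeps
    a persistent (d_in, d_in) buffer for every preconditioned layer."""
    def __init__(self, d_in, decay=0.99):
        self.decay = decay
        self.cov = np.zeros((d_in, d_in))

    def update(self, acts):
        self.cov = self.decay * self.cov + (1 - self.decay) * batch_cov(acts)
        return self.cov

def persistent_bytes(d_in, itemsize=4):
    """Extra optimizer state per layer for the running variant
    (itemsize=4 assumes float32 storage)."""
    return d_in * d_in * itemsize
```

For a layer with 4096 input features, `persistent_bytes(4096)` is 64 MiB of extra state per layer, which the batch-wise variant avoids. The trade-off is that a batch estimate is noisier, and it is rank-deficient whenever the batch has fewer rows than the layer has input features, so some damping is needed before inverting it.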

Topics

Deep Learning Optimizers
Test-Time Feedback
Activation Preconditioning
Gradient Preconditioning
Feature Learning
Autoregressive Models
Robot Policy Learning

Best for: Research Scientist, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.