Double Preconditioning (DoPr): Optimization for Test-Time Performance, not Validation Loss

2026-06-04 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

Double Preconditioning (DoPr) is an optimization approach aimed at test-time feedback (TTF) in deep learning. TTF is the mismatch between one-step training or validation loss and downstream metrics such as task success rate or generation quality, a gap that widens as task length increases. It arises in autoregressive language modeling, flow-based generative modeling, and robot policy learning. DoPr combines gradient-wise preconditioning, as in Adam and Muon, with activation-wise preconditioning (AP), which is similar to KFAC. The research reports that adding AP works as a drop-in intervention and significantly improves downstream performance across TTF settings. These test-time gains do not consistently coincide with lower validation loss, which raises questions about how to evaluate models trained with one-step supervised objectives.

Key takeaway

Machine learning engineers building sequential-prediction or generative models should consider Double Preconditioning (DoPr) to improve test-time performance directly. If a model exhibits test-time feedback, validation loss alone can be misleading. Activation-wise preconditioning can be added as a drop-in change to the optimizer to reduce error accumulation and improve downstream metrics, even when validation loss does not consistently improve. This gives a new axis for optimizing real-world model performance.

Key insights

Optimizing for test-time performance, not just validation loss, is crucial for deep learning models with test-time feedback.

Principles

Test-time feedback creates a growing divergence between training loss and task success.
Optimization techniques can directly combat error accumulation in sequential prediction.
Validation loss may not reflect true downstream performance in TTF scenarios.
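
A toy calculation can make the first principle concrete. The sketch below is not the source's definition of TTF. It assumes, purely for illustration, that step errors are independent and that a single wrong step fails the whole task. The function names rollout_success and success_gap are hypothetical.

```python
def rollout_success(step_accuracy: float, horizon: int) -> float:
    """Chance that all `horizon` consecutive steps are correct.

    Toy assumption (not from the source): step errors are independent
    and any single error fails the task.
    """
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must lie in [0, 1]")
    if horizon < 0:
        raise ValueError("horizon must be non-negative")
    return step_accuracy ** horizon


def success_gap(acc_a: float, acc_b: float, horizon: int) -> float:
    """Difference in task success between two models at a given horizon."""
    return rollout_success(acc_b, horizon) - rollout_success(acc_a, horizon)
```

Under these assumptions, two models whose one-step accuracies differ by half a point (0.990 versus 0.995) differ by roughly 24 points in task success at a horizon of 100 steps. Real TTF also involves errors feeding back into later inputs, which this toy model ignores.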

Method

Double Preconditioning (DoPr) integrates gradient-wise preconditioning (e.g., Adam or Muon) with activation-wise preconditioning (AP, similar to KFAC) as a drop-in intervention.
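
The method can be sketched for a single linear layer. This is a minimal illustration under stated assumptions, not the authors' implementation: it takes AP to mean right-multiplying the weight gradient by the inverse of the damped activation second-moment matrix (the input-side factor used in KFAC), and it assumes AP is applied before the Adam update. The summary does not specify the order of composition, the damping, or how the statistics are estimated, so the names dopr_step, activation_precondition, and the damping parameter are all hypothetical.

```python
import numpy as np


def activation_precondition(grad_w, acts, damping=1e-3):
    """Activation-wise preconditioning for one linear layer (assumed KFAC-style).

    grad_w: (out_dim, in_dim) weight gradient.
    acts:   (batch, in_dim) inputs seen by the layer.
    Returns grad_w @ inv(A + damping * I), with A the activation second moment.
    """
    batch, in_dim = acts.shape
    second_moment = acts.T @ acts / batch + damping * np.eye(in_dim)
    # second_moment is symmetric, so solving against grad_w.T and transposing
    # back gives grad_w @ inv(second_moment) without forming the inverse.
    return np.linalg.solve(second_moment, grad_w.T).T


def init_adam_state(shape):
    """Zero-initialized first and second moment buffers plus a step counter."""
    return {"m": np.zeros(shape), "v": np.zeros(shape), "t": 0}


def adam_update(w, grad, state, lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    """Standard Adam step, standing in for the gradient-wise preconditioner."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)


def dopr_step(w, acts, grad_out, state, lr=1e-2, damping=1e-3):
    """One hypothetical DoPr-style step for the layer y = acts @ w.T.

    grad_out: (batch, out_dim) gradient of the loss with respect to y.
    Assumed order (not confirmed by the source): AP first, then Adam.
    """
    grad_w = grad_out.T @ acts / acts.shape[0]
    grad_ap = activation_precondition(grad_w, acts, damping)
    return adam_update(w, grad_ap, state, lr=lr)
```

To use it, create the optimizer state once with init_adam_state(w.shape) and call dopr_step on each batch with the layer inputs and the output gradients. Replacing dopr_step with a plain adam_update on grad_w recovers the gradient-wise-only baseline, which is the sense in which AP is a drop-in change here.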

In practice

Implement activation-wise preconditioning in autoregressive language models.
Apply DoPr to improve flow-based generative model quality.
Enhance robot policy learning performance using DoPr.

Topics

Double Preconditioning
Test-Time Feedback
Optimization Algorithms
Autoregressive Models
Generative Models
Robot Policy Learning

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.