Double Preconditioning (DoPr): Optimization for Test-Time Performance, not Validation Loss
Summary
Double Preconditioning (DoPr) is an optimization approach introduced to address test-time feedback (TTF) in deep learning. TTF describes the growing mismatch between one-step training or validation loss and actual downstream performance metrics, such as task success rate or generation quality, a mismatch that widens as task length increases. The phenomenon appears in autoregressive language modeling, flow-based generative modeling, and robot policy learning. DoPr combines existing gradient-wise preconditioning techniques, exemplified by Adam and Muon, with activation-wise preconditioning (AP), similar to KFAC. The research reports that adding AP works as a drop-in intervention that improves downstream model performance across a range of TTF scenarios. Notably, these test-time improvements do not consistently coincide with better validation loss, which raises questions about how to evaluate models trained with one-step supervised objectives.
Key takeaway
Machine learning engineers building models for sequential prediction or generative tasks should consider Double Preconditioning (DoPr) as a way to directly improve test-time performance. If a model exhibits test-time feedback, relying solely on validation loss can be misleading. Activation-wise preconditioning can be added as a drop-in change to the optimizer to mitigate error accumulation and improve downstream metrics, even when validation loss does not consistently improve. This gives a new axis for optimizing real-world model efficacy.
Key insights
Optimizing for test-time performance, not just validation loss, is crucial for deep learning models with test-time feedback.
Principles
- Test-time feedback creates a growing divergence between training loss and task success.
- Optimization techniques can directly combat error accumulation in sequential prediction.
- Validation loss may not reflect true downstream performance in TTF scenarios.
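To make the first principle concrete, here is a toy illustration that is not taken from the original article. It assumes a hypothetical scalar system x_{t+1} = a * x_t and a learned model whose coefficient is slightly off. The one-step error stays small, but when the model is fed its own predictions the error compounds with the horizon. The function name and the numbers are illustrative assumptions only.

```python
import numpy as np

def rollout_error(a_true, a_model, x0, horizon):
    """Absolute error at each step when the model consumes its own outputs.

    Toy scalar system x_{t+1} = a * x_t (an assumption for illustration).
    After t steps the true state is x0 * a_true**t and the model's
    self-fed prediction is x0 * a_model**t.
    """
    steps = np.arange(1, horizon + 1)
    return np.abs(x0 * (a_model ** steps - a_true ** steps))

# A model that is 2% off per step.
errors = rollout_error(a_true=1.0, a_model=1.02, x0=1.0, horizon=50)
print("one-step error:", errors[0])
print("error at step 50:", errors[-1])
```

The one-step error is what a one-step validation loss measures, while the later entries of `errors` are closer to what a long-horizon downstream metric sees.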
Method
Double Preconditioning (DoPr) integrates gradient-wise preconditioning (e.g., Adam or Muon) with activation-wise preconditioning (AP, similar to KFAC) as a drop-in intervention.
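The following is a minimal sketch of how the two preconditioners might be composed for a single linear layer. It is not the authors' implementation. It assumes that AP right-multiplies the weight gradient by the inverse of a damped second-moment matrix of the layer's input activations (the input-side factor used in KFAC-style methods) and that the result is then passed to a standard Adam update. The summary does not specify the exact formulation, so the order of operations, the `damping` parameter, and all function names here are assumptions.

```python
import numpy as np

def activation_precondition(grad, acts, damping=1e-3):
    """Right-multiply grad (out, in) by the inverse of the damped
    activation second-moment matrix (in, in). Assumed form of AP."""
    n, d = acts.shape
    cov = acts.T @ acts / n + damping * np.eye(d)
    # cov is symmetric, so solve(cov, grad.T).T equals grad @ inv(cov).
    return np.linalg.solve(cov, grad.T).T

class Adam:
    """Plain Adam (gradient-wise preconditioning) for one weight matrix."""
    def __init__(self, shape, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
        self.lr, self.b1, self.b2, self.eps = lr, b1, b2, eps
        self.m = np.zeros(shape)
        self.v = np.zeros(shape)
        self.t = 0

    def step(self, w, g):
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * g
        self.v = self.b2 * self.v + (1 - self.b2) * g * g
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        return w - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

def loss(w, x, y):
    """Mean squared error (halved) of the linear model x @ w.T."""
    return 0.5 * np.mean(np.sum((x @ w.T - y) ** 2, axis=1))

def dopr_step(w, x, y, opt, damping=1e-3):
    """One update: activation-wise preconditioning, then Adam."""
    err = x @ w.T - y                # (n, out)
    grad = err.T @ x / len(x)        # (out, in), gradient of loss()
    grad = activation_precondition(grad, x, damping)
    return opt.step(w, grad)

# Toy regression with badly scaled input features.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 3)) * np.array([10.0, 1.0, 0.1])
w_true = np.array([[1.0, -2.0, 0.5], [-1.5, 0.5, 2.0]])
y = x @ w_true.T

w = np.zeros((2, 3))
opt = Adam(w.shape, lr=0.05)
initial_loss = loss(w, x, y)
for _ in range(300):
    w = dopr_step(w, x, y, opt)
final_loss = loss(w, x, y)
print(initial_loss, final_loss)
```

In this sketch the activation-side step removes the input scaling from the gradient before Adam applies its per-coordinate normalization, which is one plausible reading of "drop-in": the optimizer itself is unchanged and only the gradient it receives differs.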
In practice
- Implement activation-wise preconditioning in autoregressive language models.
- Apply DoPr to improve flow-based generative model quality.
- Enhance robot policy learning performance using DoPr.
Topics
- Double Preconditioning
- Test-Time Feedback
- Optimization Algorithms
- Autoregressive Models
- Generative Models
- Robot Policy Learning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.