Double Preconditioning (DoPr): Optimization for Test-Time Performance, not Validation Loss

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

Double Preconditioning (DoPr) is an optimization approach for deep learning settings that exhibit test-time feedback (TTF). TTF is the mismatch between one-step training or validation loss and downstream performance metrics such as task success rate or generation quality, and the mismatch grows with task length. It arises in autoregressive language modeling, flow-based generative modeling, and robot policy learning. DoPr combines gradient-wise preconditioning, as in Adam and Muon, with activation-wise preconditioning (AP), which is similar to KFAC. The research reports that adding AP works as a drop-in intervention and significantly improves downstream performance across several TTF scenarios. These test-time gains do not consistently coincide with lower validation loss, which raises questions about how to evaluate models trained with one-step supervised objectives.
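To make the idea of test-time feedback concrete, here is a toy illustration that is not taken from the original article. It assumes a scalar system x_{t+1} = a * x_t and a learned model whose coefficient is slightly off. The function names and the numbers are hypothetical. The one-step error, measured on ground-truth states, stays small, while the error of a rollout that consumes its own predictions grows with the horizon.

```python
def one_step_error(a_true, a_model, states):
    """Mean absolute one-step error, evaluated on ground-truth states."""
    return sum(abs(a_model * x - a_true * x) for x in states) / len(states)


def rollout_error(a_true, a_model, x0, horizon):
    """Absolute error after `horizon` steps when the model feeds on its own outputs."""
    x_true, x_model = x0, x0
    for _ in range(horizon):
        x_true = a_true * x_true      # ground-truth dynamics
        x_model = a_model * x_model   # model prediction becomes the next input
    return abs(x_model - x_true)


# Hypothetical numbers: the model coefficient is off by 2%.
a_true, a_model = 1.0, 1.02

# With a_true = 1.0 the true trajectory from x0 = 1.0 stays at 1.0.
true_states = [1.0] * 50
validation_style_error = one_step_error(a_true, a_model, true_states)

# Closed-loop error at increasing horizons.
rollout_errors = {h: rollout_error(a_true, a_model, 1.0, h) for h in (1, 10, 50)}

print("one-step error:", validation_style_error)
for horizon, err in rollout_errors.items():
    print(f"rollout error at horizon {horizon}: {err:.4f}")
```

In this sketch the one-step error is identical at every step, so a validation metric built from it says nothing about how quickly the closed-loop trajectory drifts.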

Key takeaway

Machine learning engineers building models for sequential prediction or generative tasks should consider Double Preconditioning (DoPr) to improve test-time performance directly. If a model exhibits test-time feedback, validation loss alone can be misleading. Activation-wise preconditioning can be added as a drop-in change to the optimizer to reduce error accumulation and improve downstream metrics, even when validation loss does not consistently improve. This gives practitioners an optimization lever aimed at downstream performance rather than validation loss.

Key insights

Optimizing for test-time performance, not just validation loss, is crucial for deep learning models with test-time feedback.

Method

Double Preconditioning (DoPr) integrates gradient-wise preconditioning (e.g., Adam) with activation-wise preconditioning (AP, e.g., KFAC) as a drop-in intervention.
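A minimal sketch of how the two kinds of preconditioning could be composed for a single linear layer y = W x, written with NumPy. The source does not give the algorithm, so everything here is an assumption: the function names, the damping value, the toy regression problem, and in particular the ordering (activation-wise preconditioning first, then an Adam-style update). The activation step right-multiplies the weight gradient by the inverse of the damped input second moment, which is the KFAC-style input factor.

```python
import numpy as np


def activation_preconditioned_grad(grad_w, acts, damping=1e-3):
    """Right-multiply grad_w by the inverse damped second moment of the inputs.

    grad_w: (out_dim, in_dim) gradient for a linear layer y = W @ x
    acts:   (batch, in_dim) input activations seen by that layer
    """
    batch, in_dim = acts.shape
    second_moment = acts.T @ acts / batch              # A = E[x x^T]
    damped = second_moment + damping * np.eye(in_dim)  # keep the solve well-posed
    # grad_w @ inv(damped), computed with a solve; damped is symmetric.
    return np.linalg.solve(damped, grad_w.T).T


def adam_update(grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Standard Adam step (gradient-wise preconditioning); returns the weight delta."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return -lr * m_hat / (np.sqrt(v_hat) + eps)


# Hypothetical toy problem: linear regression with badly scaled inputs.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 3)) * np.array([10.0, 1.0, 0.1])
w_true = rng.normal(size=(2, 3))
y = x @ w_true.T


def loss(w):
    return 0.5 * np.mean(np.sum((x @ w.T - y) ** 2, axis=1))


w = np.zeros((2, 3))
state = {"t": 0, "m": np.zeros_like(w), "v": np.zeros_like(w)}
initial_loss = loss(w)

for _ in range(200):
    err = x @ w.T - y                                # (batch, out_dim)
    grad_w = err.T @ x / len(x)                      # (out_dim, in_dim)
    pre = activation_preconditioned_grad(grad_w, x)  # activation-wise step
    w += adam_update(pre, state, lr=0.05)            # gradient-wise step

final_loss = loss(w)
print("initial loss:", initial_loss, "final loss:", final_loss)
```

The sketch uses `np.linalg.solve` instead of forming an explicit inverse, and the damping term keeps the system solvable when some input directions have very small variance. A practical version would likely track a running estimate of the activation statistics per layer instead of recomputing them from each batch.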

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.