Are we really tilting? The mechanics of reward guidance in flow and diffusion models
Summary
Reward guidance algorithms steer a learned generative process toward reward-tilted measures during inference but are prone to "reward hacking," where the model over-optimizes the reward at the cost of fidelity. While prior work attributed this to neural reward function complexity or diffusion training biases, this research reveals its fundamental origin: finite-particle plug-in estimation of the Doob h-function, an approximation common in practical reward-guided diffusion implementations. This issue persists even in simple Gaussian and Gaussian mixture targets with quadratic rewards. The analysis identifies two distinct failure modes: within-mode reward hacking and inability to select high-reward modes. A closed-form reward damping schedule is proposed to correct the within-mode bias without additional compute, and best-of-n sampling is clarified as a compensation for mode selection failure. Experiments on Gaussian mixture targets, a 2D checkerboard, and FLUX.1 text-to-image generation confirm these theoretical insights in practical settings.
Key takeaway
For machine learning engineers implementing reward-guided diffusion models, the key point is that reward hacking originates in finite-particle plug-in estimation of the Doob h-function, which causes both within-mode bias and mode selection failure. The proposed closed-form reward damping schedule corrects the within-mode bias at no additional compute, while best-of-n sampling compensates for mode selection failure at the cost of drawing n candidates.
Key insights
Reward hacking in reward-guided diffusion models originates from finite-particle plug-in estimation of the Doob h-function, which produces two distinct failure modes: within-mode reward hacking and an inability to select high-reward modes.
Principles
- Reward hacking stems from plug-in estimation.
- Two failure modes: within-mode and mode selection.
- Damping corrects within-mode bias.
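To make "reward-tilted measure" concrete, here is a minimal sketch of the single-Gaussian, quadratic-reward case mentioned in the summary. It assumes a 1D target N(mu, sigma^2) and a reward r(x) = -(x - a)^2 / (2 tau^2); the names mu, sigma, a, tau and both helper functions are illustrative choices, not notation from the original article. The tilted density p(x) exp(r(x)) is again Gaussian, so its moments are known exactly and can serve as a reference when checking a guided sampler.

```python
import numpy as np

def tilted_gaussian(mu, sigma, a, tau):
    """Exact mean and std of the density proportional to p(x) * exp(r(x)),
    with p = N(mu, sigma^2) and r(x) = -(x - a)^2 / (2 * tau^2).
    A product of two Gaussian factors is Gaussian: precisions add."""
    precision = 1.0 / sigma**2 + 1.0 / tau**2
    mean = (mu / sigma**2 + a / tau**2) / precision
    return mean, np.sqrt(1.0 / precision)

def tilted_moments_by_reweighting(mu, sigma, a, tau, n=200_000, seed=0):
    """Monte Carlo check: draw from p, weight by exp(r), self-normalize."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, sigma, size=n)
    log_w = -(x - a) ** 2 / (2.0 * tau**2)
    w = np.exp(log_w - log_w.max())  # subtract the max for numerical stability
    w /= w.sum()
    mean = np.sum(w * x)
    std = np.sqrt(np.sum(w * (x - mean) ** 2))
    return mean, std

exact_mean, exact_std = tilted_gaussian(mu=0.0, sigma=1.0, a=2.0, tau=1.0)
mc_mean, mc_std = tilted_moments_by_reweighting(mu=0.0, sigma=1.0, a=2.0, tau=1.0)
print("exact tilted mean/std:     ", exact_mean, exact_std)
print("reweighted sample mean/std:", mc_mean, mc_std)
```

With these example values the exact tilted mean sits halfway between the prior mean and the reward optimum. One plausible diagnostic, under the same assumptions, is to compare a guided sampler's empirical mean and spread against these reference moments: samples that sit closer to the reward optimum or are more concentrated than the tilted target would be consistent with within-mode over-optimization.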
Method
A closed-form reward damping schedule corrects within-mode bias in reward-guided diffusion without additional compute. Best-of-n sampling compensates for mode selection failure.
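Best-of-n sampling itself is simple to sketch: draw n candidates, score each with the reward, and keep the highest-scoring one. The example below is a toy illustration only. It assumes a two-mode 1D Gaussian mixture as a stand-in for the sampler's output and a quadratic reward centered on the rarer mode; `sample_mixture`, `quadratic_reward`, and `best_of_n` are hypothetical helpers, and none of this reproduces the article's damping schedule, whose closed form is not given in this summary.

```python
import numpy as np

def sample_mixture(rng, n, means=(-3.0, 3.0), stds=(0.5, 0.5), weights=(0.9, 0.1)):
    """Stand-in sampler (assumption): a 1D two-mode Gaussian mixture."""
    comp = rng.choice(len(means), size=n, p=weights)
    return rng.normal(np.asarray(means)[comp], np.asarray(stds)[comp])

def quadratic_reward(x, target=3.0):
    """Quadratic reward that peaks at the low-probability mode."""
    return -0.5 * (x - target) ** 2

def best_of_n(rng, n, sampler, reward):
    """Draw n candidates and return the one with the highest reward."""
    candidates = sampler(rng, n)
    scores = reward(candidates)
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

rng = np.random.default_rng(0)
trials = 2000
reward_n1 = np.mean([best_of_n(rng, 1, sample_mixture, quadratic_reward)[1]
                     for _ in range(trials)])
reward_n8 = np.mean([best_of_n(rng, 8, sample_mixture, quadratic_reward)[1]
                     for _ in range(trials)])
print("mean reward, n=1:", reward_n1)
print("mean reward, n=8:", reward_n8)
```

In this toy setup a single draw lands in the high-reward mode only about one time in ten, while keeping the best of eight draws reaches it much more often. That is the sense in which best-of-n can compensate for a sampler that fails to select high-reward modes, at the price of generating n candidates per output.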
In practice
- Implement damping schedule for within-mode bias.
- Employ best-of-n sampling for mode selection.
- Applicable to text-to-image models like FLUX.1.
Topics
- Reward Guidance
- Diffusion Models
- Reward Hacking
- Doob h-function
- Damping Schedules
- FLUX.1
Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.