Drifting Preference Optimization for One-Step Generative Models
Summary
Drifting Preference Optimization (DrPO) is an online preference-finetuning method for deterministic one-step text-to-image generators such as SD-Turbo and SDXL-Turbo. Aligning these efficient models with human preferences is difficult for traditional methods, which rely on policy likelihoods or differentiable reward gradients. DrPO samples image candidates for a given prompt, ranks them with a target reward, and synthesizes a feature-space update direction from the high- and low-scoring samples. The update combines a non-parametric dipole preference field with a reference drift from the frozen base generator, and the generator is trained against it through a detached feature-space regression target. Because the target reward is used only for ranking, training works with large, black-box, or non-differentiable rewards, and inference remains a single generator call. Evaluations on SD-Turbo and SDXL-Turbo with the HPSv3 and GenEval benchmarks show improved alignment over existing reward-gradient-free baselines. DrPO also reduces HPSv3 training computation by 3.51× by eliminating reward-model backpropagation.
Key takeaway
For machine learning engineers working on one-step text-to-image generators, DrPO offers a practical route to preference alignment. If you are finetuning models such as SD-Turbo or SDXL-Turbo against complex or non-differentiable reward functions, DrPO is worth considering: it reports improved alignment and a 3.51× reduction in HPSv3 training computation, without giving up single-pass inference.
Key insights
DrPO enables efficient preference finetuning for one-step generative models using black-box rewards and feature-space updates.
Principles
- The reward model is used only for ranking, so it can be a black box.
- Feature-space updates guide preference alignment.
- Reference drift stabilizes finetuning.
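The first principle can be illustrated with a minimal sketch in which the reward is an ordinary callable that is only ever evaluated, never differentiated. The names `rank_candidates` and `bright_pixel_count` are hypothetical and chosen for illustration; the toy reward stands in for a real preference model and is not part of DrPO.

```python
import numpy as np

def rank_candidates(candidates, reward_fn, k):
    """Score candidates with a black-box reward and return the indices of
    the k highest- and k lowest-scoring ones, plus the raw scores.

    reward_fn is only called, never differentiated, so it may be any
    function that returns a number.
    """
    scores = np.array([float(reward_fn(c)) for c in candidates])
    order = np.argsort(scores)      # ascending by score
    high = order[::-1][:k]          # best first
    low = order[:k]                 # worst first
    return high, low, scores

def bright_pixel_count(image):
    """Hypothetical stand-in reward: a hard count, hence non-differentiable."""
    return int(np.sum(image > 0.5))

candidates = [
    np.zeros((2, 2)),
    np.ones((2, 2)),
    np.array([[1.0, 0.0], [0.0, 0.0]]),
]
high, low, scores = rank_candidates(candidates, bright_pixel_count, k=1)
```

Only the ordering of `scores` is used downstream, which is why a large or non-differentiable reward model can be substituted without changing the rest of the training loop.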
Method
DrPO samples candidates for a prompt, ranks them with a target reward, and synthesizes a feature-space update direction from the high- and low-scoring samples. The generator is then trained by regression onto a detached feature-space target.
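The sketch below shows one way the pieces described above could fit together, using NumPy. It is an illustration under stated assumptions, not the authors' implementation: the summary does not give the form of the dipole field or the reference drift, so the kernel-weighted attraction and repulsion, the pull toward the frozen base generator's features, and the parameters `bandwidth`, `step`, and `ref_weight` are all hypothetical.

```python
import numpy as np

def dipole_field(feats, winners, losers, bandwidth=1.0):
    """Assumed dipole form: kernel-weighted pull toward high-scoring
    features minus kernel-weighted pull toward low-scoring features."""
    def weighted_pull(anchors):
        diff = anchors[None, :, :] - feats[:, None, :]      # (N, K, D)
        sq_dist = np.sum(diff ** 2, axis=-1)                # (N, K)
        logits = -sq_dist / (2.0 * bandwidth ** 2)
        logits = logits - logits.max(axis=1, keepdims=True) # numerical stability
        w = np.exp(logits)
        w = w / w.sum(axis=1, keepdims=True)                # softmax over anchors
        return np.sum(w[:, :, None] * diff, axis=1)         # (N, D)
    return weighted_pull(winners) - weighted_pull(losers)

def drpo_style_target(feats, ref_feats, rewards, k=1, step=0.5, ref_weight=0.1):
    """Build a regression target from ranked samples (hypothetical sketch).

    rewards are used only through argsort, i.e. only for ranking.
    """
    order = np.argsort(rewards)
    winners, losers = feats[order[-k:]], feats[order[:k]]
    preference = dipole_field(feats, winners, losers)
    reference_drift = ref_feats - feats   # assumed: pull toward frozen base features
    # The target is treated as a constant during optimization ("detached").
    return feats + step * (preference + ref_weight * reference_drift)

def regression_loss(feats, target):
    """Mean squared distance between current features and the fixed target."""
    return float(np.mean(np.sum((feats - target) ** 2, axis=-1)))

feats = np.array([[0.0], [1.0], [2.0], [3.0]])
rewards = np.array([0.0, 1.0, 2.0, 3.0])
target = drpo_style_target(feats, ref_feats=feats, rewards=rewards)
loss = regression_loss(feats, target)
```

In an autodiff framework the target would be wrapped in a stop-gradient (for example `Tensor.detach()` in PyTorch or `jax.lax.stop_gradient` in JAX), so that gradients reach the generator only through `feats` and never through the reward model.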
In practice
- Apply DrPO to finetune SD-Turbo/SDXL-Turbo.
- Use non-differentiable reward models for alignment.
- Reduce HPSv3 training computation by 3.51× by skipping reward-model backpropagation.
Topics
- Drifting Preference Optimization
- One-Step Generative Models
- Text-to-Image Generation
- Model Finetuning
- Reward Models
- SD-Turbo, SDXL-Turbo
Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.