Drifting Preference Optimization for One-Step Generative Models

· Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, medium

Summary

Drifting Preference Optimization (DrPO) is an online preference-finetuning method for deterministic one-step text-to-image generators such as SD-Turbo and SDXL-Turbo. Aligning these efficient models with human preferences is hard for traditional methods, which rely on policy likelihoods or differentiable reward gradients. For a given prompt, DrPO samples image candidates, ranks them with a target reward, and synthesizes a feature-space update direction from the high- and low-scoring samples. The update combines a non-parametric dipole preference field with a reference drift from the frozen base generator, and it is optimized through a detached feature-space regression target. Because the target reward is used only for ranking, training works with large, black-box, or non-differentiable rewards, and inference remains a single generator call. Evaluations on SD-Turbo and SDXL-Turbo with the HPSv3 and GenEval benchmarks show improved alignment over existing reward-gradient-free baselines. Eliminating reward-model backpropagation also reduces HPSv3 training computation by 3.51×.

Key takeaway

For machine learning engineers working on one-step text-to-image generators, DrPO offers a route to preference alignment that needs only reward rankings. If you are finetuning models like SD-Turbo or SDXL-Turbo against complex or non-differentiable reward functions, DrPO is worth considering: the summary reports improved alignment and a 3.51× reduction in HPSv3 training computation, while the model keeps its single-pass inference.

Key insights

DrPO enables efficient preference finetuning for one-step generative models using black-box rewards and feature-space updates.

Principles

Method

DrPO samples candidates for a prompt, ranks them with a target reward, and synthesizes a feature-space update direction from the high- and low-scoring samples. The generator is then trained by regression toward a detached feature-space target.
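The steps above can be illustrated with a toy sketch. It is not the paper's formulation: the summary does not give the exact form of the dipole field or the reference drift, so the Gaussian kernel, the "attract toward winners minus pull toward losers" dipole, and every name and parameter here (`rank_split`, `kernel_pull`, `drifted_targets`, `step`, `ref_weight`, `bandwidth`) are assumptions made for illustration. Features are plain Python lists so the example runs without any ML framework.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def rank_split(rewards: Sequence[float], top_frac: float = 0.5) -> Tuple[List[int], List[int]]:
    """Split candidate indices into high- and low-scoring groups by rank only."""
    if len(rewards) < 2:
        raise ValueError("need at least two candidates to form both groups")
    order = sorted(range(len(rewards)), key=lambda i: rewards[i], reverse=True)
    k = min(max(1, int(len(order) * top_frac)), len(order) - 1)
    return order[:k], order[k:]


def kernel_pull(x: Vector, anchors: Sequence[Vector], bandwidth: float) -> Vector:
    """Gaussian-kernel-weighted mean displacement from x toward the anchors."""
    weights = [
        math.exp(-sum((a_d - x_d) ** 2 for a_d, x_d in zip(a, x)) / (2.0 * bandwidth ** 2))
        for a in anchors
    ]
    total = sum(weights)
    if total == 0.0:  # every anchor is too far away for the kernel to register
        return [0.0] * len(x)
    return [
        sum(w * (a[d] - x[d]) for w, a in zip(weights, anchors)) / total
        for d in range(len(x))
    ]


def drifted_targets(
    feats: Sequence[Vector],
    ref_feats: Sequence[Vector],
    rewards: Sequence[float],
    step: float = 0.5,
    ref_weight: float = 0.1,
    bandwidth: float = 1.0,
) -> List[Vector]:
    """Build one regression target per candidate (assumed form, see lead-in)."""
    winners, losers = rank_split(rewards)  # the reward is used for ranking only
    pos = [feats[i] for i in winners]
    neg = [feats[i] for i in losers]
    targets = []
    for x, x_ref in zip(feats, ref_feats):
        attract = kernel_pull(x, pos, bandwidth)  # pull toward high-scoring samples
        repel = kernel_pull(x, neg, bandwidth)    # pull toward low-scoring samples
        # Assumed dipole: attraction minus the pull toward losers, plus a small
        # drift toward the frozen base generator's feature for the same input.
        drift = [
            a - r + ref_weight * (xr - xi)
            for a, r, xr, xi in zip(attract, repel, x_ref, x)
        ]
        targets.append([xi + step * d for xi, d in zip(x, drift)])
    return targets


def regression_loss(feats: Sequence[Vector], targets: Sequence[Vector]) -> float:
    """Mean squared error between current features and fixed targets."""
    sq_errors = [(f - t) ** 2 for fv, tv in zip(feats, targets) for f, t in zip(fv, tv)]
    return sum(sq_errors) / len(sq_errors)
```

In an autodiff framework the targets would be computed without gradient tracking (the "detached" part), so only the generator's current features receive gradients from the regression loss. The reward values never enter the drift itself, only the winner/loser split, which is why a black-box or non-differentiable scorer would suffice in this sketch.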

Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.