The Reward Was in Your Data All Along: Correcting Flow Matching with Discriminator-Guided RL
Summary
Discriminator-Guided RL (DRL) corrects flow matching models by addressing a structural mismatch: the standard matching loss is a poor proxy for visual realism and coherent object structure at inference. DRL trains a discriminator in a pretrained representation space to distinguish real data from base-model samples, then uses its logit as the reward for KL-regularized reinforcement learning. This avoids the need for expensive human preference data. Across SiT, JiT, REPA, and RAE backbones, DRL reduces guidance-free FID (e.g., from 9.38 to 2.62 on SiT) and semantic-space FD (e.g., from 88.2 to 19.3 on DINOv3 for SiT). It also improves human-preference reward scores without being trained on them, and it yields a better Pareto frontier for alignment and artifact reduction.
Key takeaway
For machine learning engineers building generative models, especially flow matching models, DRL offers a way to improve sample quality and realism. Because the reward comes from a discriminator in a pretrained representation space, it can improve image fidelity, reduce artifacts, and strengthen semantic coherence without costly human preference data. Consider adding DRL to your training pipeline as a correction step for a base model.
Key insights
DRL corrects flow matching models by using a discriminator in a pretrained representation space to provide a data-aligned reward, which addresses the limitations of the L2 matching loss.
Principles
- L2 regression (matching) losses are a poor proxy for visual and semantic quality.
- RL with an aligned reward can directly optimize for sample quality.
- Discriminators in pretrained spaces offer human-preference-free rewards.
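One standard way to see why a discriminator logit can act as a data-aligned reward is sketched below. It assumes an idealized optimal discriminator trained with balanced binary cross-entropy. The symbols ($p_{\text{data}}$, $p_{\text{ref}}$, $D^*$, $\beta$) are notation chosen for this sketch and are not taken from the original article.

```latex
% Optimal discriminator between real data and base-model samples:
\sigma\big(D^*(x)\big) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_{\text{ref}}(x)}
\quad\Longrightarrow\quad
D^*(x) = \log \frac{p_{\text{data}}(x)}{p_{\text{ref}}(x)}

% KL-regularized RL objective with reward r and its closed-form maximizer:
\max_{p}\; \mathbb{E}_{x \sim p}\big[r(x)\big] - \beta\, \mathrm{KL}\big(p \,\|\, p_{\text{ref}}\big)
\quad\Longrightarrow\quad
p^*(x) \propto p_{\text{ref}}(x)\, \exp\!\big(r(x)/\beta\big)

% Plugging in r = D^* with \beta = 1 recovers the data distribution:
p^*(x) \propto p_{\text{ref}}(x)\,\frac{p_{\text{data}}(x)}{p_{\text{ref}}(x)} = p_{\text{data}}(x)
```

In this idealized setting the logit is a log density ratio, so maximizing it under a KL penalty pulls the model toward the data distribution. A real discriminator operating on pretrained features only approximates this.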
Method
DRL trains a discriminator in a pretrained representation space to distinguish real data from model samples, then uses its logit as the reward for KL-regularized reinforcement learning.
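The two steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: features are assumed to be precomputed by a frozen pretrained encoder, the discriminator is a plain logistic regression, and the names `train_discriminator`, `reward`, and `kl_regularized_objective` are hypothetical. How per-sample log-probabilities are obtained for a flow matching model is outside the scope of this sketch.

```python
import numpy as np


def train_discriminator(real_feats, fake_feats, steps=500, lr=0.1):
    """Fit a logistic-regression discriminator on frozen features.

    Returns (w, b) so that logit(x) = x @ w + b is higher for
    features that look like real data.
    """
    x = np.concatenate([real_feats, fake_feats], axis=0)
    y = np.concatenate([np.ones(len(real_feats)), np.zeros(len(fake_feats))])
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        logits = x @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad = probs - y  # gradient of binary cross-entropy w.r.t. the logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def reward(feats, w, b):
    """Discriminator logit, used as a scalar reward per sample."""
    return feats @ w + b


def kl_regularized_objective(rewards, logp_policy, logp_ref, beta=1.0):
    """Monte Carlo estimate of E[r] - beta * KL(policy || ref).

    All arrays are evaluated on samples drawn from the current policy.
    """
    return float(np.mean(rewards - beta * (logp_policy - logp_ref)))


# Toy stand-ins for encoder features (assumption: real data is shifted).
rng = np.random.default_rng(0)
real = rng.normal(loc=1.0, size=(256, 8))
fake = rng.normal(loc=0.0, size=(256, 8))
w, b = train_discriminator(real, fake)
print("mean reward on real features:", reward(real, w, b).mean())
print("mean reward on base-model features:", reward(fake, w, b).mean())
```

The KL term keeps the fine-tuned model close to the base model, which limits how far it can drift to exploit weaknesses in the discriminator.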
In practice
- Apply DRL to improve guidance-free FID and semantic FD.
- Improve human-preference alignment without training on preference rewards directly.
- Reduce low-level artifacts like oversaturation in generated images.
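FID and semantic-space FD are both Fréchet distances between Gaussian fits of real and generated feature sets; they differ only in the encoder that produces the features. The sketch below shows the generic formula, assuming features have already been extracted. The function name `frechet_distance` is hypothetical, and this is not the evaluation code used in the original work.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussian fits of two feature sets.

    d^2 = ||mu_a - mu_b||^2 + tr(C_a) + tr(C_b) - 2 tr((C_a C_b)^{1/2})
    Inputs are arrays of shape (num_samples, feature_dim), feature_dim >= 2.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # tr((C_a C_b)^{1/2}) equals the sum of square roots of the
    # eigenvalues of C_a @ C_b, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


# Toy usage with random stand-in features (assumption: 16-dim features).
rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 16))
generated = rng.normal(loc=0.3, size=(1000, 16))
print("Fréchet distance:", frechet_distance(real, generated))
```

Lower is better for both metrics. Clipping small negative eigenvalues guards against numerical noise when a covariance matrix is near-singular.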
Topics
- Flow Matching
- Reinforcement Learning
- Generative Models
- Discriminator Networks
- Image Fidelity
- Computer Vision
- Fréchet Inception Distance
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.