AdaGRPO: A Capability-Aware Adaptive Enhancement for Flow-based GRPO
Summary
AdaGRPO (Adaptive Group Relative Policy Optimization) improves text-to-image (T2I) flow models by addressing two blind spots in existing GRPO frameworks: random prompt sampling and myopic advantage estimation, which cause training instability and suboptimal human preference alignment. AdaGRPO introduces two main components. The first is an Online Curriculum Filtering Strategy that dynamically tracks the model's proficiency and selects prompts at its current learning boundary. The second is Cross-Level Advantage Fusion, which integrates fine-grained intra-group advantages with macro-level global advantages for a more comprehensive policy evaluation. The module is lightweight and plug-and-play, and it integrates with frameworks such as Flow-GRPO, DanceGRPO, and Flow-CPS. Experiments on Flux.1-dev with the HPD dataset, run on 8× NVIDIA H200 GPUs, show that AdaGRPO consistently improves metrics such as HPS-v2/v3, ImageReward, and UniGenBench while significantly stabilizing training.
Key takeaway
If you fine-tune flow-based text-to-image models with GRPO, consider integrating AdaGRPO to improve training stability and generation quality. Existing GRPO implementations likely suffer from suboptimal prompt selection and biased advantage estimation. AdaGRPO's adaptive prompt filtering and cross-level advantage fusion provide a more robust optimization signal, which leads to better human preference alignment. It adds roughly 20% computational overhead per iteration, a cost the reported performance gains justify.
Key insights
AdaGRPO enhances text-to-image flow models by dynamically matching training prompts to model capability and fusing local and global advantage estimates.
Principles
- RL performance improves with capability-matched data selection.
- Global context is crucial for accurate policy advantage estimation.
- Curriculum learning stabilizes reinforcement learning optimization.
Method
AdaGRPO employs an Online Curriculum Filtering Strategy using an Exponential Moving Average (EMA) of ODE rewards to select prompts at the model's learning boundary. Cross-Level Advantage Fusion then combines intra-group and global advantages for policy updates.
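The fusion step can be illustrated with a minimal sketch. This summary does not give the fusion formula, so everything specific here is an assumption: the intra-group advantage is the usual GRPO z-score within one prompt's group, the global advantage is a z-score against all rewards in the batch, and the two are mixed with a hypothetical weight `alpha`. The function names `zscore` and `fused_advantages` are illustrative only.

```python
from statistics import fmean, pstdev


def zscore(values, mean, std, eps=1e-6):
    """Normalize values against a given mean and standard deviation."""
    return [(v - mean) / (std + eps) for v in values]


def fused_advantages(groups, alpha=0.5, eps=1e-6):
    """Mix intra-group and batch-wide advantages (assumed weighted sum).

    groups: list of reward lists, one inner list per prompt.
    alpha:  hypothetical mixing weight; 1.0 keeps only the intra-group term.
    """
    all_rewards = [r for group in groups for r in group]
    global_mean = fmean(all_rewards)
    global_std = pstdev(all_rewards)

    fused = []
    for rewards in groups:
        # Local term: each sample compared with its own group (standard GRPO).
        local = zscore(rewards, fmean(rewards), pstdev(rewards), eps)
        # Global term: each sample compared with every sample in the batch.
        glob = zscore(rewards, global_mean, global_std, eps)
        fused.append([alpha * l + (1 - alpha) * g for l, g in zip(local, glob)])
    return fused


if __name__ == "__main__":
    batch = [[0.1, 0.3], [0.7, 0.9]]  # a low-reward group and a high-reward group
    print(fused_advantages(batch, alpha=1.0))
    print(fused_advantages(batch, alpha=0.5))
```

With `alpha=1.0` both groups receive the same advantages of approximately -1 and +1, even though the second group scores much higher in absolute terms. Lowering `alpha` lets the batch-wide term separate the two groups, which is the kind of macro-level signal the fusion is described as adding.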
In practice
- Maintain an EMA of historical rewards to track model proficiency.
- Filter training prompts to match the model's current capability.
- Combine local and global advantage signals for robust gradients.
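The first two practices above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a per-prompt EMA, rewards scaled to [0, 1], and a fixed band (`low`, `high`) standing in for the "learning boundary". The class name `PromptCurriculum` and the parameters `decay`, `low`, and `high` are hypothetical.

```python
class PromptCurriculum:
    """Track a per-prompt EMA of rewards and keep prompts inside a target band."""

    def __init__(self, decay=0.9, low=0.2, high=0.8):
        self.decay = decay  # weight given to the previous EMA value
        self.low = low      # below this the prompt is treated as too hard (assumed)
        self.high = high    # above this the prompt is treated as mastered (assumed)
        self.ema = {}       # prompt -> running EMA of its reward

    def update(self, prompt, reward):
        """Fold a new reward into the prompt's EMA and return the new value."""
        previous = self.ema.get(prompt)
        if previous is None:
            # First observation initializes the average.
            self.ema[prompt] = reward
        else:
            self.ema[prompt] = self.decay * previous + (1 - self.decay) * reward
        return self.ema[prompt]

    def select(self, prompts):
        """Keep unseen prompts and prompts whose EMA lies in [low, high]."""
        return [
            p for p in prompts
            if p not in self.ema or self.low <= self.ema[p] <= self.high
        ]


if __name__ == "__main__":
    curriculum = PromptCurriculum(decay=0.5)
    curriculum.update("a cat on a skateboard", 1.0)
    curriculum.update("a cat on a skateboard", 0.0)
    curriculum.update("a red cube", 0.95)
    print(curriculum.select(["a cat on a skateboard", "a red cube", "a new prompt"]))
```

Unseen prompts are kept in this sketch so that each one is scored at least once before it can be filtered. A fixed band is the simplest choice; a band that moves with the model's overall average reward would be another option under the same idea.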
Topics
- Group Relative Policy Optimization
- Text-to-Image Generation
- Flow Models
- Reinforcement Learning
- Curriculum Learning
- Advantage Estimation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.