AdaGRPO: A Capability-Aware Adaptive Enhancement for Flow-based GRPO

2026-06-08 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Generative AI · Depth: Expert, extended

Summary

AdaGRPO, an Adaptive Group Relative Policy Optimization algorithm, improves text-to-image (T2I) flow models by addressing two weaknesses in existing GRPO frameworks: random prompt sampling and myopic advantage estimation, which together cause training instability and suboptimal human preference alignment. AdaGRPO introduces two components. The Online Curriculum Filtering Strategy dynamically tracks the model's proficiency and selects prompts at its current learning boundary. Cross-Level Advantage Fusion integrates fine-grained intra-group advantages with macro-level global advantages for a more comprehensive policy evaluation. The module is lightweight and plug-and-play, and it integrates with frameworks such as Flow-GRPO, DanceGRPO, and Flow-CPS. Experiments on Flux.1-dev with the HPD dataset, run on 8× NVIDIA H200 GPUs, show that AdaGRPO consistently improves scores on HPS-v2/v3, ImageReward, and UniGenBench while significantly stabilizing training.

Key takeaway

Machine learning engineers fine-tuning flow-based text-to-image models with GRPO should consider integrating AdaGRPO to improve training stability and generation quality. Existing GRPO implementations sample prompts at random and estimate advantages only within each group, which leads to suboptimal prompt selection and biased advantage estimates. AdaGRPO's adaptive prompt filtering and cross-level advantage fusion provide a more robust optimization signal and better human preference alignment. It adds roughly 20% computational overhead per iteration, a cost the reported performance gains justify.

Key insights

AdaGRPO enhances text-to-image flow models by dynamically matching training prompts to model capability and fusing local and global advantage estimates.

Principles

RL performance improves with capability-matched data selection.
Global context is crucial for accurate policy advantage estimation.
Curriculum learning stabilizes reinforcement learning optimization.

Method

AdaGRPO employs an Online Curriculum Filtering Strategy using an Exponential Moving Average (EMA) of ODE rewards to select prompts at the model's learning boundary. Cross-Level Advantage Fusion then combines intra-group and global advantages for policy updates.
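
As a rough illustration of the fusion idea, the sketch below computes a per-group (intra-group) advantage and a batch-wide (global) advantage, then mixes them. The linear mixing rule, the weight `lam`, the `eps` stabilizer, and the function name `fused_advantages` are assumptions made for this example; the summary does not state how AdaGRPO actually combines the two signals.

```python
import numpy as np

def fused_advantages(rewards, lam=0.5, eps=1e-6):
    """Mix intra-group and global advantages (hypothetical linear fusion).

    rewards: array of shape (num_prompts, group_size), one row per prompt.
    lam:     assumed mixing weight; 0 gives pure intra-group, 1 pure global.
    """
    rewards = np.asarray(rewards, dtype=np.float64)

    # Intra-group advantage: standardize each reward within its own prompt group.
    group_mean = rewards.mean(axis=1, keepdims=True)
    group_std = rewards.std(axis=1, keepdims=True)
    local = (rewards - group_mean) / (group_std + eps)

    # Global advantage: standardize against every sample in the batch.
    global_adv = (rewards - rewards.mean()) / (rewards.std() + eps)

    # Assumed fusion rule: convex combination of the two signals.
    return (1.0 - lam) * local + lam * global_adv

batch = [[0.1, 0.2, 0.3],   # a prompt the model currently scores poorly on
         [0.7, 0.8, 0.9]]   # a prompt the model already handles well
print(fused_advantages(batch, lam=0.5))
```

With `lam=0` both rows receive identical advantages, because each group has the same internal spread. With `lam>0` the low-reward group is pushed down relative to the high-reward group, which is the global context that the intra-group estimate alone cannot see.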

In practice

Implement EMA of historical rewards to track model proficiency.
Filter training prompts to match the model's current capability.
Combine local and global advantage signals for robust gradients.
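
The first two steps above can be sketched as a small per-prompt tracker. This is a minimal illustration, not the paper's procedure: the class name `PromptCurriculum`, the decay value, the assumption that rewards are scaled to [0, 1], and the fixed `low`/`high` band used to approximate the "learning boundary" are all hypothetical choices.

```python
from dataclasses import dataclass, field

@dataclass
class PromptCurriculum:
    """Tracks an EMA of mean reward per prompt and filters by a reward band."""
    decay: float = 0.9   # assumed EMA decay
    low: float = 0.3     # assumed lower bound: below this the prompt is too hard
    high: float = 0.7    # assumed upper bound: above this the prompt is too easy
    ema: dict = field(default_factory=dict)

    def update(self, prompt_id, mean_reward):
        """Fold the latest mean group reward for a prompt into its EMA."""
        prev = self.ema.get(prompt_id)
        if prev is None:
            self.ema[prompt_id] = mean_reward
        else:
            self.ema[prompt_id] = self.decay * prev + (1.0 - self.decay) * mean_reward

    def select(self, prompt_ids):
        """Keep prompts whose EMA lies inside the band.

        Prompts with no history are kept so they are scored at least once.
        """
        return [
            p for p in prompt_ids
            if p not in self.ema or self.low <= self.ema[p] <= self.high
        ]

curriculum = PromptCurriculum()
curriculum.update("easy", 0.95)
curriculum.update("hard", 0.05)
curriculum.update("mid", 0.50)
print(curriculum.select(["easy", "hard", "mid", "unseen"]))
```

A fixed band is the simplest possible stand-in for capability matching. An adaptive band, for example one centered on the running batch mean, would track the model as it improves, but the summary does not say which criterion AdaGRPO uses.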

Topics

Group Relative Policy Optimization
Text-to-Image Generation
Flow Models
Reinforcement Learning
Curriculum Learning
Advantage Estimation

Code references

black-forest-labs/flux

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.