AdaGRPO: A Capability-Aware Adaptive Enhancement for Flow-based GRPO

· Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Generative AI · Depth: Expert, extended

Summary

AdaGRPO (Adaptive Group Relative Policy Optimization) improves GRPO fine-tuning of text-to-image (T2I) flow models by addressing two blind spots in existing GRPO frameworks: random prompt sampling and myopic advantage estimation, which cause training instability and suboptimal human preference alignment. It introduces two components. The Online Curriculum Filtering Strategy dynamically tracks the model's proficiency to select prompts at its current learning boundary. Cross-Level Advantage Fusion integrates fine-grained intra-group advantages with macro-level global advantages for a more comprehensive policy evaluation. The module is lightweight and plug-and-play, integrating with frameworks such as Flow-GRPO, DanceGRPO, and Flow-CPS. Experiments on Flux.1-dev with the HPD dataset, run on 8× NVIDIA H200 GPUs, show that AdaGRPO consistently improves performance on metrics such as HPS-v2/v3, ImageReward, and UniGenBench while stabilizing training.

Key takeaway

Machine learning engineers fine-tuning flow-based text-to-image models with GRPO should consider integrating AdaGRPO to improve training stability and generation quality. Existing GRPO implementations likely suffer from suboptimal prompt selection and biased advantage estimation. AdaGRPO's adaptive prompt filtering and cross-level advantage fusion provide a more robust optimization signal and better human preference alignment. It adds roughly 20% computational overhead per iteration, a cost the reported performance gains are likely to justify.

Key insights

AdaGRPO improves text-to-image flow models by dynamically matching training prompts to model capability and by fusing local (intra-group) and global advantage estimates.

Method

AdaGRPO employs an Online Curriculum Filtering Strategy using an Exponential Moving Average (EMA) of ODE rewards to select prompts at the model's learning boundary. Cross-Level Advantage Fusion then combines intra-group and global advantages for policy updates.
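
The two components can be illustrated with a minimal Python sketch. It is not the paper's implementation: the EMA update rule, the `low`/`high` band used to define the "learning boundary", the linear blend weight `lam`, and every function name below are assumptions chosen for illustration, since this summary does not give the exact formulas.

```python
from statistics import fmean, pstdev

EPS = 1e-6  # avoids division by zero when every reward in a set is identical


def update_ema(ema, reward, beta=0.9):
    """Update a prompt's exponential moving average of reward.

    Assumed form: ema <- beta * ema + (1 - beta) * reward.
    The first observation seeds the average.
    """
    return reward if ema is None else beta * ema + (1.0 - beta) * reward


def at_learning_boundary(ema, low=0.3, high=0.7):
    """Hypothetical filter: keep prompts whose tracked reward lies in [low, high].

    Prompts with no history yet are kept so they can be scored at least once.
    """
    return ema is None or low <= ema <= high


def normalize(values, reference):
    """Standardize `values` against the mean and std of `reference`."""
    mu, sigma = fmean(reference), pstdev(reference)
    return [(v - mu) / (sigma + EPS) for v in values]


def fused_advantages(groups, lam=0.5):
    """Blend intra-group and batch-wide (global) advantages.

    `groups` holds one list of rewards per prompt. `lam` weights the
    intra-group term; (1 - lam) weights the global term (assumed linear blend).
    """
    all_rewards = [r for g in groups for r in g]
    fused = []
    for g in groups:
        intra = normalize(g, g)            # relative to samples of the same prompt
        glob = normalize(g, all_rewards)   # relative to every sample in the batch
        fused.append([lam * a + (1.0 - lam) * b for a, b in zip(intra, glob)])
    return fused
```

With `lam=1.0` the sketch reduces to purely group-relative advantages, so two prompts with rewards `[0.0, 1.0]` and `[2.0, 3.0]` receive identical advantages even though the second prompt scores higher overall. Lowering `lam` lets the global term restore that cross-prompt ordering, which is one plausible reading of what fusing the two levels is meant to achieve.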

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.