ReGRPO: Reflection-Augmented Policy Optimization for Tool-Using Agents
Summary
ReGRPO (Reflection-augmented Group Relative Policy Optimization) is a framework for making tool-augmented vision-language models (VLMs) more robust in multimodal, multi-step tasks. It targets two limitations of existing approaches: supervised fine-tuning relies on successful trajectories, and reinforcement learning provides sparse rewards, so neither offers much guidance for recovering from tool failures. ReGRPO introduces a structured reflective data engine that collects grounded failure observations from near-miss actions. These observations are used to construct Reflection-of-Thought (RoT) triplets, each comprising ErrorType, Evidence, and FixPlan, which are paired with corrected actions for warm-start supervised fine-tuning. The framework then optimizes reflection tokens and corrective actions within local trajectories using group-relative advantages, while a reflection-cost term discourages unnecessary reflection. Evaluations on the GTA and GAIA benchmarks show ReGRPO consistently surpassing strong open-source baselines, with the best performance among open-source controllers that use identical backbones and tool suites.
Key takeaway
For machine learning engineers developing tool-augmented vision-language models, ReGRPO offers a framework for improving agent reliability and failure recovery. If your current SFT or RL approaches struggle with tool failures, consider incorporating structured reflective data collection and joint optimization of reflection tokens and corrective actions. The method targets the fragility of existing systems when tools fail, with the aim of more resilient tool-using agents in complex multimodal tasks.
Key insights
ReGRPO improves tool-using VLMs by learning reflection-guided recovery from failures using structured error data and joint optimization.
Principles
- Structured reflection data improves agent recovery.
- Jointly optimize reflection and corrective actions.
- A reflection-cost term reduces unnecessary reflection.
Method
ReGRPO collects failure observations, builds Reflection-of-Thought (RoT) triplets (ErrorType, Evidence, FixPlan) with corrected actions for SFT, then optimizes reflection tokens and actions using group-relative advantages and a reflection-cost term.
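To make the optimization step concrete, here is a minimal sketch of group-relative advantages combined with a reflection cost. It uses the usual GRPO-style normalization (each rollout's reward minus the group mean, divided by the group standard deviation). The summary does not give the form of the reflection-cost term, so the linear per-token penalty, the function name `group_relative_advantages`, and the `cost_per_token` parameter are all assumptions made for illustration, not the paper's formulation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, reflection_tokens, cost_per_token=0.01, eps=1e-8):
    """Return one advantage per rollout in a sampled group.

    rewards: task reward for each rollout in the group.
    reflection_tokens: number of reflection tokens each rollout emitted.
    cost_per_token: ASSUMED linear penalty per reflection token (hypothetical).
    """
    if len(rewards) != len(reflection_tokens):
        raise ValueError("rewards and reflection_tokens must have the same length")

    # Charge each rollout for the reflection it used (assumed linear cost).
    shaped = [r - cost_per_token * n for r, n in zip(rewards, reflection_tokens)]

    # Normalize against the group's own statistics instead of a learned critic.
    mu = mean(shaped)
    sigma = pstdev(shaped)
    return [(s - mu) / (sigma + eps) for s in shaped]


# Four rollouts: two succeed, two fail; half of them reflect for 40 tokens.
advantages = group_relative_advantages([1.0, 1.0, 0.0, 0.0], [0, 40, 0, 40])
```

Under this assumed penalty, a rollout that succeeds without reflecting ranks above one that succeeds after reflecting, and reflection that still ends in failure ranks lowest.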
In practice
- Collect grounded failure observations from near-miss actions.
- Build ErrorType, Evidence, FixPlan triplets and pair them with corrected actions.
- Apply a reflection-cost term to discourage unnecessary reflection.
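The data-collection steps above could be represented roughly as follows. This is a sketch only: the three field names come from the RoT triplet described in the summary, but the `RoTTriplet` class, the `build_sft_example` helper, the `<reflect>` delimiters, the prompt layout, and the sample tool call are hypothetical choices, not the paper's actual data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoTTriplet:
    """Reflection-of-Thought triplet: what went wrong, the proof, and the fix."""
    error_type: str
    evidence: str
    fix_plan: str


def build_sft_example(failed_action, observation, triplet, corrected_action):
    """Pair a failure and its reflection with the corrected action.

    The prompt/target layout and the <reflect> tags are ASSUMED for illustration.
    """
    prompt = f"Action: {failed_action}\nObservation: {observation}\n"
    target = (
        "<reflect>\n"
        f"ErrorType: {triplet.error_type}\n"
        f"Evidence: {triplet.evidence}\n"
        f"FixPlan: {triplet.fix_plan}\n"
        "</reflect>\n"
        f"Action: {corrected_action}"
    )
    return {"prompt": prompt, "target": target}


# Hypothetical near-miss: a tool call with a wrong argument name.
example = build_sft_example(
    failed_action='ocr(img="receipt.png")',
    observation="TypeError: unexpected keyword argument 'img'",
    triplet=RoTTriplet(
        error_type="InvalidArgument",
        evidence="Tool rejected keyword 'img'",
        fix_plan="Retry with the 'image' keyword",
    ),
    corrected_action='ocr(image="receipt.png")',
)
```

Keeping the evidence field tied to the observed tool output is what makes the reflection grounded rather than free-form self-critique.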
Topics
- Tool-Augmented VLMs
- Policy Optimization
- Reflection-of-Thought
- Error Recovery
- Multimodal Agents
- Reinforcement Learning
Best for: Research Scientist, AI Engineer, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.