ReGRPO: Reflection-Augmented Policy Optimization for Tool-Using Agents

2026-06-30 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

ReGRPO (Reflection-augmented Group Relative Policy Optimization) is a training framework for making tool-augmented vision-language models (VLMs) more robust in multimodal, multi-step tasks. It targets two limitations of existing approaches: supervised fine-tuning (SFT) relies on successful trajectories, and reinforcement learning (RL) with sparse rewards gives too little guidance for recovering from tool failures. ReGRPO introduces a structured reflective data engine that collects grounded failure observations from near-miss actions. These observations are used to construct Reflection-of-Thought (RoT) triplets (ErrorType, Evidence, FixPlan), each paired with a corrected action for warm-start SFT. The framework then optimizes reflection tokens and corrective actions within local trajectories using group-relative advantages, while a reflection-cost term discourages unnecessary reflection. On the GTA and GAIA benchmarks, ReGRPO is reported to consistently outperform strong open-source baselines and to perform better than other open-source controllers that use the same backbones and tool suites.

Key takeaway

For machine learning engineers building tool-augmented VLMs, ReGRPO offers a way to improve agent reliability and failure recovery. If your current SFT or RL pipeline struggles with tool failures, consider adding structured reflective data collection and jointly optimizing reflection tokens with corrective actions. The method targets the fragility that tool failures expose in existing systems, with the aim of more resilient tool-using agents in complex multimodal tasks.

Key insights

ReGRPO improves tool-using VLMs by learning reflection-guided recovery from failures using structured error data and joint optimization.

Principles

Structured reflection data improves agent recovery.
Jointly optimize reflection and corrective actions.
Cost term reduces unnecessary reflection.

Method

ReGRPO collects grounded failure observations from near-miss actions, builds Reflection-of-Thought (RoT) triplets (ErrorType, Evidence, FixPlan), and pairs each with a corrected action for warm-start SFT. It then optimizes reflection tokens and corrective actions using group-relative advantages, with a reflection-cost term that discourages unnecessary reflection.
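The summary does not give the exact objective, so the following is only a minimal sketch of how group-relative advantages and a reflection-cost term could be combined. It assumes the standard GRPO normalization (each rollout's reward minus the group mean, divided by the group standard deviation) and a linear per-token penalty on reflection length. The function name, the `cost_per_token` parameter, and the linear form of the penalty are illustrative assumptions, not details taken from the paper or the showlab/ReGRPO repository.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, reflection_tokens,
                              cost_per_token=0.01, eps=1e-8):
    """Sketch: GRPO-style advantages with a hypothetical reflection cost.

    rewards           -- task reward of each rollout in one sampled group
    reflection_tokens -- number of reflection tokens each rollout emitted
    cost_per_token    -- assumed linear penalty per reflection token
    """
    if len(rewards) != len(reflection_tokens):
        raise ValueError("rewards and reflection_tokens must align")

    # Subtract the (assumed) reflection cost from each rollout's reward.
    shaped = [r - cost_per_token * n
              for r, n in zip(rewards, reflection_tokens)]

    # Normalize within the group: no learned value function is needed.
    mu = mean(shaped)
    sigma = pstdev(shaped)
    return [(s - mu) / (sigma + eps) for s in shaped]
```

Under these assumptions, two rollouts that both solve the task receive different advantages if one of them reflected more, which is the pressure the reflection-cost term is described as providing.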

In practice

Collect near-miss actions as grounded failure data.
Generate ErrorType, Evidence, FixPlan triplets and pair each with a corrected action.
Apply a reflection-cost term to discourage unnecessary reflection.
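To make the data side concrete, here is a hedged sketch of how an RoT triplet and its corrected action might be packed into one warm-start SFT example. The three field names mirror the triplet described above; the `RoTTriplet` class, the `build_sft_example` helper, the `<reflect>` tags, and the prompt layout are all hypothetical and are not taken from the showlab/ReGRPO code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoTTriplet:
    """Reflection-of-Thought triplet (field names follow the summary)."""
    error_type: str
    evidence: str
    fix_plan: str


def build_sft_example(failed_action, observation, triplet, corrected_action):
    """Pair a failure and its reflection with the corrected action.

    The text layout and the <reflect> tags are illustrative assumptions.
    """
    # Context the model sees: the near-miss action and what the tool returned.
    prompt = f"Action: {failed_action}\nObservation: {observation}\n"
    # Supervision target: structured reflection, then the corrected action.
    target = (
        "<reflect>\n"
        f"ErrorType: {triplet.error_type}\n"
        f"Evidence: {triplet.evidence}\n"
        f"FixPlan: {triplet.fix_plan}\n"
        "</reflect>\n"
        f"Action: {corrected_action}"
    )
    return {"prompt": prompt, "target": target}
```

Keeping the reflection as three labeled fields, rather than free text, makes it straightforward to check that each example names an error, cites evidence from the observation, and proposes a fix before it enters the training set.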

Topics

Tool-Augmented VLMs
Policy Optimization
Reflection-of-Thought
Error Recovery
Multimodal Agents
Reinforcement Learning

Code references

showlab/ReGRPO

Best for: Research Scientist, AI Engineer, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.