Iterative Visual Thinking: Teaching Vision-Language Models Spatial Self-Correction through Visual Feedback
Summary
The Iterative Visual Thinking (IVT) framework addresses a critical limitation of Vision-Language Models (VLMs): they cannot self-correct their spatial grounding predictions. Naive iteration causes Acc@0.5 to collapse from 79.6% to 48.7%. IVT introduces a closed loop in which a VLM predicts a bounding box, observes the box rendered on the image, and refines it using that visual feedback. A two-phase training recipe closes this self-correction gap: first, a teacher VLM generates corrective reasoning from the base model's errors, with no human annotation; second, Group Relative Policy Optimization (GRPO) with an IoU reward trains the refinement behavior. On a mixed benchmark (RefCOCOg, Ref-Adv, Ref-L4, 505 samples), IVT improves Acc@0.5 to 82.0% (+2.4pp), Acc@0.7 to 74.1% (+3.2pp), and Acc@0.9 to 48.3% (+2.8pp). GRPO also reduces per-step IoU degradation by 5x, and training requires only 2,400 samples on a single GPU.
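The Acc@0.5, Acc@0.7, and Acc@0.9 figures above are threshold accuracies over intersection-over-union (IoU). The following is a minimal sketch of how such metrics are commonly computed, not the article's evaluation code: the (x1, y1, x2, y2) box format, the `>=` comparison, and the names `iou` and `acc_at` are all assumptions made for illustration.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def acc_at(preds: Sequence[Box], gts: Sequence[Box], threshold: float) -> float:
    """Fraction of predictions whose IoU with the ground truth reaches the threshold.

    Assumption: a prediction counts as correct when IoU >= threshold.
    """
    if len(preds) != len(gts) or not preds:
        raise ValueError("preds and gts must be non-empty and of equal length")
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)
```

Under this definition, Acc@0.9 is the strictest of the three metrics, since it only counts boxes that almost exactly match the ground truth.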
Key takeaway
Machine Learning Engineers developing Vision-Language Models for spatial grounding should integrate iterative visual feedback to overcome the self-correction gap. This approach raises Acc@0.5 to 82.0% (+2.4pp) and stabilizes multi-step refinement with modest training resources: 2,400 samples on a single GPU. Consider adopting the two-phase training recipe with GRPO to instill this capability efficiently.
Key insights
Iterative Visual Thinking enables VLMs to self-correct spatial predictions through visual feedback, closing a critical performance gap.
Principles
- Naive VLM iteration fails catastrophically.
- Self-correction is a learnable VLM capability.
- Visual feedback drives spatial refinement.
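The predict, render, refine loop behind these principles can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `propose` and `render` are hypothetical callables standing in for the VLM and the box-drawing step, and both the step budget and the "stop when the box no longer changes" rule are assumed.

```python
from typing import Any, Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2)

# Hypothetical interfaces (not from the article):
#   propose(image, query, previous_box) -> Box   wraps the VLM call
#   render(image, box) -> image                  draws the box on the image
Propose = Callable[[Any, str, Optional[Box]], Box]
Render = Callable[[Any, Box], Any]


def refine_box(
    image: Any,
    query: str,
    propose: Propose,
    render: Render,
    max_steps: int = 3,
) -> List[Box]:
    """Run the closed loop and return every box, from first guess to final."""
    box = propose(image, query, None)  # initial prediction, no feedback yet
    history = [box]
    for _ in range(max_steps):
        # Always draw on the original image so overlays do not accumulate.
        overlay = render(image, box)
        new_box = propose(overlay, query, box)
        if new_box == box:  # assumed stopping rule: the model keeps its box
            break
        history.append(new_box)
        box = new_box
    return history
```

Returning the full history rather than only the last box makes it possible to measure how IoU changes per step, which is the quantity the summary reports as degrading under naive iteration.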
Method
IVT uses a two-phase training recipe. First, a teacher VLM generates corrective reasoning from the base model's errors. Second, Group Relative Policy Optimization (GRPO) with an IoU reward stabilizes multi-step refinement.
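To make the second phase concrete, the core of GRPO is that each sampled response is scored relative to the other responses drawn for the same prompt. The sketch below shows only that normalization step, assuming each reward is the IoU between a sampled box and the ground truth. The function name, the use of the population standard deviation, and the `eps` value are assumptions; the article's exact reward shaping and loss are not given here.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    that beat their siblings get a positive signal and the rest a negative one.
    Assumption: rewards are IoU values in [0, 1], one per sampled box.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


# Example: four sampled refinements of the same box, scored by IoU.
advantages = group_relative_advantages([0.2, 0.4, 0.6, 0.8])
```

Because advantages are computed within the group, no separate value model is needed, and a group whose samples all earn the same IoU contributes zero advantage.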
In practice
- Train self-correction with 2,400 samples.
- Use base model errors for synthetic data.
- Apply GRPO for stable multi-step refinement.
Topics
- Vision-Language Models
- Spatial Grounding
- Self-Correction
- Iterative Visual Thinking
- Group Relative Policy Optimization
- Bounding Box Refinement
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.