Iterative Visual Thinking: Teaching Vision-Language Models Spatial Self-Correction through Visual Feedback
Summary
Iterative Visual Thinking (IVT) is a closed-loop framework that teaches Vision-Language Models (VLMs) to correct their own spatial predictions using visual feedback. VLMs such as Qwen3-VL-4B-Instruct achieve strong single-shot spatial grounding (79.6% Acc@0.5, that is, accuracy at an IoU threshold of 0.5), but naively prompting them to iterate on their own rendered predictions causes catastrophic failure: accuracy drops to 48.7% Acc@0.5, a collapse of about 31 percentage points. IVT addresses this "self-correction gap" with a two-phase training recipe. First, an SFT warm-up phase uses the base model's own predictions as realistic errors to generate supervised data. This lets the model surpass the single-shot baseline, reaching 82.0% Acc@0.5 (+2.4pp), 74.1% Acc@0.7 (+3.2pp), and 48.3% Acc@0.9 (+2.8pp). Second, Group Relative Policy Optimization (GRPO) fine-tuning stabilizes multi-step refinement, reducing per-step IoU degradation by a factor of five. The capability is instilled using only 2,400 samples on a single GPU.
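The Acc@0.5, Acc@0.7, and Acc@0.9 figures above are threshold accuracies over box IoU. A minimal sketch of how such a metric is typically computed follows. It assumes boxes in (x1, y1, x2, y2) corner format and counts a prediction as correct when IoU is greater than or equal to the threshold; the summary does not state the paper's exact convention, so both are assumptions.

```python
from typing import Sequence, Tuple

# Assumption: boxes are (x1, y1, x2, y2) corners with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def acc_at(preds: Sequence[Box], targets: Sequence[Box], threshold: float) -> float:
    """Fraction of predictions whose IoU with the target meets the threshold."""
    if len(preds) != len(targets) or not preds:
        raise ValueError("preds and targets must be non-empty and the same length")
    # Assumption: ">=" is used at the threshold boundary.
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)
```

Raising the threshold from 0.5 to 0.9 demands much tighter boxes, which is why the reported accuracy falls from 82.0% to 48.3% for the same model.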
Key takeaway
Machine learning engineers developing vision-language models for spatial grounding should not assume that strong native grounding extends to self-correction. Implement a two-phase training approach: start with supervised fine-tuning on self-generated error trajectories to enable iterative refinement, then apply reinforcement learning such as GRPO to stabilize the correction process. This strategy can improve accuracy and robustness, especially for complex or ambiguous localization tasks.
Key insights
VLMs require explicit training to interpret visual feedback for spatial self-correction; without it, asking a model to iterate on its own predictions degrades performance.
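The closed loop in question can be sketched as follows: predict a box, draw it on the image, show the result back to the model, and ask for a revised box. This is an illustrative outline only. The `predict` and `render` callables, their signatures, and the fixed step count are hypothetical placeholders, not the paper's interface.

```python
from typing import Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2)

# Hypothetical interfaces, not taken from the paper:
#   predict(image, expression, previous_box) -> Box
#   render(image, box) -> image with the box drawn on it
PredictFn = Callable[[object, str, Optional[Box]], Box]
RenderFn = Callable[[object, Box], object]


def refine(
    predict: PredictFn,
    render: RenderFn,
    image: object,
    expression: str,
    steps: int = 3,
) -> List[Box]:
    """Run single-shot grounding, then `steps` rounds of visual feedback.

    Returns every box produced, so per-step IoU can be inspected afterwards.
    """
    box = predict(image, expression, None)  # single-shot prediction
    trajectory = [box]
    for _ in range(steps):
        overlay = render(image, box)              # draw the current guess
        box = predict(overlay, expression, box)   # ask the model to revise it
        trajectory.append(box)
    return trajectory
```

Keeping the full trajectory rather than only the final box makes it possible to measure whether each refinement step helps or hurts, which is where the untrained model fails.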
Principles
- Strong grounding doesn't imply self-correction.
- Visual feedback needs explicit interpretation training.
- RL alone cannot discover self-correction.
Method
Training proceeds in two phases. In the SFT warm-up, the student model's own erroneous predictions serve as starting points, and a teacher generates corrective traces for them. GRPO fine-tuning with a simple IoU reward then stabilizes multi-step refinement.
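To illustrate how an IoU reward feeds into GRPO, the sketch below computes group-relative advantages: several completions are sampled for the same prompt, each is scored (here, by the IoU of its box against the ground truth), and each reward is normalized against its own group. This shows the generic GRPO normalization, not the paper's training code. The use of the population standard deviation and the `eps` value are assumptions, and implementations differ on both.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt.

    Each reward is assumed to be the IoU between a sampled box and the
    ground-truth box, so it lies in [0, 1].
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some code uses sample std
    # eps keeps the division finite when every sample earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four sampled refinements of one prompt, scored by IoU.
advantages = group_relative_advantages([0.42, 0.55, 0.71, 0.55])
```

Samples that beat the group average receive positive advantages and are reinforced; if every sample scores the same, all advantages are zero and the group contributes no gradient signal.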
In practice
- Use student predictions for SFT data.
- Apply GRPO for refinement stability.
- Consider adaptive refinement for hard cases.
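The first point above, using student predictions for SFT data, might look like the sketch below. The record schema, the `student_predict` and `teacher_trace` callables, and the sample tuple layout are hypothetical; the summary says only that the student's own errors seed teacher-generated corrective traces.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2)


@dataclass
class SFTExample:
    """One supervised record (hypothetical schema)."""
    image_id: str
    expression: str
    student_box: Box    # the student's own prediction, shown as visual feedback
    target_trace: str   # teacher-written reasoning that corrects the box
    target_box: Box     # ground-truth box the trace should arrive at


def build_sft_examples(
    samples: Iterable[Tuple[str, str, Box]],
    student_predict: Callable[[str, str], Box],
    teacher_trace: Callable[[str, str, Box, Box], str],
) -> List[SFTExample]:
    """Pair each student prediction with a teacher-generated correction."""
    examples = []
    for image_id, expression, gt_box in samples:
        # Start from what the student actually predicts, not a synthetic error.
        student_box = student_predict(image_id, expression)
        trace = teacher_trace(image_id, expression, student_box, gt_box)
        examples.append(
            SFTExample(image_id, expression, student_box, trace, gt_box)
        )
    return examples
```

Starting from real student outputs means the errors seen during training match those the model makes at inference time, which is the rationale the summary gives for calling them "realistic errors".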
Topics
- Vision-Language Models
- Spatial Grounding
- Self-Correction
- Iterative Visual Thinking
- Reinforcement Learning
- Supervised Fine-tuning
- Referring Expression Comprehension
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.