Iterative Visual Thinking: Teaching Vision-Language Models Spatial Self-Correction through Visual Feedback

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

Iterative Visual Thinking (IVT) is a closed-loop framework that teaches Vision-Language Models (VLMs) to correct their own spatial predictions from visual feedback. VLMs such as Qwen3-VL-4B-Instruct achieve strong single-shot spatial grounding (79.6% Acc@0.5, that is, accuracy at an IoU threshold of 0.5), but naively prompting them to iterate on their own rendered predictions causes catastrophic failure: accuracy drops to 48.7% Acc@0.5, a collapse of roughly 31 percentage points. IVT addresses this "self-correction gap" with a two-phase training recipe. First, a supervised fine-tuning (SFT) warm-up uses the base model's own predictions as realistic errors to generate supervised data, which lets the model surpass the single-shot baseline: 82.0% Acc@0.5 (+2.4pp), 74.1% Acc@0.7 (+3.2pp), and 48.3% Acc@0.9 (+2.8pp). Second, Group Relative Policy Optimization (GRPO) fine-tuning stabilizes multi-step refinement, reducing per-step IoU degradation by 5x. This capability is instilled with only 2,400 samples on a single GPU.
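The Acc@0.5, Acc@0.7, and Acc@0.9 figures above are threshold accuracies over box IoU. A minimal sketch of how such a metric is typically computed is shown here; the box format `(x1, y1, x2, y2)`, the function names, and the use of `>=` at the threshold are assumptions for illustration, not details taken from the paper.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def acc_at(preds: Sequence[Box], gts: Sequence[Box], threshold: float) -> float:
    """Fraction of predictions whose IoU with the ground truth meets the threshold."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("preds and gts must be non-empty and the same length")
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)
```

Under this definition, Acc@0.9 is far stricter than Acc@0.5, which is consistent with the much lower Acc@0.9 figure reported above.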

Key takeaway

For machine learning engineers building vision-language models for spatial grounding: strong native grounding does not extend to self-correction. Use a two-phase training approach, starting with supervised fine-tuning on self-generated error trajectories to enable iterative refinement, then applying reinforcement learning such as GRPO to stabilize the correction process. This strategy can improve accuracy and robustness, especially for complex or ambiguous localization tasks.

Key insights

VLMs require explicit training to interpret visual feedback for spatial self-correction; without it, naive iteration degrades performance.

Principles

Strong grounding doesn't imply self-correction.
Visual feedback needs explicit interpretation training.
RL alone cannot discover self-correction.

Method

Training has two phases: an SFT warm-up, in which a teacher generates corrective traces for the student model's own errors, followed by GRPO fine-tuning with a simple IoU reward for stability.
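To make the GRPO phase more concrete, the sketch below shows the group-relative advantage step that gives GRPO its name: several candidate refinements are sampled for the same input, each is scored (here by IoU against the ground truth), and each reward is normalized against its own group. The function name, the `eps` constant, and the use of population standard deviation are assumptions for illustration; the paper's exact reward shaping and hyperparameters are not given in this summary.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds one scalar per sampled response for the same prompt,
    e.g. the IoU of each candidate refined box with the ground-truth box.
    """
    if len(rewards) < 2:
        raise ValueError("a group needs at least two sampled responses")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps (an assumed constant) avoids division by zero when all rewards tie.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: IoU rewards for four sampled refinements of one prediction.
advantages = group_relative_advantages([0.2, 0.4, 0.6, 0.8])
```

Candidates with above-average IoU get positive advantages and are reinforced, while below-average ones are pushed down. When every candidate in a group scores the same, all advantages are zero and that group contributes no learning signal.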

In practice

Use student predictions for SFT data.
Apply GRPO for refinement stability.
Consider adaptive refinement for hard cases.
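One possible reading of adaptive refinement is to keep iterating only while the prediction is still changing, so that easy cases stop after one step and hard cases get more. The sketch below is an illustration under that assumption, not the paper's procedure: `refine_step` is a hypothetical callable standing in for "render the current box on the image and ask the model for a corrected box", and `max_steps` and `stable_iou` are made-up parameters.

```python
from typing import Callable, List, Tuple

# Assumed box format: (x1, y1, x2, y2).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def refine_until_stable(
    initial: Box,
    refine_step: Callable[[Box], Box],  # hypothetical model call
    max_steps: int = 3,
    stable_iou: float = 0.95,
) -> List[Box]:
    """Refine a box until consecutive predictions agree or the budget runs out.

    Returns the full trajectory, starting with the initial prediction.
    """
    trajectory = [initial]
    for _ in range(max_steps):
        nxt = refine_step(trajectory[-1])
        trajectory.append(nxt)
        # Stop early once the model no longer moves the box appreciably.
        if iou(trajectory[-2], nxt) >= stable_iou:
            break
    return trajectory
```

Capping the number of steps matters here because, as the summary notes, untrained iteration can degrade predictions step by step; a bounded loop limits the damage if refinement does not converge.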

Topics

Vision-Language Models
Spatial Grounding
Self-Correction
Iterative Visual Thinking
Reinforcement Learning
Supervised Fine-tuning
Referring Expression Comprehension

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer


Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.