Iterative Visual Thinking: Teaching Vision-Language Models Spatial Self-Correction through Visual Feedback

Source: cs.AI updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

Iterative Visual Thinking (IVT) is a closed-loop framework that teaches Vision-Language Models (VLMs) spatial self-correction through visual feedback. VLMs such as Qwen3-VL-4B-Instruct achieve strong single-shot spatial grounding (79.6% Acc@0.5), but naively prompting them to iterate on their rendered predictions causes catastrophic failure: accuracy drops to 48.7% Acc@0.5, a collapse of roughly 31 percentage points (30.9pp). IVT addresses this "self-correction gap" with a two-phase training recipe. First, an SFT warm-up phase uses the base model's own predictions as realistic errors to generate supervised data, enabling the model to surpass the single-shot baseline, reaching 82.0% Acc@0.5 (+2.4pp), 74.1% Acc@0.7 (+3.2pp), and 48.3% Acc@0.9 (+2.8pp). Second, Group Relative Policy Optimization (GRPO) fine-tuning stabilizes multi-step refinement, reducing per-step IoU degradation by a factor of five. The capability is instilled using only 2,400 samples on a single GPU.
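
The Acc@0.5, Acc@0.7, and Acc@0.9 figures above are threshold accuracies over box IoU. A minimal sketch of how such metrics are typically computed follows. It assumes boxes are given as (x1, y1, x2, y2) corner coordinates and that a prediction counts as correct when its IoU with the target is at least the threshold; the summary does not state either convention, and the names `iou` and `acc_at` are illustrative, not taken from the paper.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x2 >= x1 and y2 >= y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def acc_at(preds: Sequence[Box], targets: Sequence[Box], threshold: float) -> float:
    """Fraction of predictions whose IoU with the target meets the threshold.

    Assumption: "meets" means IoU >= threshold.
    """
    if not preds:
        return 0.0
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)
```

Under these assumptions, stricter thresholds such as 0.9 reward only tightly fitting boxes, which is consistent with the lower Acc@0.9 figures reported above.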

Key takeaway

Machine learning engineers building vision-language models for spatial grounding should not assume that strong single-shot grounding carries over to self-correction. The recipe described here has two phases: supervised fine-tuning on self-generated error trajectories to enable iterative refinement, then reinforcement learning such as GRPO to stabilize the correction process. This strategy can improve accuracy and robustness, especially for complex or ambiguous localization tasks.

Key insights

VLMs require explicit training to interpret visual feedback for spatial self-correction: without it, naive iteration on rendered predictions degrades accuracy (from 79.6% to 48.7% Acc@0.5 for Qwen3-VL-4B-Instruct).
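
The closed loop behind this insight can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: `propose` and `render` are hypothetical callables standing in for the VLM call and the box-drawing step, the default of three refinement steps is arbitrary, and whether the model sees only the overlay or also the original image is not specified in the summary.

```python
from typing import Any, Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) format


def refine_loop(
    image: Any,
    query: str,
    propose: Callable[[Any, str, Optional[Box]], Box],  # hypothetical VLM wrapper
    render: Callable[[Any, Box], Any],                  # hypothetical box renderer
    steps: int = 3,                                     # assumed step count
) -> List[Box]:
    """Run single-shot grounding, then `steps` rounds of visual feedback.

    Returns every box in order, so per-step IoU drift can be measured.
    """
    box = propose(image, query, None)  # single-shot prediction, no feedback yet
    trajectory = [box]
    for _ in range(steps):
        overlay = render(image, box)         # draw the current prediction
        box = propose(overlay, query, box)   # ask the model to correct it
        trajectory.append(box)
    return trajectory
```

Keeping the whole trajectory rather than only the final box makes it possible to measure how IoU changes from one step to the next, which is the quantity the untrained model degrades on.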

Principles

Method

Training has two phases. The SFT warm-up takes the student model's own errors as starting points and pairs them with teacher-generated corrective traces. GRPO fine-tuning with a simple IoU reward then stabilizes multi-step refinement.
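
To make the GRPO phase concrete, here is a minimal sketch of an IoU reward combined with group-relative advantage normalization, the core idea of GRPO: each sampled output is scored against the mean and standard deviation of its own sampling group. The function names, the (x1, y1, x2, y2) box format, the use of the population standard deviation, and the `eps` constant are all assumptions for illustration; the summary only states that the reward is a simple IoU.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) format


def iou_reward(pred: Box, target: Box) -> float:
    """Reward in [0, 1]: IoU between a sampled box and the ground truth."""
    iw = max(0.0, min(pred[2], target[2]) - max(pred[0], target[0]))
    ih = max(0.0, min(pred[3], target[3]) - max(pred[1], target[1]))
    inter = iw * ih
    area_p = max(0.0, pred[2] - pred[0]) * max(0.0, pred[3] - pred[1])
    area_t = max(0.0, target[2] - target[0]) * max(0.0, target[3] - target[1])
    union = area_p + area_t - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one sampling group.

    Assumption: population standard deviation, with eps guarding against
    a zero spread when every sample in the group earns the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For a group of boxes sampled for the same query, `group_relative_advantages([iou_reward(b, target) for b in group])` gives positive values to boxes that beat the group average and negative values to the rest, so no separate value network is needed.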

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.