Thinking with Visual Grounding

2026-06-15 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

The article introduces visually grounded thinking, a reasoning process for vision-language models (VLMs) that interleaves natural-language thoughts with explicit point or box groundings of visual evidence. Tying each intermediate step to a specific image region makes the reasoning verifiable. To produce training data, the authors built a scalable synthesis pipeline that distills correct visual reasoning traces, extracts the visual objects those traces rely on, grounds them with a SAM3-based agent, and generates aligned point and box supervision. Training uses grounding-aware reinforcement learning, which combines an answer-correctness reward with dense grounding rewards. Applied to Gemma3-4B-IT, the method consistently improved performance across two counting benchmarks and four spatial reasoning benchmarks. On spatial reasoning, the visually grounded 4B models matched or surpassed Gemma3-27B-IT. The analysis indicates that point grounding is effective for counting, while box grounding helps spatial tasks when paired with explicit grounding rewards. Overall, the results suggest that VLMs reason more effectively when their thoughts are tied to visual evidence.

Key takeaway

For machine learning engineers developing vision-language models, visually grounded thinking offers a way to improve both reasoning accuracy and verifiability. Consider attaching explicit point or box groundings to intermediate thoughts and training with grounding-aware reinforcement learning. The method improved Gemma3-4B-IT on counting and spatial reasoning benchmarks, and it makes model outputs easier to inspect, particularly in applications that demand precise visual evidence.

Key insights

Vision-language models reason more effectively when each natural-language thought is explicitly grounded in a specific image region.

Principles

Visual evidence makes VLM reasoning verifiable.
Explicit visual grounding improved Gemma3-4B-IT consistently across the reported counting and spatial reasoning benchmarks.
Grounding types (point/box) should align with task needs.
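
One way to picture an interleaved trace is as a list of steps, each holding a thought plus an optional point or box. The sketch below is illustrative only: the `Step` fields, the pixel-coordinate convention, and both helper functions are assumptions made here, not the format used in the article.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Step:
    """One reasoning step: text plus optional visual evidence (hypothetical schema)."""
    thought: str
    point: Optional[Tuple[float, float]] = None                # (x, y) in pixels
    box: Optional[Tuple[float, float, float, float]] = None    # (x1, y1, x2, y2)


def grounded_fraction(trace: List[Step]) -> float:
    """Share of steps that carry a point or a box."""
    if not trace:
        return 0.0
    grounded = sum(1 for s in trace if s.point is not None or s.box is not None)
    return grounded / len(trace)


def in_bounds(step: Step, width: float, height: float) -> bool:
    """Check that every coordinate of a step lies inside the image.

    A step with no grounding passes trivially.
    """
    coords = []
    if step.point is not None:
        coords.append(step.point)
    if step.box is not None:
        x1, y1, x2, y2 = step.box
        coords.extend([(x1, y1), (x2, y2)])
    return all(0 <= x <= width and 0 <= y <= height for x, y in coords)


trace = [
    Step("There is a cup on the left of the table.", box=(10, 20, 50, 60)),
    Step("So the cup is left of the plate."),
]
```

Because each grounded step names a region, a reviewer or an automated check can inspect that region directly instead of trusting the text alone.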

Method

A scalable synthesis pipeline distills reasoning traces, extracts visual objects, grounds them via SAM3, and derives supervision. Grounding-aware reinforcement learning combines answer correctness with dense grounding rewards.
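
The reward combination described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the article's implementation: the IoU-based box score, the greedy one-to-one matching, the exact-match answer check, the `grounding_weight` parameter, and all function names are introduced here for the example.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(pred: List[Box], ref: List[Box]) -> float:
    """Dense reward in [0, 1]: greedy one-to-one IoU matching (assumed scheme).

    Dividing by the larger of the two counts penalizes both missing
    and spurious boxes.
    """
    if not pred and not ref:
        return 1.0
    if not pred or not ref:
        return 0.0
    unused = list(range(len(ref)))
    total = 0.0
    for p in pred:
        if not unused:
            break
        best = max(unused, key=lambda j: box_iou(p, ref[j]))
        total += box_iou(p, ref[best])
        unused.remove(best)
    return total / max(len(pred), len(ref))


def total_reward(pred_answer: str, gold_answer: str,
                 pred_boxes: List[Box], ref_boxes: List[Box],
                 grounding_weight: float = 0.5) -> float:
    """Answer-correctness reward plus a weighted grounding reward."""
    answer_r = 1.0 if pred_answer.strip().lower() == gold_answer.strip().lower() else 0.0
    return answer_r + grounding_weight * grounding_reward(pred_boxes, ref_boxes)
```

The grounding term gives partial credit even when the final answer is wrong, which is what makes it a dense signal in this sketch; the weight between the two terms is a tunable assumption.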

In practice

Apply point grounding for counting benchmarks.
Utilize box grounding for spatial reasoning tasks.
Incorporate explicit grounding rewards for spatial tasks.
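
For counting, a point-based check might look like the sketch below: each predicted point should land in a distinct reference object. The use of reference boxes as the hit region, the one-point-per-object rule, and the function names are assumptions for illustration, not details taken from the article.

```python
from typing import List, Tuple

Point = Tuple[float, float]              # (x, y)
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def point_in_box(p: Point, b: Box) -> bool:
    """True if the point lies inside or on the edge of the box."""
    return b[0] <= p[0] <= b[2] and b[1] <= p[1] <= b[3]


def point_grounding_score(points: List[Point], ref_boxes: List[Box]) -> float:
    """Fraction of correct one-to-one point hits, in [0, 1].

    Each reference box can be claimed by at most one point, so pointing
    at the same object twice does not count twice. Dividing by the larger
    count penalizes both over- and under-counting.
    """
    if not points and not ref_boxes:
        return 1.0
    if not points or not ref_boxes:
        return 0.0
    unmatched = list(ref_boxes)
    hits = 0
    for p in points:
        for b in unmatched:
            if point_in_box(p, b):
                unmatched.remove(b)  # safe: we break right after mutating
                hits += 1
                break
    return hits / max(len(points), len(ref_boxes))
```

In this sketch the number of points doubles as the predicted count, so a double-counted or missed object lowers the score directly.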

Topics

Visual Grounding
Vision-Language Models
Reinforcement Learning
Spatial Reasoning
Counting Tasks
Gemma3
SAM3

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.