Perceive, Interact, Reason: Building Tool-Augmented Visual Agents for Spatial Reasoning
Summary
The PERception-Interaction-reason Agent (PERIA) is a tool-augmented visual agent designed to overcome the limitations of current vision-language models (VLMs) on complex spatial reasoning tasks. VLMs often struggle with tasks that require active evidence acquisition and multi-step visual interaction because their implicit visual representations are insufficient. PERIA addresses this by integrating two lightweight tool families: vision perception tools, which expose textual, symbolic, and spatial evidence, and vision interaction tools, which manipulate visual context, trace paths, and verify spatial relations. It is trained with a unified recipe that combines supervised tool-use trajectory synthesis, composite rewards, and Observation-Relaxed Group-in-Group Policy Optimization (OR-GIGPO). Across 13 benchmarks from 8 datasets, PERIA-8B improved on the Qwen3-8B backbone by 10.0% on in-distribution benchmarks and 4.4% on out-of-distribution benchmarks. It also outperformed similar-sized baselines by 7.0%-14.8% and performed comparably to much larger models such as Qwen3-VL-235B-A22B-Thinking and GPT-5.
Key takeaway
If you are a machine learning engineer whose vision-language models struggle with complex spatial reasoning, consider tool-augmented visual agents such as PERIA. The reported results suggest this approach improves performance on tasks that require active evidence acquisition and multi-step visual interaction. Explore lightweight vision perception and interaction tools, and investigate training methods such as OR-GIGPO, to strengthen spatial reasoning without relying solely on larger, more resource-intensive backbones.
Key insights
Tool-augmented visual agents enhance spatial reasoning by actively acquiring and interacting with visual evidence.
Principles
- Implicit visual representations limit fine-grained spatial reasoning.
- Active evidence acquisition is key for complex spatial tasks.
- Lightweight tool families extend VLM spatial capabilities.
Method
PERIA's training combines supervised tool-use trajectory synthesis, composite rewards, and Observation-Relaxed Group-in-Group Policy Optimization (OR-GIGPO) for effective multi-tool behavior.
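The summary does not say which terms make up the composite reward or how OR-GIGPO computes advantages, so the following is only a minimal sketch of the general idea. It assumes a reward that mixes answer correctness, output format validity, and tool-call success, and then applies a plain group-relative normalization of the kind used by group-based policy optimization methods. The `Rollout` fields, the weights, and both function names are hypothetical and are not taken from the paper; this is not OR-GIGPO itself.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class Rollout:
    """One sampled trajectory for a question (all fields are assumptions)."""
    answer_correct: bool
    format_valid: bool
    tool_calls_ok: int      # tool calls that executed successfully
    tool_calls_total: int   # tool calls attempted


def composite_reward(r: Rollout, w_answer: float = 1.0,
                     w_format: float = 0.2, w_tool: float = 0.3) -> float:
    """Weighted sum of three assumed reward terms (weights are illustrative)."""
    tool_rate = r.tool_calls_ok / r.tool_calls_total if r.tool_calls_total else 0.0
    return (w_answer * float(r.answer_correct)
            + w_format * float(r.format_valid)
            + w_tool * tool_rate)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of rollouts for the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(x - mu) / (sigma + eps) for x in rewards]


# A group of three rollouts sampled for the same question.
group = [
    Rollout(answer_correct=True, format_valid=True, tool_calls_ok=2, tool_calls_total=2),
    Rollout(answer_correct=False, format_valid=True, tool_calls_ok=1, tool_calls_total=2),
    Rollout(answer_correct=False, format_valid=False, tool_calls_ok=0, tool_calls_total=0),
]
rewards = [composite_reward(r) for r in group]
advantages = group_relative_advantages(rewards)
```

In this sketch the rollout that answers correctly with well-formed, successful tool calls gets the largest advantage, and the advantages within a group sum to roughly zero. A two-level scheme in the spirit of Group-in-Group optimization would add a second, step-level grouping on top of this episode-level one; how the observation-relaxed variant forms those step groups is not described in the summary.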
In practice
- Employ vision perception tools for evidence extraction.
- Utilize vision interaction tools for visual context manipulation.
- Explore OR-GIGPO for multi-tool agent training.
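To make the perceive, interact, reason pattern in the list above concrete, here is a toy sketch. It assumes a character grid as a stand-in for an image and a scripted controller in place of a VLM. The tool names `locate` and `trace_path`, the `TOOLS` registry, and `run_episode` are hypothetical illustrations, not PERIA's actual tools or API.

```python
from collections import deque
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[str]        # toy stand-in for an image: rows of characters
Cell = Tuple[int, int]  # (row, col)


def locate(grid: Grid, symbol: str) -> Optional[Cell]:
    """Perception tool (hypothetical): report where a symbol sits, if anywhere."""
    for r, row in enumerate(grid):
        c = row.find(symbol)
        if c != -1:
            return (r, c)
    return None


def trace_path(grid: Grid, start: Cell, goal: Cell) -> Optional[int]:
    """Interaction tool (hypothetical): breadth-first search that avoids '#' walls.

    Returns the number of steps on a shortest path, or None if no path exists.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            in_bounds = 0 <= nr < len(grid) and 0 <= nc < len(grid[nr])
            if in_bounds and grid[nr][nc] != "#" and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None


TOOLS: Dict[str, Callable] = {"locate": locate, "trace_path": trace_path}


def run_episode(grid: Grid) -> Tuple[str, List[str]]:
    """Scripted controller: perceive both markers, trace a path, then answer."""
    log: List[str] = []

    def call(name: str, *args):
        result = TOOLS[name](grid, *args)
        log.append(f"{name}{args} -> {result}")
        return result

    start = call("locate", "S")
    goal = call("locate", "G")
    if start is None or goal is None:
        return "cannot answer: marker missing", log
    steps = call("trace_path", start, goal)
    answer = "unreachable" if steps is None else f"reachable in {steps} steps"
    return answer, log


GRID = [
    "S.#",
    ".##",
    "..G",
]
answer, log = run_episode(GRID)
```

In a real agent the model would choose which tool to call at each step based on the observations so far; the fixed call order here only illustrates how tool outputs become explicit evidence for the final answer.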
Topics
- Tool-Augmented Agents
- Spatial Reasoning
- Vision-Language Models
- Multi-Tool Behavior
- OR-GIGPO
- Visual Perception
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.