Perceive, Interact, Reason: Building Tool-Augmented Visual Agents for Spatial Reasoning
Summary
PERception-Interaction-reason Agent (PERIA) is a new tool-augmented visual agent designed to overcome vision-language models' limitations in complex spatial reasoning tasks. It integrates two lightweight tool families: vision perception tools for extracting textual, symbolic, and spatial evidence, and vision interaction tools for manipulating visual context and verifying spatial relations. PERIA's training employs a unified recipe combining supervised tool-use trajectory synthesis, composite rewards, and Observation-Relaxed Group-in-Group Policy Optimization (OR-GIGPO). Experiments across 13 benchmarks from 8 datasets show PERIA-8B improves over the Qwen3-8B backbone by 10.0% on in-distribution tasks and 4.4% on out-of-distribution tasks. It also surpasses previous baselines of similar size by 7.0%–14.8%, achieving performance comparable to much larger models like Qwen3-VL-235B-A22B-Thinking and GPT-5.
Key takeaway
Machine learning engineers building vision-language models for spatial reasoning should consider tool-augmented agents like PERIA. Pairing specialized perception and interaction tools with reinforcement learning designed for multi-step tool use (OR-GIGPO) lets an 8B model reach results reported as comparable to much larger models such as Qwen3-VL-235B-A22B-Thinking and GPT-5. The gain comes from training the model to use its tools well rather than from scaling the model alone.
Key insights
Tool-augmented visual agents, trained with observation-relaxed policy optimization, significantly enhance spatial reasoning in VLMs.
Principles
- Spatial reasoning demands active evidence acquisition.
- Raw tool access needs dedicated tool-use training.
- Observation-relaxed RL improves multi-step tool learning.
Method
PERIA's method has three parts: a modular tool sandbox, tool-use trajectory synthesis with an explore-and-exploit strategy, and optimization with Observation-Relaxed Group-in-Group Policy Optimization (OR-GIGPO) using composite rewards.
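To make the "modular tool sandbox" idea concrete, here is a minimal sketch of a registry that keeps perception and interaction tools in one place and dispatches calls by name. It is not PERIA's implementation: the `ToolSandbox` class, the `image_size` and `crop` tools, and the nested-list image format are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class Tool:
    name: str
    family: str  # "perception" or "interaction" (the two families named in the summary)
    fn: Callable[..., Any]


class ToolSandbox:
    """Hypothetical registry: tools are added independently and called by name."""

    FAMILIES = ("perception", "interaction")

    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, name: str, family: str, fn: Callable[..., Any]) -> None:
        if family not in self.FAMILIES:
            raise ValueError(f"unknown tool family: {family}")
        self._tools[name] = Tool(name, family, fn)

    def call(self, name: str, **kwargs: Any) -> Dict[str, Any]:
        # Errors are returned as observations instead of raised, so an agent
        # loop can read the failure and try a different tool call.
        tool = self._tools.get(name)
        if tool is None:
            return {"ok": False, "error": f"unknown tool: {name}"}
        try:
            return {"ok": True, "family": tool.family, "result": tool.fn(**kwargs)}
        except Exception as exc:
            return {"ok": False, "error": str(exc)}


# Illustrative tools only; an image is a nested list of pixel values here.
def image_size(image):
    """Perception-style tool: report global evidence about the image."""
    return {"height": len(image), "width": len(image[0]) if image else 0}


def crop(image, top, left, bottom, right):
    """Interaction-style tool: return a sub-region for local inspection."""
    return [row[left:right] for row in image[top:bottom]]


sandbox = ToolSandbox()
sandbox.register("image_size", "perception", image_size)
sandbox.register("crop", "interaction", crop)
```

An agent loop would call `sandbox.call("crop", image=img, top=0, left=1, bottom=2, right=3)` and feed the returned dictionary back to the model as the next observation. Keeping each tool behind the same `call` interface is what makes the sandbox modular: tools can be added or swapped without touching the loop.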
In practice
- Integrate perception tools for global evidence.
- Utilize interaction tools for local visual analysis.
- Apply OR-GIGPO for multi-step tool-use credit.
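The multi-step credit assignment in the last bullet can be sketched with a generic group-in-group advantage: an episode-level advantage normalized across trajectories for the same task, plus a step-level advantage normalized across steps that fall into the same group. This is only an illustration under stated assumptions. The summary does not define what "observation-relaxed" means, so the `key` function below is a placeholder for whatever step-matching rule the paper uses, and the function names, data layout, and `weight` parameter are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize(values, eps=1e-8):
    """Subtract the group mean and divide by the group standard deviation."""
    m = mean(values)
    s = pstdev(values)
    return [(v - m) / (s + eps) for v in values]


def group_in_group_advantages(trajectories, key, weight=1.0):
    """Return one advantage per step for every trajectory.

    trajectories: list of dicts with
        "reward": float, the episode-level (e.g. composite) reward
        "steps":  list of dicts, each carrying a "return" (float) plus any
                  fields that `key` needs to decide which steps are comparable
    key: function mapping a step to a hashable group id (assumption: a looser
         key groups more steps together than exact observation matching)
    weight: relative weight of the step-level term (hypothetical parameter)
    """
    # Outer group: compare whole trajectories sampled for the same task.
    episode_adv = normalize([t["reward"] for t in trajectories])

    # Inner groups: collect steps from all trajectories that share a key.
    groups = defaultdict(list)
    for ti, traj in enumerate(trajectories):
        for si, step in enumerate(traj["steps"]):
            groups[key(step)].append((ti, si, step["return"]))

    step_adv = {}
    for members in groups.values():
        normed = normalize([ret for _, _, ret in members])
        for (ti, si, _), adv in zip(members, normed):
            step_adv[(ti, si)] = adv  # a singleton group yields 0.0

    return [
        [episode_adv[ti] + weight * step_adv[(ti, si)]
         for si in range(len(traj["steps"]))]
        for ti, traj in enumerate(trajectories)
    ]
```

With `key=lambda step: step["state"]`, steps are grouped only when their recorded state matches exactly; passing a coarser key is one way such grouping could be relaxed, though whether that mirrors OR-GIGPO is not something this summary confirms.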
Topics
- Vision-Language Models
- Spatial Reasoning
- Tool-Augmented Agents
- Reinforcement Learning
- OR-GIGPO
- Multimodal Reasoning
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.