Perceive, Interact, Reason: Building Tool-Augmented Visual Agents for Spatial Reasoning

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

PERception-Interaction-reason Agent (PERIA) is a new tool-augmented visual agent designed to overcome vision-language models' limitations in complex spatial reasoning tasks. It integrates two lightweight tool families: vision perception tools for extracting textual, symbolic, and spatial evidence, and vision interaction tools for manipulating visual context and verifying spatial relations. PERIA's training employs a unified recipe combining supervised tool-use trajectory synthesis, composite rewards, and Observation-Relaxed Group-in-Group Policy Optimization (OR-GIGPO). Experiments across 13 benchmarks from 8 datasets show PERIA-8B improves over the Qwen3-8B backbone by 10.0% on in-distribution tasks and 4.4% on out-of-distribution tasks. It also surpasses previous baselines of similar size by 7.0%–14.8%, achieving performance comparable to much larger models like Qwen3-VL-235B-A22B-Thinking and GPT-5.

Key takeaway

Machine learning engineers developing vision-language models for spatial reasoning should consider tool-augmented agents like PERIA. The approach pairs specialized perception and interaction tools with reinforcement learning methods such as OR-GIGPO, and in the reported experiments an 8B model reaches performance comparable to much larger models such as Qwen3-VL-235B-A22B-Thinking and GPT-5. The gains come from dedicated tool-use training rather than from model scaling alone.

Key insights

Tool-augmented visual agents trained with observation-relaxed policy optimization improve spatial reasoning in VLMs: PERIA-8B gains 10.0% on in-distribution tasks and 4.4% on out-of-distribution tasks over its backbone.

Principles

Spatial reasoning demands active evidence acquisition.
Raw tool access needs dedicated tool-use training.
Observation-relaxed RL improves multi-step tool learning.

Method

PERIA's method has three parts: a modular tool sandbox, synthesis of tool-use trajectories with an explore-and-exploit strategy, and optimization with Observation-Relaxed Group-in-Group Policy Optimization (OR-GIGPO) using composite rewards.
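To make the training side more concrete, here is a minimal Python sketch of how a composite reward and a two-level, group-relative advantage might fit together. The summary does not specify the reward terms or what exactly "observation-relaxed" relaxes, so everything below is an assumption for illustration: the reward weights, the `relaxed_key` normalization (matching observations after lowercasing and collapsing whitespace), and the use of the final trajectory reward as the step-level score are all hypothetical, not PERIA's actual recipe.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Step:
    observation: str  # text rendering of what the agent saw before acting
    action: str       # tool call or final answer emitted at this step


@dataclass
class Trajectory:
    steps: list
    answer_correct: bool
    format_ok: bool
    num_tool_errors: int


def composite_reward(traj, w_answer=1.0, w_format=0.2, w_tool=0.1):
    """Hypothetical composite reward: accuracy + format bonus - tool-error penalty."""
    return (w_answer * float(traj.answer_correct)
            + w_format * float(traj.format_ok)
            - w_tool * traj.num_tool_errors)


def normalize(values):
    """Group-relative normalization: (x - mean) / std, or zeros for a flat group."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def relaxed_key(observation):
    """Assumed relaxation: observations match if equal after lowercasing
    and collapsing whitespace, rather than only when byte-identical."""
    return " ".join(observation.lower().split())


def two_level_advantages(group, step_weight=1.0):
    """Return one advantage per step for each trajectory in a rollout group."""
    rewards = [composite_reward(t) for t in group]

    # Outer level: compare whole trajectories within the rollout group.
    episode_adv = normalize(rewards)

    # Inner level: bucket steps across trajectories by relaxed observation.
    buckets = defaultdict(list)
    for ti, traj in enumerate(group):
        for si, step in enumerate(traj.steps):
            buckets[relaxed_key(step.observation)].append((ti, si))

    # Score each step against the other steps taken from a matching observation.
    # Simplification: the final trajectory reward stands in for a per-step return.
    step_adv = {}
    for members in buckets.values():
        scores = normalize([rewards[ti] for ti, _ in members])
        for key, score in zip(members, scores):
            step_adv[key] = score

    return [[episode_adv[ti] + step_weight * step_adv[(ti, si)]
             for si in range(len(traj.steps))]
            for ti, traj in enumerate(group)]
```

The point of the relaxed key in this sketch is that tool outputs rarely repeat exactly across rollouts, so strict matching would leave most steps in singleton buckets with zero step-level signal. Loosening the match lets more steps share a bucket and receive credit relative to their peers.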

In practice

Integrate perception tools for global evidence.
Utilize interaction tools for local visual analysis.
Apply OR-GIGPO for multi-step tool-use credit.
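The split between perception tools (global evidence) and interaction tools (local analysis and relation checks) can be sketched as a small tool registry. This is an illustrative toy, not PERIA's sandbox: the `ToolSandbox` class, the tool names `locate`, `crop`, and `is_left_of`, and the use of a character grid as a stand-in for an image are all hypothetical choices made so the example runs without any vision library.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    name: str
    family: str  # "perception" or "interaction"
    fn: Callable


class ToolSandbox:
    """Hypothetical registry that dispatches tool calls and captures failures."""

    def __init__(self):
        self._tools = {}

    def register(self, name, family):
        def decorator(fn):
            self._tools[name] = Tool(name, family, fn)
            return fn
        return decorator

    def names(self, family):
        return sorted(t.name for t in self._tools.values() if t.family == family)

    def call(self, name, *args, **kwargs):
        # Errors come back as data so the agent can read them and retry.
        if name not in self._tools:
            return {"ok": False, "error": f"unknown tool: {name}"}
        try:
            return {"ok": True, "result": self._tools[name].fn(*args, **kwargs)}
        except Exception as exc:
            return {"ok": False, "error": str(exc)}


sandbox = ToolSandbox()


@sandbox.register("locate", family="perception")
def locate(image, label):
    """Global evidence: (row, col) of every cell carrying the label."""
    return [(r, c) for r, row in enumerate(image)
            for c, cell in enumerate(row) if cell == label]


@sandbox.register("crop", family="interaction")
def crop(image, top, left, height, width):
    """Local analysis: return the sub-grid inside the requested window."""
    if top < 0 or left < 0 or top + height > len(image) or left + width > len(image[0]):
        raise ValueError("crop window outside image")
    return [row[left:left + width] for row in image[top:top + height]]


@sandbox.register("is_left_of", family="interaction")
def is_left_of(image, a, b):
    """Relation check: every `a` cell lies in a smaller column than every `b` cell."""
    cols_a = [c for _, c in locate(image, a)]
    cols_b = [c for _, c in locate(image, b)]
    if not cols_a or not cols_b:
        raise ValueError("label not found")
    return max(cols_a) < min(cols_b)


if __name__ == "__main__":
    scene = [list("A.."), list("..B"), list("...")]
    print(sandbox.call("locate", scene, "B"))            # perception step
    print(sandbox.call("is_left_of", scene, "A", "B"))   # interaction step
```

Returning failures as structured results instead of raising is a design choice in this sketch: a bad tool call becomes an observation the agent can reason about, which is also what a tool-error term in a training reward would need to count.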

Topics

Vision-Language Models
Spatial Reasoning
Tool-Augmented Agents
Reinforcement Learning
OR-GIGPO
Multimodal Reasoning

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.