SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

SpatialClaw is a training-free framework that improves spatial reasoning in vision-language models (VLMs) by rethinking the action interface for tool-augmented agents. To address the limitations of single-pass code execution and structured tool-call interfaces, it uses a persistent Python kernel pre-loaded with the input frames and a suite of perception and geometry primitives. A VLM-backed agent iteratively writes and executes code cells, so it can compose, inspect, and revise perception results in light of intermediate text and visual observations. Evaluated across 20 diverse spatial reasoning benchmarks, SpatialClaw reached an average accuracy of 59.9%, surpassing recent spatial agents by 11.2 points. The gains were consistent across six VLM backbones, ranging from 27B to 397B parameters, without any model- or benchmark-specific adaptation.

Key takeaway

If you are an AI scientist or machine learning engineer building VLM-based agents for complex 3D/4D spatial reasoning, re-evaluate your agent's action interface. Single-pass code and structured tool calls limit flexibility and iterative refinement. A persistent Python kernel in which the VLM iteratively generates, executes, and revises code based on intermediate visual and textual feedback, as SpatialClaw demonstrates, can raise accuracy and generalize across diverse benchmarks without additional training.

Key insights

Adopting code as an iterative action interface in a persistent kernel significantly enhances VLM spatial reasoning capabilities.

Principles

Action interface design critically impacts agent spatial reasoning.
Code as an orchestration space enables flexible tool composition.
Persistent kernel state lets the agent revise its analysis iteratively.

Method

SpatialClaw coordinates a VLM agent through a five-stage loop: planning, generating one Python cell per step, executing that cell in a persistent kernel, assembling feedback (text, variables, images), and submitting an answer. A unified system prompt guides the whole loop.
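
The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not SpatialClaw's implementation: the names `PersistentKernel`, `run_agent`, and `make_scripted_agent` are hypothetical, the VLM is replaced by a scripted stand-in that replays fixed cells, feedback is text only (no images), and the convention that the agent submits by assigning a variable named `answer` is invented for the example.

```python
import contextlib
import io
import traceback


class PersistentKernel:
    """Runs code cells against one shared namespace, so state survives between cells."""

    def __init__(self, preloaded=None):
        # Assumption: inputs such as frames are injected as plain variables.
        self.namespace = dict(preloaded or {})

    def run_cell(self, source):
        """Execute one cell and return its printed output (or traceback) as text."""
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(source, self.namespace)
        except Exception:
            # Errors become feedback instead of ending the session.
            buffer.write(traceback.format_exc())
        return buffer.getvalue()


def run_agent(agent, kernel, max_steps=8):
    """Alternate between cell generation and execution until an answer appears."""
    feedback = ""
    for _ in range(max_steps):
        cell = agent(feedback)             # one cell per step
        feedback = kernel.run_cell(cell)   # text feedback for the next step
        if "answer" in kernel.namespace:   # hypothetical submission convention
            return kernel.namespace["answer"]
    return None


def make_scripted_agent(cells):
    """Stand-in for the VLM: replays fixed cells and ignores the feedback."""
    remaining = iter(cells)

    def agent(feedback):
        return next(remaining)

    return agent


kernel = PersistentKernel(preloaded={"frames": ["frame0", "frame1", "frame2"]})
agent = make_scripted_agent([
    "n = len(frames)\nprint('frames:', n)",  # inspect the pre-loaded input
    "answer = n * 2",                        # reuse `n` from the previous cell
])
result = run_agent(agent, kernel)
```

The second cell reads `n`, which only exists because the first cell ran in the same namespace. That carry-over is what distinguishes a persistent kernel from single-pass execution.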

In practice

Implement a persistent Python kernel for iterative tool composition.
Integrate scientific libraries for dynamic numerical computation.
Provide visual feedback for intermediate perception results.
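
As an illustration of the kind of geometry primitive an agent might compose inside such a kernel, the sketch below back-projects two pixels with known depth into camera coordinates using the standard pinhole model, then measures the distance between them. The function names `backproject` and `distance_3d`, the intrinsics, and the pixel and depth values are all hypothetical; the source does not list SpatialClaw's actual primitives. The example uses only the standard library, though a real kernel would more likely rely on a scientific library for array operations.

```python
import math


def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection: pixel (u, v) at the given depth to camera-frame (x, y, z)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def distance_3d(p, q):
    """Euclidean distance between two 3D points."""
    return math.dist(p, q)


# Hypothetical intrinsics for a 640x480 camera (illustrative values only).
fx = fy = 500.0
cx, cy = 320.0, 240.0

# Hypothetical object centers and depths, as a detector and depth estimator might report them.
chair = backproject(320, 240, 2.0, fx, fy, cx, cy)
table = backproject(820, 240, 2.0, fx, fy, cx, cy)
gap = distance_3d(chair, table)
```

Because both points stay in the kernel as ordinary variables, a later cell could re-run the measurement with a corrected detection without repeating the earlier steps.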

Topics

Spatial Reasoning
Vision-Language Models
AI Agents
Code as Action Interface
Persistent Kernel
3D/4D Perception

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.