SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning
Summary
SpatialClaw is a training-free framework that improves spatial reasoning in vision-language models (VLMs) by rethinking the action interface for tool-augmented agents. To address the limits of single-pass code execution and structured tool-call interfaces, it uses a persistent Python kernel pre-loaded with the input frames and a suite of perception and geometry primitives. A VLM-backed agent iteratively writes and executes code cells, so it can compose, inspect, and revise perception results based on intermediate text and visual observations. Across 20 diverse spatial reasoning benchmarks, SpatialClaw reached an average accuracy of 59.9%, 11.2 points above recent spatial agents. The gains held across six VLM backbones, ranging from 27B to 397B parameters, without any model- or benchmark-specific adaptation.
Key takeaway
If you are an AI scientist or machine learning engineer building VLM-based agents for complex 3D/4D spatial reasoning, re-evaluate your agent's action interface. Single-pass code and structured tool calls limit flexibility and iterative refinement. SpatialClaw's results suggest that a persistent Python kernel, in which the VLM iteratively generates, executes, and revises code based on intermediate visual and textual feedback, can raise accuracy and generalize across diverse benchmarks without fine-tuning.
Key insights
Adopting code as an iterative action interface in a persistent kernel significantly enhances VLM spatial reasoning capabilities.
Principles
- Action interface design critically impacts agent spatial reasoning.
- Code as an orchestration space enables flexible tool composition.
- Persistent kernel state allows iterative analysis revision.
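A minimal sketch of the persistent-state principle, using only the Python standard library. It is not SpatialClaw's implementation: the `PersistentKernel` class, its `run_cell` method, and the `frames` placeholder are hypothetical names chosen for illustration. The point is that every cell runs in the same namespace, so a later cell can reuse or correct what an earlier cell computed.

```python
import contextlib
import io
import traceback


class PersistentKernel:
    """Toy stand-in for a persistent kernel: one namespace shared by all cells."""

    def __init__(self, preloaded=None):
        # Hypothetical preloaded state, e.g. input frames and perception helpers.
        self.namespace = dict(preloaded or {})

    def run_cell(self, source):
        """Execute one cell and return (stdout_text, error_text_or_None)."""
        buffer = io.StringIO()
        error = None
        with contextlib.redirect_stdout(buffer):
            try:
                exec(source, self.namespace)
            except Exception:
                # Report the failure as text so the agent can revise the cell.
                error = traceback.format_exc()
        return buffer.getvalue(), error


kernel = PersistentKernel(preloaded={"frames": ["frame_0", "frame_1", "frame_2"]})

# Cell 1: compute a value and leave it in kernel state.
out1, err1 = kernel.run_cell("n_frames = len(frames)\nprint(n_frames)")

# Cell 2: a typo (undefined name); the kernel returns the error instead of crashing.
out2, err2 = kernel.run_cell("print(n_frame * 2)")

# Cell 3: the revised cell reuses n_frames from cell 1 without recomputing it.
out3, err3 = kernel.run_cell("print(n_frames * 2)")
```

The failed second cell leaves the earlier state intact, which is what makes revision cheap compared with re-running a single-pass script from scratch.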
Method
SpatialClaw coordinates a VLM agent through a five-stage loop: planning, generating one Python cell per step, executing in a persistent kernel, assembling feedback (text, variables, images), and submitting answers, guided by a unified system prompt.
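The loop described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: `scripted_agent` is a hypothetical stand-in for the VLM, the action dictionaries are an invented format, and the feedback here covers only printed text and new variable names (image feedback is omitted to keep the sketch dependency-free).

```python
import contextlib
import io


def run_cell(source, namespace):
    """Execute one cell in the shared namespace and return what it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        try:
            exec(source, namespace)
        except Exception as exc:
            # Surface errors as text feedback rather than stopping the loop.
            print(f"{type(exc).__name__}: {exc}")
    return buffer.getvalue()


def scripted_agent(history):
    """Hypothetical stand-in for the VLM: picks the next action from past feedback."""
    if not history:
        return {"type": "plan", "text": "Measure the distance between the two points."}
    if len(history) == 1:
        cell = "d = sum((q - p) ** 2 for p, q in zip(a, b)) ** 0.5\nprint(d)"
        return {"type": "code", "source": cell}
    # Read the last cell's printed output and submit it as the answer.
    _, last_feedback = history[-1]
    return {"type": "answer", "text": last_feedback["stdout"].strip()}


def solve(agent, namespace, max_steps=10):
    """Plan, write one cell per step, execute, assemble feedback, then answer."""
    history = []
    for _ in range(max_steps):
        action = agent(history)
        if action["type"] == "answer":
            return action["text"], history
        feedback = {"stdout": "", "variables": []}
        if action["type"] == "code":
            before = set(namespace)
            stdout = run_cell(action["source"], namespace)
            # Names created by this cell, excluding interpreter internals.
            new_vars = sorted(
                k for k in set(namespace) - before if not k.startswith("__")
            )
            feedback = {"stdout": stdout, "variables": new_vars}
        history.append((action, feedback))
    return None, history


# Hypothetical preloaded state: two 3D points standing in for perception outputs.
state = {"a": (0.0, 0.0, 0.0), "b": (3.0, 4.0, 0.0)}
answer, history = solve(scripted_agent, state)
```

In a real system the agent callable would send the system prompt and accumulated history to a VLM, and the feedback dictionary would also carry rendered images of intermediate perception results.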
In practice
- Implement a persistent Python kernel for iterative tool composition.
- Integrate scientific libraries for dynamic numerical computation.
- Provide visual feedback for intermediate perception results.
Topics
- Spatial Reasoning
- Vision-Language Models
- AI Agents
- Code as Action Interface
- Persistent Kernel
- 3D/4D Perception
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.