Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning
Summary
The paper introduces Skill-3D, a framework for agentic 3D spatial reasoning. It targets a specific weakness of multimodal large language model (MLLM) agents: they apply the same tool-use strategy across heterogeneous 3D scenes. Skill-3D learns self-evolving, scene-aware skills by recording tool-use trajectories into a Scene Memory. Successful trajectories from similar scenes are distilled into reusable skills, while failures are attached as lessons. On VSI-Bench, this raises tool utilization from 39% to 78%. Skill-3D boosts Gemini-3-Flash performance by 67% on MMSI-Bench and Qwen3-VL-8B by 43% on VSI-Bench through skill-guided agentic post-training. It also reduces average inference time on VSI-Bench to 20.8s, compared with 35.1s for Think3D.
Key takeaway
AI scientists and machine learning engineers developing MLLM agents for 3D spatial reasoning should consider scene-aware skill learning. As Skill-3D shows, adapting tool-use strategies to the specific scene context improves both tool utilization and reasoning accuracy. Combining a dynamic skill library with agentic post-training can yield substantial gains, such as the 43% boost reported for Qwen3-VL-8B on VSI-Bench, while also reducing inference time.
Key insights
MLLM agents improve 3D spatial reasoning by dynamically learning and applying scene-aware tool-use skills.
Principles
- 3D reasoning tasks are scene-heterogeneous.
- Aggregate successful tool-use trajectories.
- Learn from failed tool-use attempts.
Method
Skill-3D records tool-use trajectories into a Scene Memory, distills successful ones into a Skill Library, and attaches failures as lessons. Skills guide agent inference and are refined through a co-evolutionary loop.
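The record-then-distill flow described above can be illustrated with a minimal Python sketch. This is not the paper's implementation: the class names, the `scene_type` label, the tool names, and the majority-vote distillation rule are all assumptions made for illustration (the summary does not say how Skill-3D groups scenes or how it distills trajectories).

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    scene_type: str        # hypothetical scene label, e.g. "indoor"
    tool_calls: list[str]  # ordered tool names the agent invoked (names are made up)
    success: bool          # whether the final answer was correct


@dataclass
class Skill:
    scene_type: str
    workflow: list[str]                               # distilled tool sequence
    lessons: list[str] = field(default_factory=list)  # notes from failed attempts


class SceneMemory:
    """Stores trajectories grouped by scene type (an assumed grouping key)."""

    def __init__(self):
        self._by_scene = defaultdict(list)

    def record(self, traj: Trajectory) -> None:
        self._by_scene[traj.scene_type].append(traj)

    def distill(self) -> dict[str, Skill]:
        """Build a skill library: one skill per scene type with a success.

        Assumption: the most frequent successful tool sequence stands in for
        distillation; failed sequences are attached as plain-text lessons.
        """
        library = {}
        for scene_type, trajs in self._by_scene.items():
            successes = [t for t in trajs if t.success]
            if not successes:
                continue  # nothing to distill yet for this scene type
            counts = Counter(tuple(t.tool_calls) for t in successes)
            workflow = list(counts.most_common(1)[0][0])
            lessons = [
                "avoid: " + " -> ".join(t.tool_calls)
                for t in trajs
                if not t.success
            ]
            library[scene_type] = Skill(scene_type, workflow, lessons)
        return library


memory = SceneMemory()
memory.record(Trajectory("indoor", ["depth", "measure"], True))
memory.record(Trajectory("indoor", ["depth", "measure"], True))
memory.record(Trajectory("indoor", ["rotate_view"], True))
memory.record(Trajectory("indoor", ["measure"], False))
memory.record(Trajectory("outdoor", ["measure"], False))
library = memory.distill()
```

In this toy run, the "indoor" skill keeps the most common successful workflow and one lesson, while "outdoor" yields no skill because it has no successful trajectory yet. Re-running `distill()` as new trajectories arrive is one simple way to picture the refinement loop.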
In practice
- Implement a Scene Memory for tool-use.
- Distill successful workflows into skills.
- Use failed attempts for skill refinement.
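At inference time, a stored skill has to be matched to the current scene before it can guide the agent. The sketch below shows one hypothetical way to do that: scenes are represented by small feature vectors, the closest skill is picked by cosine similarity, and its workflow and lessons are rendered as guidance text. The vector representation, the `min_sim` threshold, the dictionary layout, and the tool names are assumptions, not details from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_skill(scene_vec, library, min_sim=0.8):
    """Return the most similar skill, or None if nothing reaches min_sim."""
    best, best_sim = None, min_sim
    for skill in library:
        sim = cosine(scene_vec, skill["scene_vec"])
        if sim >= best_sim:
            best, best_sim = skill, sim
    return best


def render_guidance(skill):
    """Turn a skill into plain-text guidance for the agent's prompt."""
    if skill is None:
        return "No matching skill; choose tools freely."
    lines = ["Suggested workflow: " + " -> ".join(skill["workflow"])]
    lines += ["Lesson: " + lesson for lesson in skill["lessons"]]
    return "\n".join(lines)


# Hypothetical two-skill library with made-up scene vectors and tool names.
skills = [
    {"scene_vec": [1.0, 0.0], "workflow": ["depth", "measure"],
     "lessons": ["measure alone failed"]},
    {"scene_vec": [0.0, 1.0], "workflow": ["rotate_view"], "lessons": []},
]

matched = retrieve_skill([0.9, 0.1], skills)
unmatched = retrieve_skill([1.0, 1.0], skills)
print(render_guidance(matched))
```

The threshold gives the agent a fallback: a scene that resembles no stored skill closely enough gets no guidance instead of a misleading one.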
Topics
- Agentic AI
- 3D Spatial Reasoning
- Multimodal Large Language Models
- Tool Use
- Skill Learning
- Reinforcement Learning
- Scene Understanding
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.