Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning

2026-06-08 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

The paper introduces Skill-3D, a framework for agentic 3D spatial reasoning. It targets a specific weakness: multimodal large language model (MLLM) agents tend to apply the same tool-use strategy to every 3D scene, however different the scenes are. Skill-3D learns self-evolving, scene-aware skills by recording tool-use trajectories in a Scene Memory. Successful trajectories from similar scenes are distilled into reusable skills, and failures are attached to them as lessons. The reported effect is a doubling of tool utilization on VSI-Bench, from 39% to 78%. Skill-3D boosts Gemini-3-Flash performance by 67% on MMSI-Bench and Qwen3-VL-8B by 43% on VSI-Bench through skill-guided agentic post-training. Average inference time on VSI-Bench falls to 20.8s, compared with 35.1s for Think3D.

Key takeaway

AI scientists and machine learning engineers building MLLM agents for 3D spatial reasoning should consider scene-aware skill learning. Skill-3D's results suggest that adapting tool-use strategies to the scene context, rather than applying one strategy everywhere, raises both tool utilization and reasoning accuracy. A dynamic skill library combined with agentic post-training produced a reported 43% boost for Qwen3-VL-8B on VSI-Bench, while also shortening inference time.

Key insights

MLLM agents improve 3D spatial reasoning by dynamically learning and applying scene-aware tool-use skills.

Principles

3D reasoning tasks are scene-heterogeneous.
Aggregate successful tool-use trajectories.
Learn from failed tool-use attempts.

Method

Skill-3D records tool-use trajectories into a Scene Memory, distills successful ones into a Skill Library, and attaches failures as lessons. Skills guide agent inference and are refined through a co-evolutionary loop.
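The record-and-distill step described above can be pictured with a small Python sketch. It is illustrative only and not the paper's implementation: the scene descriptor (a set of tags), the Jaccard similarity, the majority-vote distillation, and every class, function, and tool name here are assumptions introduced for the example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    scene_tags: frozenset  # hypothetical scene descriptor, e.g. {"indoor", "cluttered"}
    tools: tuple           # ordered tool calls the agent made (names are made up)
    success: bool
    note: str = ""         # short description of what went wrong, for failed runs


@dataclass
class Skill:
    scene_tags: frozenset
    tools: tuple
    lessons: list = field(default_factory=list)


def jaccard(a, b):
    """Set overlap in [0, 1]; used here as a stand-in for scene similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


class SceneMemory:
    """Append-only store of tool-use trajectories, queried by scene similarity."""

    def __init__(self):
        self.trajectories = []

    def record(self, trajectory):
        self.trajectories.append(trajectory)

    def similar(self, scene_tags, threshold=0.5):
        return [t for t in self.trajectories
                if jaccard(t.scene_tags, scene_tags) >= threshold]


def distill(memory, scene_tags, threshold=0.5):
    """Turn trajectories from similar scenes into one skill plus attached lessons."""
    scene_tags = frozenset(scene_tags)
    neighbours = memory.similar(scene_tags, threshold)
    successes = [t.tools for t in neighbours if t.success]
    if not successes:
        return None  # nothing to distill for this kind of scene yet
    # Assumption: the most frequent successful tool sequence becomes the skill.
    best_tools, _ = Counter(successes).most_common(1)[0]
    lessons = [t.note for t in neighbours if not t.success and t.note]
    return Skill(scene_tags, best_tools, lessons)


memory = SceneMemory()
memory.record(Trajectory(frozenset({"indoor", "cluttered"}), ("depth", "top_view"), True))
memory.record(Trajectory(frozenset({"indoor", "cluttered", "dim"}), ("depth", "top_view"), True))
memory.record(Trajectory(frozenset({"indoor", "cluttered"}), ("top_view",), False,
                         "top view alone hid occluded objects"))
memory.record(Trajectory(frozenset({"outdoor", "open"}), ("rotate_camera",), True))

skill = distill(memory, {"indoor", "cluttered"})
print(skill.tools, skill.lessons)
```

A real system would presumably derive the scene descriptor from visual features and use a model to summarize trajectories; the tag sets and frequency count only stand in for those steps.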

In practice

Implement a Scene Memory for tool-use trajectories.
Distill successful workflows into skills.
Use failed attempts for skill refinement.
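As a rough illustration of how a stored skill might guide the agent at inference time, the sketch below retrieves the best-matching skill for a new scene and renders it as prompt guidance. The tag-overlap scoring, the `min_score` cutoff, the prompt layout, and all names are hypothetical choices for this example and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    scene_tags: frozenset  # hypothetical descriptor of scenes the skill suits
    steps: tuple           # ordered tool-use steps distilled from successes
    lessons: tuple = ()    # failure notes attached to the skill


def overlap(a, b):
    """Jaccard overlap between two tag sets, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(library, scene_tags, min_score=0.3):
    """Return the skill whose scene tags best match, or None if none is close."""
    scene_tags = frozenset(scene_tags)
    scored = [(overlap(s.scene_tags, scene_tags), s) for s in library]
    scored = [(score, s) for score, s in scored if score >= min_score]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]


def render_guidance(skill):
    """Format a skill as plain text that could be prepended to the agent prompt."""
    if skill is None:
        return "No matching skill; fall back to the default tool-use strategy."
    lines = [f"Skill: {skill.name}"]
    lines += [f"{i}. {step}" for i, step in enumerate(skill.steps, 1)]
    lines += [f"Avoid: {lesson}" for lesson in skill.lessons]
    return "\n".join(lines)


library = [
    Skill("cluttered-room", frozenset({"indoor", "cluttered"}),
          ("estimate depth", "render top view"),
          ("top view alone hid occluded objects",)),
    Skill("open-outdoor", frozenset({"outdoor", "open"}),
          ("rotate camera",)),
]

guidance = render_guidance(retrieve(library, {"indoor", "cluttered", "dim"}))
print(guidance)
```

Returning `None` when no skill clears the cutoff keeps the agent on its default strategy for unfamiliar scenes, which is one simple way to avoid applying a poorly matched skill.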

Topics

Agentic AI
3D Spatial Reasoning
Multimodal Large Language Models
Tool Use
Skill Learning
Reinforcement Learning
Scene Understanding

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.