Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Expert, quick

Summary

Skill-3D is a framework for agentic 3D spatial reasoning. Existing agentic methods often misuse tools and show biased tool preferences, so they deliver only marginal gains. The paper traces this to agents applying one uniform tool-use strategy across diverse 3D scenes and tasks. Skill-3D instead learns self-evolving, scene-aware skills: it records each tool-use trajectory in a Scene Memory, aggregates successful paths from similar scenes into reusable skills, and attaches failures as lessons. The memory and the skill library co-evolve: during training, relevant skills guide the agent, and the resulting trajectories refine those skills. In the reported experiments, tool utilization on VSI-Bench rises from 39% to 78%, Skill-3D improves Gemini-3-Flash by 67% on MMSI-Bench, and agentic post-training improves Qwen3-VL-8B by 43% on VSI-Bench.

Key takeaway

Machine learning engineers building agentic 3D spatial reasoning systems should treat a uniform tool-use strategy as suboptimal for heterogeneous 3D scenes. Consider a framework like Skill-3D, which learns and evolves scene-aware skills by aggregating successful trajectories and learning from failures. The paper reports substantially better tool utilization and agent performance, including a 67% boost for Gemini-3-Flash on MMSI-Bench. Prioritize dynamic, context-dependent tool selection over static methods.

Key insights

Skill-3D improves agentic 3D spatial reasoning by learning scene-aware skills from a memory and skill library that evolve together.

Principles

Method

Skill-3D identifies task scenes, records tool-use trajectories into Scene Memory, aggregates successful paths into reusable skills, and attaches failures as lessons. Skills guide agents, and new trajectories refine the co-evolving memory and skill library.
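
The loop described above can be illustrated with a minimal Python sketch. This is not the paper's implementation: the class names (`Trajectory`, `Skill`, `SceneMemory`), the tool names, and the `scene_type` label are all hypothetical. It also makes two simplifying assumptions: scenes count as "similar" only when their labels match exactly, and a skill is simply the most frequent successful tool path. The actual method may use a different notion of scene similarity and a different aggregation step.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Trajectory:
    scene_type: str            # hypothetical scene label, e.g. "indoor_room"
    tools: Tuple[str, ...]     # ordered tool calls the agent made
    success: bool              # whether the final answer was correct


@dataclass
class Skill:
    scene_type: str
    # Most frequent successful tool path for this scene type (None if no successes yet).
    tool_path: Optional[Tuple[str, ...]] = None
    # Distinct failed tool paths, kept as "lessons" to avoid.
    lessons: List[Tuple[str, ...]] = field(default_factory=list)


class SceneMemory:
    """Stores trajectories per scene type and derives a skill from them."""

    def __init__(self) -> None:
        self._records = defaultdict(list)

    def record(self, traj: Trajectory) -> None:
        self._records[traj.scene_type].append(traj)

    def build_skill(self, scene_type: str) -> Skill:
        # Assumption: exact label match stands in for scene similarity.
        records = self._records.get(scene_type, [])
        wins = Counter(t.tools for t in records if t.success)
        best = wins.most_common(1)[0][0] if wins else None
        lessons: List[Tuple[str, ...]] = []
        for t in records:
            if not t.success and t.tools not in lessons:
                lessons.append(t.tools)
        return Skill(scene_type, best, lessons)


# Usage with made-up tool names.
memory = SceneMemory()
full_path = ("detect_objects", "estimate_depth", "measure_distance")
memory.record(Trajectory("indoor_room", full_path, True))
memory.record(Trajectory("indoor_room", full_path, True))
memory.record(Trajectory("indoor_room", ("detect_objects", "measure_distance"), True))
memory.record(Trajectory("indoor_room", ("measure_distance",), False))

skill = memory.build_skill("indoor_room")
print(skill.tool_path)   # the path that succeeded most often
print(skill.lessons)     # failed paths attached as lessons
```

Calling `build_skill` again after recording more trajectories gives a refreshed skill, which is the sense in which memory and skills evolve together in this sketch. In a real agent the skill would be injected into the prompt or policy for the next episode on a similar scene.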

In practice

Best for: Research Scientist, AI Scientist, Computer Vision Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.