RESOURCE2SKILL: Distilling Executable Agent Skills from Human-Created Multimodal Resources

Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Software Development & Engineering · Depth: Expert, extended

Summary

The Resource2Skill framework distills human-created multimodal resources (tutorial videos, code repositories, articles, and reference artifacts) into executable skills for software agents. The skills are organized in a hierarchical multimodal Skill Wiki, where each entry combines structured text, code, visual examples, metadata, and provenance. Because the sources carry complementary signals, agents can retrieve and compose relevant skills at inference time, and can acquire new skills online when the wiki has a gap. Evaluated across seven authoring domains, including slide design, web page generation, and 3D scene creation, Resource2Skill improved the average overall score by 11.9 percentage points over no-skill agents. It also outperformed strong agentic-harness baselines in 26 of 28 model–domain cells, which indicates that structured, multimodal procedural knowledge contributes substantially to agent performance.
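To make the entry structure concrete, here is a minimal sketch of how one Skill Wiki entry might be modeled. The summary only names the components (structured text, code, visual examples, metadata, provenance) and says the wiki is hierarchical. Every class name, field name, and the `build_index` helper below is a hypothetical illustration, not the paper's schema.

```python
from dataclasses import dataclass, field


@dataclass
class Provenance:
    # Hypothetical fields: which kind of resource a skill was distilled from.
    source_type: str  # e.g. "tutorial_video", "code_repository", "article"
    source_ref: str   # opaque pointer back to the original resource


@dataclass
class SkillEntry:
    # Hypothetical schema mirroring the components named in the summary.
    skill_id: str
    path: tuple[str, ...]  # position in the hierarchy, e.g. ("slides", "layout")
    description: str       # structured text
    code: str              # executable body of the skill
    visual_examples: list[str] = field(default_factory=list)
    metadata: dict[str, str] = field(default_factory=dict)
    provenance: list[Provenance] = field(default_factory=list)


def build_index(entries: list[SkillEntry]) -> dict[str, list[str]]:
    """Group skill ids by the top level of their hierarchy path."""
    index: dict[str, list[str]] = {}
    for entry in entries:
        index.setdefault(entry.path[0], []).append(entry.skill_id)
    return index
```

Keeping provenance as a list lets one skill cite several resources, which fits the summary's point that different source types contribute complementary signals.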

Key takeaway

For AI engineers and architects deploying LLM agents in complex software environments, a structured, multimodal skill library is worth integrating. Resource2Skill shows that distilling procedural knowledge from diverse human-created resources, particularly tutorial videos, into an executable Skill Wiki raises the average overall score by 11.9 percentage points over no-skill agents. Prioritize building or adopting hierarchical, extensible skill libraries of this kind to improve agent reliability and to reduce reliance on extensive prompt engineering or learning from agent traces.

Key insights

Distilling multimodal human-created resources into executable, reusable skills significantly enhances software agent performance.

Principles

Method

Resource2Skill first constructs a hierarchical multimodal Skill Wiki from diverse resources. At inference time, agents select skills with MetaBrowse, a two-stage procedure (lexical, then LM), and execute them through an MCP interface. When the wiki lacks a needed skill, the agent can acquire it online.
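The two-stage selection can be sketched roughly as follows. This summary does not describe how MetaBrowse scores candidates, so the sketch makes two assumptions: plain token overlap stands in for the lexical stage, and `judge` is a hypothetical callable standing in for the LM stage. All function names are illustrative.

```python
import re
from typing import Callable


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def lexical_prefilter(query: str, skills: dict[str, str], k: int) -> list[str]:
    """Stage 1 (assumed): keep the k skills sharing the most tokens with the query."""
    query_tokens = tokenize(query)
    scored = []
    for skill_id, description in skills.items():
        overlap = len(query_tokens & tokenize(description))
        if overlap > 0:
            scored.append((overlap, skill_id))
    # Highest overlap first; ties broken by skill id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [skill_id for _, skill_id in scored[:k]]


def select_skills(
    query: str,
    skills: dict[str, str],
    judge: Callable[[str, str], float],
    k: int = 5,
    top_n: int = 2,
) -> list[str]:
    """Stage 2 (assumed): rerank the shortlist with a model-backed relevance score."""
    candidates = lexical_prefilter(query, skills, k)
    ranked = sorted(candidates, key=lambda sid: (-judge(query, skills[sid]), sid))
    return ranked[:top_n]
```

The cheap lexical pass limits how many candidates reach the more expensive `judge` call. An empty result from `select_skills` would be one natural trigger for the online acquisition step, though the summary does not say how gaps are actually detected.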

Best for: Research Scientist, AI Scientist, AI Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.