MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills?
Summary
The MMG2Skill framework addresses the challenge of converting "in-the-wild" human procedural guides, which are often multimodal, noisy, and written for human readers, into agent-executable skills for long-horizon tasks. The work also introduces MMG2Skill-Bench, the first benchmark for this guide-to-skill learning problem. MMG2Skill is a closed-loop system that compiles guides into editable skills, conditions a fixed vision-language model (VLM) agent on those skills during execution, and revises the skills using trajectory-level root-cause feedback without relying on benchmark scores. Across GUI control, open-ended gameplay, and strategic card play, MMG2Skill consistently outperforms vanilla baseline agents, achieving macro-average gains of +12.8 to +25.3 percentage points across six VLM backbones. Ablation studies confirm that structured skill construction and trajectory-driven revision are essential, while direct prompting with raw guides can degrade performance. Analyzer-based early stopping further saves 25%-53% of attempts on success-inferable tasks.
Key takeaway
AI engineers building autonomous agents that rely on human-written instructions should prioritize structured skill distillation and iterative self-revision over direct prompting with raw, multimodal guides. In the reported experiments, this approach yields macro-average gains of +12.8 to +25.3 percentage points over vanilla baseline agents. It can also prevent late-stage performance regressions, especially when combined with analyzer-based early stopping, which saves 25%-53% of attempts on success-inferable tasks.
Key insights
MMG2Skill enables agents to distill and self-evolve executable skills from diverse, real-world human guides, improving task performance.
Principles
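- Structured skill construction is essential: ablations show that performance depends on it.
- Trajectory-driven skill revision improves performance, using root-cause feedback from trajectories rather than benchmark scores.
- Prompting directly with raw guides can degrade agent performance.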
Method
MMG2Skill compiles human guides into editable skills, conditions a fixed VLM agent for execution, and revises skills using trajectory-level root-cause feedback, without relying on benchmark scores.
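The closed loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Skill` schema, `compile_guide`, `evolve_skill`, and the stub agent and analyzer are all hypothetical names introduced here. In the real system a VLM would handle compilation, execution, and analysis; the stubs only show the control flow, in which revision is driven by a trajectory diagnosis and never by a benchmark score.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Skill:
    """Editable, structured skill distilled from a guide (hypothetical schema)."""
    name: str
    steps: List[str]
    notes: List[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        # Render the skill as text that conditions the (fixed) agent.
        lines = [f"Skill: {self.name}"]
        lines += [f"{i}. {step}" for i, step in enumerate(self.steps, 1)]
        lines += [f"Note: {note}" for note in self.notes]
        return "\n".join(lines)


def compile_guide(name: str, guide_text: str) -> Skill:
    # Stand-in for guide compilation: keep each non-empty line as one step.
    steps = [ln.strip() for ln in guide_text.splitlines() if ln.strip()]
    return Skill(name=name, steps=steps)


def evolve_skill(
    skill: Skill,
    run_agent: Callable[[str], List[str]],
    analyze: Callable[[List[str]], Optional[str]],
    max_rounds: int = 3,
) -> Skill:
    # Execute, diagnose the trajectory, revise the skill; repeat.
    for _ in range(max_rounds):
        trajectory = run_agent(skill.to_prompt())
        root_cause = analyze(trajectory)  # None means no failure was diagnosed
        if root_cause is None:
            break
        skill.notes.append(root_cause)  # revision: record the corrective note
    return skill


# Stub agent: only saves the file if the prompt mentions saving.
def fake_agent(prompt: str) -> List[str]:
    actions = ["open_editor", "type_text"]
    if "save" in prompt.lower():
        actions.append("save_file")
    return actions


# Stub analyzer: inspects the trajectory only, not any task score.
def fake_analyzer(actions: List[str]) -> Optional[str]:
    if "save_file" not in actions:
        return "Save the file before closing."
    return None


skill = compile_guide("write_note", "Open the editor\n\nType the text\n")
skill = evolve_skill(skill, fake_agent, fake_analyzer)
print(skill.to_prompt())
```

In this toy run the first attempt omits the save step, the analyzer reports that as the root cause, and the revised skill carries a note that changes the agent's behavior on the second attempt. The agent itself is never modified; only the skill text evolves.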
In practice
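- Compile guides into structured, editable skills instead of passing raw guides to the agent.
- Add a feedback loop that revises skills based on trajectory-level root-cause analysis.
- Use analyzer-based early stopping to avoid wasted attempts on success-inferable tasks.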
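As a rough illustration of the early-stopping idea, the sketch below stops retrying as soon as an analyzer judges from the trajectory that the task has succeeded. The summary does not describe the actual mechanism, so `run_with_early_stopping`, `attempts_saved`, and the stub functions are hypothetical and only show how saved attempts might be counted against a fixed budget.

```python
from typing import Callable, List, Tuple


def run_with_early_stopping(
    attempt: Callable[[int], List[str]],
    looks_successful: Callable[[List[str]], bool],
    budget: int,
) -> Tuple[int, bool]:
    """Run up to `budget` attempts; stop once the analyzer infers success."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    for used in range(1, budget + 1):
        trajectory = attempt(used)
        if looks_successful(trajectory):
            return used, True  # stopped early (or on the final attempt)
    return budget, False  # budget exhausted without inferred success


def attempts_saved(used: int, budget: int) -> float:
    # Fraction of the attempt budget that was not spent.
    return (budget - used) / budget


# Stub attempt: the third try is the first to reach the "done" state.
def fake_attempt(index: int) -> List[str]:
    return ["click", "done"] if index >= 3 else ["click"]


# Stub analyzer: infers success from the trajectory alone.
def fake_success_check(trajectory: List[str]) -> bool:
    return "done" in trajectory


used, succeeded = run_with_early_stopping(fake_attempt, fake_success_check, budget=4)
print(used, succeeded, attempts_saved(used, 4))
```

This only applies to tasks where success can be inferred from the trajectory; otherwise the loop simply runs through the full budget.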
Topics
- Guide-to-Skill Learning
- Autonomous Agents
- Vision-Language Models
- Skill Acquisition
- Trajectory-Driven Revision
- MMG2Skill-Bench
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.