MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills?

2026-06-01 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

The MMG2Skill framework addresses the challenge of converting "in-the-wild" human procedural guides, which are often multimodal, noisy, and human-oriented, into agent-executable skills for long-horizon tasks. It introduces MMG2Skill-Bench, the first benchmark for this guide-to-skill learning problem. MMG2Skill is a closed-loop system that compiles guides into editable skills, conditions a fixed vision-language model (VLM) agent during execution, and revises skills using trajectory-level root-cause feedback without relying on benchmark scores. Across GUI control, open-ended gameplay, and strategic card play, MMG2Skill consistently outperforms vanilla baseline agents, achieving macro-average gains of +12.8 to +25.3 percentage points across six VLM backbones. Ablation studies confirm that structured skill construction and trajectory-driven revision are essential, while direct prompting with raw guides can degrade performance. Analyzer-based early stopping further saves 25%-53% of attempts on success-inferable tasks.

Key takeaway

AI engineers building autonomous agents that rely on human-written instructions should prioritize structured skill distillation and iterative self-revision over direct prompting with raw, multimodal guides. In the reported results, this approach yields macro-average gains of +12.8 to +25.3 percentage points across six VLM backbones. It can also prevent late-stage performance regressions, particularly when paired with analyzer-based early stopping, which saves 25%-53% of attempts on success-inferable tasks.

Key insights

MMG2Skill enables agents to distill and self-evolve executable skills from diverse, real-world human guides, improving task performance.

Principles

Structured skill construction is vital.
Trajectory-driven skill revision improves performance.
Raw guide prompting can hinder agents.

Method

MMG2Skill compiles human guides into editable skills, conditions a fixed VLM agent for execution, and revises skills using trajectory-level root-cause feedback, without relying on benchmark scores.
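
The closed loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Skill`, `compile_guide`, `revise`, and `run_loop` are hypothetical, the "compiler" is a toy line splitter, and the agent and analyzer are stand-in stubs. The one property it tries to mirror is that revision is driven by a root cause read from the trajectory, never by a benchmark score.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Skill:
    """Hypothetical editable skill: ordered steps plus revision notes."""
    steps: List[str]
    notes: List[str] = field(default_factory=list)


def compile_guide(guide_text: str) -> Skill:
    """Toy compiler (assumption): each non-empty guide line becomes one step."""
    steps = [line.strip() for line in guide_text.splitlines() if line.strip()]
    return Skill(steps=steps)


def revise(skill: Skill, root_cause: str) -> Skill:
    """Return a new skill with the diagnosed root cause recorded as a note."""
    return Skill(steps=list(skill.steps), notes=skill.notes + [root_cause])


def run_loop(
    guide_text: str,
    execute: Callable[[Skill], List[str]],
    analyze: Callable[[List[str]], Optional[str]],
    max_rounds: int = 3,
) -> Skill:
    """Compile once, then alternate execution and trajectory-driven revision."""
    skill = compile_guide(guide_text)
    for _ in range(max_rounds):
        trajectory = execute(skill)       # the agent itself stays fixed
        root_cause = analyze(trajectory)  # inspects the trajectory, not a score
        if root_cause is None:            # nothing left to fix
            break
        skill = revise(skill, root_cause)
    return skill


# Stand-in stubs: the "agent" fails until the skill carries a revision note.
def fake_execute(skill: Skill) -> List[str]:
    return ["opened menu", "ok" if skill.notes else "clicked wrong button"]


def fake_analyze(trajectory: List[str]) -> Optional[str]:
    if "clicked wrong button" in trajectory:
        return "wrong button: use the Save icon"
    return None


final = run_loop("Open menu\n\nClick save", fake_execute, fake_analyze)
print(final.steps)
print(final.notes)
```

In a real system the stubs would be replaced by a VLM agent rollout and a trajectory analyzer; only the skill object changes between rounds.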

In practice

Apply structured skill compilation for agents.
Implement feedback loops for skill revision.
Utilize early stopping for efficiency.
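
As a rough illustration of how early stopping translates into saved attempts, the sketch below stops each task at the first attempt the analyzer judges successful and reports the unused share of the budget. The function names, the per-task budget, and the example outcomes are all made up for illustration; the summary does not specify how the 25%-53% figure is computed, so this accounting is an assumption.

```python
from typing import Sequence


def attempts_used(inferred_success: Sequence[bool], budget: int) -> int:
    """Attempts consumed when stopping at the first analyzer-inferred success."""
    for attempt, ok in enumerate(inferred_success[:budget], start=1):
        if ok:
            return attempt
    return budget  # never inferred success: the full budget is spent


def fraction_saved(tasks: Sequence[Sequence[bool]], budget: int) -> float:
    """Share of the total attempt budget left unused across all tasks."""
    used = sum(attempts_used(task, budget) for task in tasks)
    return 1 - used / (budget * len(tasks))


# Illustrative outcomes only (not from the paper): one flag per attempt.
example_tasks = [
    [False, True, False, False],   # stops after attempt 2
    [False, False, False, False],  # uses all 4 attempts
]
print(fraction_saved(example_tasks, budget=4))
```

Here 6 of 8 budgeted attempts are used, so a quarter of the budget is saved; tasks where success cannot be inferred from the trajectory would gain nothing from this rule.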

Topics

Guide-to-Skill Learning
Autonomous Agents
Vision-Language Models
Skill Acquisition
Trajectory-Driven Revision
MMG2Skill-Bench

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.