MMSkills: Towards Multimodal Skills for General Visual Agents
Summary
The MMSkills framework introduces a novel approach for representing, generating, and utilizing reusable multimodal procedures to enhance runtime visual decision-making in general visual agents. Unlike existing skill packages that rely on textual prompts or code, MMSkills packages are compact, state-conditioned units that combine textual procedures with runtime state cards and multi-view keyframes. The framework addresses challenges in defining multimodal skill content, deriving packages from public interaction data, and enabling efficient agent consultation of multimodal evidence during inference. An agentic trajectory-to-skill Generator constructs these packages by transforming non-evaluation trajectories through workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing. A branch-loaded multimodal skill agent then uses these skills by inspecting selected state cards and keyframes, aligning them with the live environment, and distilling structured guidance for the main agent. Experiments on GUI and game-based benchmarks demonstrate that MMSkills consistently improves both frontier and smaller multimodal agents.
Key takeaway
For research scientists developing general visual agents, MMSkills offers a way to integrate multimodal procedural knowledge rather than relying on text-only skill representations. Consider adopting the framework for tasks that require complex visual decision-making. In the reported experiments on GUI and game-based benchmarks, it improved both frontier and smaller multimodal agents, complementing the model's internal priors with external, visually grounded guidance.
Key insights
MMSkills enhances visual agents by packaging procedural knowledge multimodally: textual procedures are combined with runtime state cards and multi-view keyframes.
Principles
- Procedural knowledge is inherently multimodal for visual agents.
- External multimodal knowledge complements model-internal priors.
Method
The MMSkills Generator builds skills from non-evaluation trajectories via workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing. A branch-loaded multimodal skill agent then consults those skills at runtime and distills structured guidance for the main agent.
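To make the four generation stages concrete, here is a minimal Python sketch of how they might chain together. It is not the authors' implementation: every name (`Trajectory`, `SkillDraft`, `generate_skills`, and so on) is hypothetical, the `workflow` label is assumed to be given rather than inferred, and each stage is replaced by a deliberately simple heuristic where the paper presumably uses model-driven steps.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Step:
    action: str   # e.g. "open File menu"
    frame: str    # id or path of the screenshot seen before the action


@dataclass
class Trajectory:
    workflow: str  # hypothetical label; a real system would have to infer it
    steps: list


@dataclass
class SkillDraft:
    workflow: str
    procedure: list = field(default_factory=list)
    keyframes: dict = field(default_factory=dict)  # action -> frame


def group_by_workflow(trajectories):
    # Stage 1 (workflow grouping): bucket trajectories by their label.
    groups = defaultdict(list)
    for traj in trajectories:
        groups[traj.workflow].append(traj)
    return dict(groups)


def induce_procedure(group):
    # Stage 2 (procedure induction), stand-in heuristic: keep actions that
    # occur in every trajectory, ordered as in the first trajectory.
    shared = set.intersection(*({s.action for s in t.steps} for t in group))
    procedure = []
    for step in group[0].steps:
        if step.action in shared and step.action not in procedure:
            procedure.append(step.action)
    return procedure


def ground_visually(procedure, group):
    # Stage 3 (visual grounding): attach the first frame seen for each step.
    keyframes = {}
    for traj in group:
        for step in traj.steps:
            if step.action in procedure:
                keyframes.setdefault(step.action, step.frame)
    return keyframes


def audit(draft):
    # Stage 4 (auditing), stand-in check: the procedure is non-empty and
    # every step has a keyframe.
    return bool(draft.procedure) and all(a in draft.keyframes for a in draft.procedure)


def generate_skills(trajectories):
    skills = []
    for workflow, group in group_by_workflow(trajectories).items():
        procedure = induce_procedure(group)
        draft = SkillDraft(workflow, procedure, ground_visually(procedure, group))
        if audit(draft):
            skills.append(draft)
    return skills


# Toy data: two runs of the same workflow plus one empty run that fails the audit.
demo = [
    Trajectory("export_pdf", [Step("open File menu", "a0.png"),
                              Step("click Export", "a1.png"),
                              Step("choose PDF", "a2.png")]),
    Trajectory("export_pdf", [Step("open File menu", "b0.png"),
                              Step("dismiss popup", "b1.png"),
                              Step("click Export", "b2.png"),
                              Step("choose PDF", "b3.png")]),
    Trajectory("empty_run", []),
]
skills = generate_skills(demo)
```

In this toy run, the one-off "dismiss popup" action is dropped during induction because it does not appear in every trajectory, and the empty run is rejected by the audit, which illustrates why an auditing stage after grounding is useful.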
In practice
- Represent skills with text, state cards, and keyframes.
- Derive skills from public interaction trajectories.
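The representation listed above can be sketched as a small data structure. This is an illustrative guess at the shape of a skill package, not the format defined in the paper: the class names, the `cue_terms` field, the view names, and the substring-matching rule in `cards_for` are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Keyframe:
    view: str        # hypothetical view name, e.g. "full_screen" or "widget_crop"
    image_path: str


@dataclass(frozen=True)
class StateCard:
    state: str             # short name of the runtime state
    cue_terms: tuple       # words expected in a description of that state
    guidance: str          # what to do when the agent is in this state
    keyframes: tuple = ()  # multi-view visual evidence for the state


@dataclass
class SkillPackage:
    name: str
    procedure: list                                   # textual steps
    state_cards: list = field(default_factory=list)

    def cards_for(self, observation):
        # Stand-in for selective consultation: return only the cards whose
        # cue terms all appear in the text description of the live screen.
        text = observation.lower()
        return [c for c in self.state_cards
                if all(term in text for term in c.cue_terms)]


package = SkillPackage(
    name="export_pdf",
    procedure=["Open the File menu", "Click Export", "Choose PDF and confirm"],
    state_cards=[
        StateCard("menu_open", ("file", "menu"), "Click Export",
                  (Keyframe("full_screen", "menu_full.png"),)),
        StateCard("dialog_open", ("export", "dialog"), "Choose PDF and confirm",
                  (Keyframe("full_screen", "dialog_full.png"),
                   Keyframe("widget_crop", "dialog_format.png"))),
    ],
)

matched = package.cards_for("Export dialog is visible with a format dropdown")
```

Here only the `dialog_open` card is returned, so the agent would inspect two keyframes instead of the whole package. A real system would presumably compare the cards against the live screenshot itself rather than a text description.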
Topics
- MMSkills Framework
- Multimodal Procedural Knowledge
- General Visual Agents
- Skill Generation
- Runtime Decision Making
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.