VISUALSKILL: Multimodal Skills for Computer-Use Agents

2026-05-20 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

VisualSkill introduces a hierarchical multimodal skill artifact that helps computer-use agents (CUAs) handle long-horizon tasks and unseen software. Unlike existing text-only skill libraries, it pairs textual procedures with visual figures tailored to each target application. Skills are built by a two-stage pipeline that combines mining of authored documentation with live-application UI exploration. On CUA-World and OSExpert-Eval, a Claude Code CLI agent using Claude Opus 4.6 achieved an average score of 0.456 with VisualSkill, a 15.3-point absolute increase over the 0.303 no-skill baseline. It also gained 8.3 points over a matched text-only skill (0.373), which indicates that visual figures help with UI element identification and workflow state verification. The agent fetches relevant text and figures on demand through a load_topic MCP tool.
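The point gains quoted above follow directly from the three reported scores. The short check below reproduces them; the split into a text share and a figure share is simple subtraction on the reported averages, not an ablation result from the paper, and the function and variable names are illustrative.

```python
# Average scores as reported in the summary (0-1 scale).
NO_SKILL = 0.303
TEXT_ONLY = 0.373
VISUAL_SKILL = 0.456


def gain_in_points(new: float, old: float) -> float:
    """Absolute gain in points, where 1 point = 0.01 on the 0-1 scale."""
    return round((new - old) * 100, 1)


gain_vs_baseline = gain_in_points(VISUAL_SKILL, NO_SKILL)   # 15.3
gain_vs_text = gain_in_points(VISUAL_SKILL, TEXT_ONLY)      # 8.3
text_vs_baseline = gain_in_points(TEXT_ONLY, NO_SKILL)      # 7.0

print(f"VisualSkill vs no skill:  +{gain_vs_baseline} points")
print(f"VisualSkill vs text-only: +{gain_vs_text} points")
print(f"Text-only vs no skill:    +{text_vs_baseline} points")
```

Read this way, the matched text-only skill accounts for 7.0 of the 15.3 points and adding figures accounts for the remaining 8.3.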

Key takeaway

For AI engineers building computer-use agents, text-only skill libraries are likely to limit performance on complex or novel UIs. Prioritize multimodal skill artifacts that embed visual figures directly alongside procedural text. In the reported evaluation, this approach, built with a two-stage pipeline that includes live UI exploration, raised agent scores over both the no-skill and the text-only baselines, with the gain attributed to better UI element identification and workflow state verification.

Key insights

Multimodal skills that integrate visual figures improve computer-use agents' performance on graphical user interfaces: in the reported evaluation, the average score rose from 0.373 with a matched text-only skill to 0.456 with VisualSkill.

Principles

Visual figures are critical for UI element identification.
Hierarchical skill indexing enables on-demand topic retrieval.
Combine documentation mining with live UI exploration.
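
The hierarchical-indexing principle above can be sketched as a small topic tree: the agent first sees only an outline of titles, then fetches one topic's text and figures by path. This is a minimal illustration in plain Python, not the VisualSkill implementation. The Topic class, the outline helper, the slash-separated paths, and the example application are all assumptions, and the load_topic function here is only a local stand-in for the MCP tool of the same name.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """One node of a hypothetical skill tree."""
    title: str
    text: str = ""
    figures: list[str] = field(default_factory=list)  # paths to image files
    children: dict[str, "Topic"] = field(default_factory=dict)


def outline(topic: Topic, depth: int = 0) -> list[str]:
    """Return titles only: the cheap index an agent could read up front."""
    lines = ["  " * depth + topic.title]
    for child in topic.children.values():
        lines.extend(outline(child, depth + 1))
    return lines


def load_topic(root: Topic, path: str) -> dict:
    """Fetch one topic's text and figure paths by slash-separated path."""
    node = root
    for key in path.split("/"):
        if key not in node.children:
            raise KeyError(f"unknown topic: {path!r}")
        node = node.children[key]
    return {"title": node.title, "text": node.text, "figures": list(node.figures)}


# Hypothetical skill for an imaginary application.
skill = Topic("ExampleApp", children={
    "export": Topic("Export", text="Export options live under the File menu.", children={
        "pdf": Topic(
            "Export to PDF",
            text="Open File > Export, choose PDF, then confirm.",
            figures=["figures/export_pdf_dialog.png"],
        ),
    }),
    "layers": Topic("Layers", text="Use the Layers panel to reorder content."),
})

print("\n".join(outline(skill)))
print(load_topic(skill, "export/pdf"))
```

Keeping figures as file paths rather than inline data means the outline stays small, and images are only read when a topic is actually requested.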

Method

VisualSkill builds hierarchical multimodal skills using a two-stage pipeline: Stage 1 mines authored documentation, and Stage 2 enriches the result with live-application UI exploration, which includes both free and trajectory-targeted passes. Skills are loaded on demand via an MCP tool.
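
As a rough sketch of how the two stages might compose, the example below builds text-only topics from documentation sections and then attaches screenshots gathered during exploration. Every name here (SkillTopic, Observation, mine_documentation, enrich_with_exploration, build_skill) is hypothetical, and the choice to create an empty topic for UI states that the documentation does not cover is an assumption, not something the summary specifies.

```python
from dataclasses import dataclass, field


@dataclass
class SkillTopic:
    name: str
    text: str
    figures: list[str] = field(default_factory=list)


@dataclass
class Observation:
    topic: str
    screenshot: str
    source: str  # "free" or "trajectory"; kept for provenance only


def mine_documentation(doc_sections: dict[str, str]) -> dict[str, SkillTopic]:
    """Stage 1 (assumed shape): one text-only topic per documentation section."""
    return {name: SkillTopic(name, body.strip()) for name, body in doc_sections.items()}


def enrich_with_exploration(
    skill: dict[str, SkillTopic], observations: list[Observation]
) -> dict[str, SkillTopic]:
    """Stage 2 (assumed shape): attach screenshots from the live application."""
    for obs in observations:
        # Topics seen only during exploration start with empty text.
        topic = skill.setdefault(obs.topic, SkillTopic(obs.topic, text=""))
        if obs.screenshot not in topic.figures:  # skip duplicate screenshots
            topic.figures.append(obs.screenshot)
    return skill


def build_skill(doc_sections: dict[str, str], observations: list[Observation]):
    return enrich_with_exploration(mine_documentation(doc_sections), observations)


docs = {"export/pdf": "Open File > Export and choose PDF.\n"}
observations = [
    Observation("export/pdf", "shots/export_dialog.png", "free"),
    Observation("export/pdf", "shots/export_dialog.png", "trajectory"),
    Observation("print/setup", "shots/print_setup.png", "trajectory"),
]
skill = build_skill(docs, observations)
for topic in skill.values():
    print(topic)
```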

In practice

Embed UI screenshots directly into agent skill guides.
Implement on-demand skill loading via an MCP tool.
Use failed task trajectories to target skill enrichment.
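
One way to act on the last item is to count which skill topics appear most often in failed runs and enrich those first. The sketch below assumes each trajectory record carries a success flag and the list of topics the agent loaded; that record format, the field names, and the ranking rule are illustrative assumptions rather than the paper's procedure.

```python
from collections import Counter


def rank_topics_for_enrichment(trajectories: list[dict], top_k: int = 3) -> list[str]:
    """Rank topics by how many failed trajectories touched them."""
    failures = Counter()
    for run in trajectories:
        if not run["success"]:
            # Count each topic once per failed run.
            failures.update(set(run["topics"]))
    # Sort by failure count (descending), then by name for a stable order.
    ranked = sorted(failures.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:top_k]]


# Hypothetical trajectory log.
trajectories = [
    {"task": "export-report", "success": False, "topics": ["export/pdf", "file/save"]},
    {"task": "resize-image", "success": True, "topics": ["image/resize"]},
    {"task": "export-slides", "success": False, "topics": ["export/pdf"]},
    {"task": "mail-merge", "success": False,
     "topics": ["file/save", "export/pdf", "mail/merge"]},
]

targets = rank_topics_for_enrichment(trajectories, top_k=2)
print(targets)  # ['export/pdf', 'file/save']
```

The ranked list could then drive a trajectory-targeted exploration pass that captures fresh screenshots for just those topics.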

Topics

VisualSkill
Computer-Use Agents
Multimodal Skills
GUI Interaction
Agent Benchmarking
LLM Tool Use

Code references

XMHZZ2018/VisualSkills

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.