Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

Workflow-GYM is a new benchmark designed to rigorously evaluate AI agents on long-horizon, real-world professional computer-use tasks involving graphical user interfaces and specialized software. It comprises 338 tasks across 6 top-level domains and 23 subdomains, with workflows requiring 30 to 110 atomic actions each. Experiments with models such as GPT-5.4-xhigh, Gemini-3.1-Pro, and Kimi-k2.6 show that even the strongest models reach success rates only slightly above 30%. These results highlight how hard it remains for current GUI agents to maintain consistency over a long workflow, contain error propagation, avoid objective drift, and understand professional software environments. The benchmark exposes a substantial gap between current agent capabilities and the demands of economically valuable, end-to-end professional work.

Key takeaway

AI engineers building GUI agents for professional applications should note that current models reach only about 30% success on long-horizon, domain-specific workflows. Development effort should prioritize mitigating error propagation, objective drift, and gaps in software knowledge. Architectures that support continuous visual feedback during GUI manipulation deserve particular attention, because the discrete observation paradigm is a fundamental bottleneck for complex, fine-grained operations. Structured procedural guidance, such as step-by-step text or video, can also significantly improve agent reliability.

Key insights

Current AI agents struggle with long-horizon, domain-specific GUI workflows: even the strongest models reach only slightly above 30% success.

Method

Workflow-GYM tasks are sourced from domain experts and filtered on four criteria: realism, domain-specificity, complexity (at least 30 actions), and verifiability. Tasks that pass are instantiated in virtual machines with expert-provided instructions and procedures.
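The filtering step described above can be illustrated with a minimal sketch. This is not the benchmark's actual tooling: the `CandidateTask` record, its field names, and the `passes_filters` and `filter_tasks` helpers are hypothetical, and the realism, domain-specificity, and verifiability judgments are assumed to arrive as expert-assigned booleans. Only the 30-action complexity threshold comes from the summary itself.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Complexity threshold stated in the summary (tasks need >= 30 actions).
MIN_ATOMIC_ACTIONS = 30


@dataclass(frozen=True)
class CandidateTask:
    """Hypothetical record for an expert-submitted task (illustrative only)."""
    name: str
    is_realistic: bool        # assumed expert judgment
    is_domain_specific: bool  # assumed expert judgment
    num_atomic_actions: int   # length of the expert's reference procedure
    is_verifiable: bool       # assumed: the final state can be checked


def passes_filters(task: CandidateTask) -> bool:
    """Return True only if the task meets all four criteria."""
    return (
        task.is_realistic
        and task.is_domain_specific
        and task.num_atomic_actions >= MIN_ATOMIC_ACTIONS
        and task.is_verifiable
    )


def filter_tasks(candidates: Iterable[CandidateTask]) -> List[CandidateTask]:
    """Keep the candidates that satisfy every criterion, preserving order."""
    return [task for task in candidates if passes_filters(task)]


# Made-up candidates, purely for illustration.
candidates = [
    CandidateTask("build_financial_model", True, True, 64, True),
    CandidateTask("rename_one_file", True, False, 3, True),
    CandidateTask("edit_video_timeline", True, True, 45, False),
    CandidateTask("draft_cad_assembly", True, True, 30, True),
]

kept = filter_tasks(candidates)
print([task.name for task in kept])
```

In this sketch a task is dropped as soon as any single criterion fails, so a long workflow whose outcome cannot be verified is excluded just as a short one would be.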

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.