Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields
Summary
Workflow-GYM is a new benchmark designed to rigorously evaluate AI agents on long-horizon, real-world professional computer-use tasks involving graphical user interfaces and specialized software. It comprises 338 tasks across 6 top-level domains and 23 subdomains, with workflows that require 30 to 110 atomic actions. Experiments with models such as GPT-5.4-xhigh, Gemini-3.1-Pro, and Kimi-k2.6 show that even the strongest models achieve success rates only slightly above 30%. The results point to significant challenges for current GUI agents: maintaining consistency over long-horizon workflows, managing error propagation, preventing objective drift, and understanding professional software environments. The benchmark exposes a substantial gap between current agent capabilities and the demands of economically valuable, end-to-end professional work.
Key takeaway
If you are an AI engineer developing GUI agents for professional applications, note that current models achieve only ~30% success on long-horizon, domain-specific workflows. Prioritize mitigating error propagation, objective drift, and gaps in software knowledge. Also explore architectures that support continuous visual feedback during GUI manipulation, because the discrete observation paradigm is a fundamental bottleneck for complex, fine-grained operations. Structured procedural guidance, such as step-by-step text or video, can also significantly improve agent reliability.
Key insights
Current AI agents struggle with long-horizon, domain-specific GUI workflows, achieving only ~30% success.
Principles
- Long-horizon GUI tasks amplify failure modes like error propagation and objective drift.
- Discrete observation-action paradigms fundamentally limit continuous GUI manipulation.
- For GUI tasks, the best agentic framework configuration is tightly coupled to the model architecture.
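To see why long horizons amplify error propagation, a rough back-of-the-envelope model helps. The sketch below assumes every atomic action succeeds independently with the same probability, which is a simplification introduced here for illustration and not something the benchmark reports. The function names are hypothetical.

```python
def workflow_success(per_step_success: float, num_steps: int) -> float:
    """Probability that all steps succeed.

    Assumption (not from the source): steps fail independently
    and share one per-step success probability.
    """
    return per_step_success ** num_steps


def required_per_step_success(target: float, num_steps: int) -> float:
    """Per-step success needed to reach `target` over `num_steps` steps,
    under the same independence assumption."""
    return target ** (1.0 / num_steps)


if __name__ == "__main__":
    # 30 and 110 are the workflow lengths quoted in the summary.
    for n in (30, 110):
        p = required_per_step_success(0.30, n)
        print(f"{n} steps: per-step success needed for 30% overall = {p:.4f}")
```

Under this toy model, even an agent that gets 99% of individual actions right completes only about a third of 110-step workflows, which shows how small per-step error rates compound. Real agents can recover from some mistakes and their errors are correlated, so treat this as intuition only.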
Method
Workflow-GYM tasks are sourced from domain experts, filtered for realism, domain-specificity, complexity (≥30 actions), and verifiability, then instantiated in virtual machines with expert-provided instructions and procedures.
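The filtering stage described above can be pictured as a simple predicate over candidate tasks. This is a minimal sketch, not the authors' pipeline: the `CandidateTask` fields and the `passes_filters` helper are hypothetical names, and only the four criteria and the 30-action threshold come from the summary.

```python
from dataclasses import dataclass
from typing import List

MIN_ACTIONS = 30  # complexity threshold stated in the summary


@dataclass
class CandidateTask:
    # All field names are illustrative assumptions.
    name: str
    is_realistic: bool
    is_domain_specific: bool
    num_atomic_actions: int
    is_verifiable: bool


def passes_filters(task: CandidateTask) -> bool:
    """Keep a task only if it meets all four criteria."""
    return (
        task.is_realistic
        and task.is_domain_specific
        and task.num_atomic_actions >= MIN_ACTIONS
        and task.is_verifiable
    )


def filter_tasks(tasks: List[CandidateTask]) -> List[CandidateTask]:
    """Return the tasks that survive every filter, in input order."""
    return [t for t in tasks if passes_filters(t)]
```

In the actual benchmark, judgments such as realism and verifiability presumably come from expert review, so the boolean fields here stand in for that human step.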
In practice
- Providing textual step-by-step procedures substantially improves agent performance.
- Video tutorials offer additional gains for fine-grained interaction details.
- Focus agent development on continuous visual feedback for complex GUI operations.
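One way to act on the first point is to attach an expert-written procedure to the task instruction before handing it to the agent. The sketch below is an assumption about how such a prompt might be assembled. `build_agent_prompt` and the prompt layout are hypothetical and do not reflect the benchmark's actual harness.

```python
from typing import Optional, Sequence


def build_agent_prompt(
    instruction: str,
    procedure: Optional[Sequence[str]] = None,
) -> str:
    """Combine a task instruction with optional step-by-step guidance.

    The layout is illustrative; real harnesses may format prompts differently.
    """
    parts = [f"Task: {instruction.strip()}"]
    if procedure:
        # Number the steps so the agent can track its progress.
        steps = "\n".join(
            f"{i}. {step.strip()}" for i, step in enumerate(procedure, start=1)
        )
        parts.append("Follow this procedure:\n" + steps)
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(
        build_agent_prompt(
            "Export the report as PDF",
            ["Open the File menu", "Choose Export", "Select PDF and confirm"],
        )
    )
```

Numbering the steps gives the agent an explicit progress marker, which may help against objective drift over long workflows, though that effect would need to be measured.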
Topics
- GUI Agents
- Long-Horizon Tasks
- Professional Workflows
- Benchmark Evaluation
- Multimodal AI
- Failure Analysis
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.