MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

MAS-Bench is a benchmark for evaluating hybrid GUI-shortcut agents in the mobile domain. It addresses the lack of systematic benchmarks for agents that combine flexible Graphical User Interface (GUI) operations with efficient shortcuts such as APIs, deep links, and Robotic Process Automation (RPA) scripts. The benchmark comprises 139 complex tasks across 11 real-world Android applications and a knowledge base of 88 predefined shortcuts. Beyond the use of predefined shortcuts, MAS-Bench also assesses an agent's ability to autonomously generate new, reusable workflows. In experiments with the Gemini-2.5-Pro model, hybrid agents achieve a 64.1% success rate, compared with 44.6% for GUI-only agents, and are over 40% more efficient. The benchmark also reveals a performance gap between robust predefined shortcuts and less reliable agent-generated ones, which points to areas for future research.
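To put the two reported success rates side by side, the short calculation below derives the absolute and relative gain of hybrid over GUI-only agents. It is only arithmetic on the figures quoted in the summary; the function name is illustrative, and the result is a success-rate comparison, not the separately reported efficiency figure, whose definition the summary does not give.

```python
from typing import Tuple


def success_rate_gain(hybrid_pct: float, gui_only_pct: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain as a fraction)."""
    absolute = hybrid_pct - gui_only_pct
    relative = absolute / gui_only_pct
    return absolute, relative


# Success rates quoted in the summary (Gemini-2.5-Pro experiments).
absolute_pp, relative = success_rate_gain(64.1, 44.6)
print(f"absolute gain: {absolute_pp:.1f} percentage points")  # 19.5
print(f"relative gain: {relative:.1%}")                       # 43.7%
```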

Key takeaway

Research scientists developing mobile GUI agents should prioritize integrating hybrid GUI-shortcut operations, which improve both task success rates and operational efficiency. The focus should extend beyond using existing shortcuts to building robust frameworks that autonomously generate new, efficient shortcuts, especially for repetitive sub-tasks. This approach should yield more capable and adaptable agents, and it particularly benefits less powerful base models.

Key insights

Hybrid GUI-shortcut agents outperform GUI-only agents on mobile tasks in both success rate (64.1% vs. 44.6% with Gemini-2.5-Pro) and efficiency (over 40% greater).

Method

MAS-Bench evaluates agents in a dynamic Android environment using 139 tasks, 88 predefined shortcuts (APIs, deep links, and RPA scripts), and 7 metrics. It also includes a framework for assessing the quality of agent-generated shortcuts.
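A minimal sketch of the hybrid idea follows, assuming a planner that uses a predefined shortcut when one covers a sub-task and otherwise falls back to GUI actions. Every name here (`Shortcut`, `ShortcutKnowledgeBase`, `plan_steps`, the intents and step counts) is hypothetical and chosen for illustration; none of it is taken from the MAS-Bench code or its actual metrics.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional, Tuple


class ShortcutKind(Enum):
    """The three shortcut types named in the summary."""
    API = "api"
    DEEP_LINK = "deep_link"
    RPA_SCRIPT = "rpa_script"


@dataclass(frozen=True)
class Shortcut:
    name: str           # hypothetical identifier
    kind: ShortcutKind
    intent: str         # sub-task this shortcut covers (assumed lookup key)


class ShortcutKnowledgeBase:
    """Toy stand-in for a knowledge base of predefined shortcuts."""

    def __init__(self, shortcuts: List[Shortcut]) -> None:
        self._by_intent: Dict[str, Shortcut] = {s.intent: s for s in shortcuts}

    def lookup(self, intent: str) -> Optional[Shortcut]:
        return self._by_intent.get(intent)


def plan_steps(
    subtasks: List[Tuple[str, int]], kb: ShortcutKnowledgeBase
) -> Tuple[List[Tuple[str, str]], int]:
    """Build a plan from (intent, gui_step_count) pairs.

    A matching shortcut counts as one step; otherwise the sub-task costs
    its full number of GUI actions. Returns (plan, total_steps).
    """
    plan: List[Tuple[str, str]] = []
    total = 0
    for intent, gui_steps in subtasks:
        shortcut = kb.lookup(intent)
        if shortcut is not None:
            plan.append(("shortcut", shortcut.name))
            total += 1
        else:
            plan.append(("gui", intent))
            total += gui_steps
    return plan, total


# Illustrative task: one sub-task has a deep link, the other does not.
kb = ShortcutKnowledgeBase(
    [Shortcut("compose_deeplink", ShortcutKind.DEEP_LINK, "open_compose")]
)
subtasks = [("open_compose", 3), ("type_body", 2)]

hybrid_plan, hybrid_steps = plan_steps(subtasks, kb)
_, gui_only_steps = plan_steps(subtasks, ShortcutKnowledgeBase([]))
print(hybrid_plan, hybrid_steps, gui_only_steps)
```

In this toy case the hybrid plan takes 3 steps against 5 for GUI-only. A real agent would also need to handle shortcut failures, which the sketch leaves out.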

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.