iOSWorld: A Benchmark for Personally Intelligent Phone Agents

· Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

iOSWorld is presented as the first interactive benchmark built on a native iOS simulator for evaluating personally intelligent phone agents. To address the lack of personalization in existing mobile agent benchmarks, it maintains a persistent user identity across 26 newly built iOS apps whose data is connected, including transactions, messages, and financial activity. The benchmark comprises 133 tasks in three categories: single-app (27), multi-app (60, each spanning 2 to 8 apps), and memory and personalization (46), the last of which require inferring patterns from personal data. Frontier and open-source computer-use models were evaluated in both a vision-only setting and a privileged vision+XML setting. The best configuration achieves 52% overall but only 37% on multi-app tasks. Privileged vision+XML access improves frontier models by up to 26 percentage points, while smaller models gain no benefit from it. iOSWorld is released as open source, including all apps, seeded data, tasks, rubrics, and evaluation code.

Key takeaway

For machine learning engineers building mobile AI agents, this benchmark exposes a gap in personalization. Prioritize agent architectures that can reason over a persistent, on-device user identity and over data spread across multiple apps. Multi-app tasks are a clear weak spot: even the best configuration reaches only 37% there, against 52% overall. If you work with frontier models, consider giving them privileged vision+XML access, which improved results by up to 26 percentage points in the reported evaluations; smaller models showed no benefit from it.

Key insights

Personally intelligent phone agents require benchmarks that simulate persistent user identity and complex, multi-app interactions.

Method

iOSWorld constructs an interactive native iOS simulator with 26 apps, seeded personal data, and 133 tasks across single-app, multi-app, and memory/personalization categories for agent evaluation.
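The category structure described above can be illustrated with a minimal sketch of how per-category and overall success rates might be aggregated. The category sizes come from the summary; the `TaskResult` record, the category keys, and the `success_rates` helper are hypothetical names chosen for illustration and are not taken from the released evaluation code. The outcomes in the demo are toy values, not benchmark results.

```python
from collections import defaultdict
from dataclasses import dataclass

# Category sizes as reported in the summary (27 + 60 + 46 = 133 tasks).
CATEGORY_SIZES = {
    "single_app": 27,
    "multi_app": 60,
    "memory_personalization": 46,
}


@dataclass(frozen=True)
class TaskResult:
    """One evaluated task (hypothetical record shape, not the released schema)."""
    task_id: str
    category: str
    passed: bool


def success_rates(results):
    """Return per-category and overall success rates as fractions in [0, 1]."""
    if not results:
        raise ValueError("no results to aggregate")
    passed = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        total[r.category] += 1
        passed[r.category] += int(r.passed)
    rates = {category: passed[category] / total[category] for category in total}
    # The overall rate is computed per task, so larger categories weigh more.
    rates["overall"] = sum(passed.values()) / sum(total.values())
    return rates


if __name__ == "__main__":
    # Sanity check on the reported task counts.
    assert sum(CATEGORY_SIZES.values()) == 133

    # Toy outcomes for illustration only.
    demo = [
        TaskResult("t1", "single_app", True),
        TaskResult("t2", "multi_app", False),
        TaskResult("t3", "multi_app", True),
        TaskResult("t4", "memory_personalization", False),
    ]
    print(success_rates(demo))
```

Because the overall rate in this sketch is a per-task average, the 60 multi-app tasks would carry the most weight in it. Whether the benchmark's reported overall score is computed this way is an assumption.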

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.