iOSWorld: A Benchmark for Personally Intelligent Phone Agents
Summary
iOSWorld is presented as the first interactive, native iOS simulator benchmark for evaluating personally intelligent phone agents. To address the lack of personalization in existing mobile agent benchmarks, it maintains a persistent user identity across 26 newly built iOS apps that share connected data such as transactions, messages, and financial activity. The benchmark comprises 133 tasks in three categories: single-app (27), multi-app (60, spanning 2-8 apps), and memory and personalization (46), the last of which requires inferring patterns from personal data. Frontier and open-source computer-use models are evaluated in both a vision-only setting and a privileged vision+XML setting. The best configuration reaches 52% overall but only 37% on multi-app tasks. Privileged vision+XML access boosts frontier models by up to 26 percentage points, while smaller models gain no benefit. iOSWorld is released open-source, including all apps, seeded data, tasks, rubrics, and evaluation code.
Key takeaway
For Machine Learning Engineers developing mobile AI agents, this benchmark highlights a critical gap in personalization. Prioritize agent architectures that can reason over persistent, on-device user identity and data spread across multiple apps. Focus development effort on multi-app tasks, where even the best configuration reaches only 37% (against 52% overall), and on memory and personalization tasks. Where privileged vision+XML access is available, consider using it with frontier models, which gained up to 26 percentage points from it; smaller models showed no benefit.
Key insights
Personally intelligent phone agents require benchmarks that simulate persistent user identity and complex, multi-app interactions.
Principles
- Phone agents must reason over on-device user identity and history.
- Existing mobile agent benchmarks do not test personalization.
- Privileged vision+XML access improves frontier models by up to 26 percentage points; smaller models gain no benefit.
Method
iOSWorld constructs an interactive native iOS simulator with 26 apps, seeded personal data, and 133 tasks across single-app, multi-app, and memory/personalization categories for agent evaluation.
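To make the task composition concrete, here is a minimal Python sketch that represents the three categories and checks that the reported counts add up to 133. The `Task` class, the category keys, and the placeholder task IDs are hypothetical illustrations and are not taken from the iOSWorld codebase; only the counts (27, 60, 46) come from the summary above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; not the benchmark's actual schema."""
    task_id: str
    category: str  # "single_app", "multi_app", or "memory_personalization"


# Category sizes as reported in the summary.
CATEGORY_COUNTS = {
    "single_app": 27,
    "multi_app": 60,
    "memory_personalization": 46,
}


def build_placeholder_tasks(counts):
    """Create placeholder tasks so the category split can be inspected."""
    return [
        Task(task_id=f"{category}-{index:03d}", category=category)
        for category, n in counts.items()
        for index in range(n)
    ]


def category_shares(tasks):
    """Return the fraction of all tasks that falls into each category."""
    counts = Counter(task.category for task in tasks)
    total = len(tasks)
    return {category: n / total for category, n in counts.items()}


tasks = build_placeholder_tasks(CATEGORY_COUNTS)
shares = category_shares(tasks)

for category, share in shares.items():
    print(f"{category}: {share:.1%}")
```

Multi-app tasks make up the largest share of the benchmark (60 of 133), so weakness there weighs heavily on any overall score that averages over tasks.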
In practice
- Evaluate agent performance on multi-app and personalization tasks.
- Use privileged vision+XML access with frontier models where available; smaller models showed no benefit from it.
- Develop agents capable of inferring patterns from personal data.
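One way to act on the points above is to report success rates per task category and per observation setting rather than a single overall number. The sketch below assumes each task run is scored as a binary pass or fail, which may be simpler than the benchmark's actual rubrics. The function names, setting labels, and `toy_results` records are all hypothetical and are not measured results.

```python
from collections import defaultdict


def success_rates(results):
    """Compute the pass rate for each (category, setting) pair.

    `results` is an iterable of (category, setting, passed) tuples.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, setting, passed in results:
        totals[(category, setting)] += 1
        passes[(category, setting)] += int(passed)
    return {key: passes[key] / totals[key] for key in totals}


def xml_gain_points(rates, category):
    """Percentage-point gain of vision+XML over vision-only for a category."""
    with_xml = rates[(category, "vision_xml")]
    without_xml = rates[(category, "vision_only")]
    return 100.0 * (with_xml - without_xml)


# Toy records for illustration only; these are NOT iOSWorld results.
toy_results = [
    ("multi_app", "vision_only", False),
    ("multi_app", "vision_only", True),
    ("multi_app", "vision_only", False),
    ("multi_app", "vision_only", False),
    ("multi_app", "vision_xml", True),
    ("multi_app", "vision_xml", True),
    ("multi_app", "vision_xml", False),
    ("multi_app", "vision_xml", False),
]

rates = success_rates(toy_results)
gain = xml_gain_points(rates, "multi_app")
print(f"multi_app gain from XML access: {gain:.1f} points")
```

Breaking results down this way shows whether a gain from XML access holds in the harder multi-app and personalization categories or is concentrated in single-app tasks.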
Topics
- iOSWorld
- Mobile Agents
- AI Benchmarking
- Personalization
- iOS Simulator
- Multi-app Interaction
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.