PhoneWorld: Scaling Phone-Use Agent Environments
Summary
PhoneWorld is a reusable pipeline that addresses a bottleneck in phone-use agent research: building scalable, reproducible environments. It converts real GUI trajectories and screenshots into controllable environments, executable tasks, automatic verifiers, and training rollouts. The pipeline builds runnable mock Android apps from read-only content and mutable state, and derives tasks and verifiers for them. PhoneWorld currently covers 34 apps across 16 domains and includes common consumer behaviors such as search, browsing, and shopping. When PhoneWorld supervision replaces 10K steps from an auxiliary AndroidWorld corpus, scores improve on every reported benchmark: HYMobileBench by 17.7 points, AndroidControl by 6.0 points, AndroidWorld by 14.7 points, and PhoneWorld by 52.5 points. Further scaling studies show that more PhoneWorld supervision and broader app coverage yield additional gains.
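The replacement setup described above can be pictured as a fixed-budget swap: a number of auxiliary steps are dropped and the same number of PhoneWorld steps are added, so the total amount of training data stays constant. The sketch below is illustrative only. The function name `replace_steps`, the random selection of which steps to drop, and the toy corpora are assumptions, not details taken from the paper.

```python
import random


def replace_steps(auxiliary, phoneworld, k, seed=0):
    """Swap k auxiliary steps for k PhoneWorld steps, keeping the total size fixed.

    Hypothetical helper: the summary does not say how steps are selected,
    so uniform random sampling is assumed here.
    """
    if k > len(auxiliary) or k > len(phoneworld):
        raise ValueError("k exceeds the size of one of the corpora")
    rng = random.Random(seed)
    kept = rng.sample(auxiliary, len(auxiliary) - k)   # auxiliary steps that remain
    added = rng.sample(phoneworld, k)                  # PhoneWorld steps swapped in
    mixed = kept + added
    rng.shuffle(mixed)
    return mixed


# Toy corpora standing in for step-level training samples.
aux = [("aux", i) for i in range(50)]
pw = [("phoneworld", i) for i in range(30)]
mixed = replace_steps(aux, pw, k=10)
```

Because the budget is held fixed, any change in benchmark scores can be attributed to the source of the swapped steps rather than to a larger training set.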
Key takeaway
For AI engineers developing mobile agents, PhoneWorld marks a shift from manual benchmark creation to scalable environment generation. Consider using its pipeline to convert real GUI trajectories into diverse training rollouts and verifiable tasks. The reported results suggest this can improve agent performance across several mobile benchmarks, with larger gains as app coverage and supervision increase.
Key insights
PhoneWorld scales phone-use agent environment creation by converting real GUI trajectories into runnable, verifiable tasks and training rollouts.
Principles
- Real GUI trajectories inform environment construction.
- Scaling supervision improves agent performance.
- Broader app coverage yields larger gains.
Method
PhoneWorld converts real GUI trajectories and screenshots into controllable environments, executable tasks, automatic verifiers, and training rollouts. It builds the mock Android apps from recovered screen connections and interaction states.
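To make the idea of a controllable environment with an automatic verifier concrete, here is a minimal sketch of a mock app that separates read-only content from mutable state and checks task success against the final state. Everything in it (`MockApp`, `Task`, `run_episode`, the action names, the toy catalogue) is hypothetical and chosen for illustration; it is not PhoneWorld's actual API or app format.

```python
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Callable, Mapping


@dataclass
class MockApp:
    """Toy stand-in for a mock app: fixed content plus mutable state."""
    content: Mapping[str, dict]  # read-only catalogue shown on screens
    state: dict = field(default_factory=lambda: {"screen": "home", "cart": []})

    def step(self, action: str, arg: str = "") -> None:
        """Apply one agent action; unknown actions leave the state unchanged."""
        if action == "open" and arg in self.content:
            self.state["screen"] = f"item:{arg}"
        elif action == "add_to_cart" and self.state["screen"].startswith("item:"):
            self.state["cart"].append(self.state["screen"].split(":", 1)[1])
        elif action == "home":
            self.state["screen"] = "home"


@dataclass(frozen=True)
class Task:
    instruction: str
    verify: Callable[[dict], bool]  # inspects the final mutable state only


def run_episode(app: MockApp, task: Task, actions: list) -> bool:
    """Replay (action, arg) pairs, then let the verifier judge the final state."""
    for action, arg in actions:
        app.step(action, arg)
    return task.verify(app.state)


# MappingProxyType gives a read-only view, so content cannot be edited by actions.
catalog = MappingProxyType({"sku1": {"name": "USB cable"}, "sku2": {"name": "Phone case"}})
task = Task("Add the phone case to the cart.", lambda s: s["cart"] == ["sku2"])

success = run_episode(MockApp(catalog), task, [("open", "sku2"), ("add_to_cart", "")])
failure = run_episode(MockApp(catalog), task, [("open", "sku1"), ("add_to_cart", "")])
```

Verifying against state rather than against a fixed action sequence means any trajectory that reaches the goal counts as a success, which is what makes such checks usable for scoring rollouts automatically.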
In practice
- Generate diverse mobile agent training data.
- Evaluate agents across 34 apps in 16 domains.
- Improve mobile agent benchmark scores.
Topics
- Phone-Use Agents
- Mobile Agent Environments
- GUI Trajectories
- Android Apps
- Benchmark Evaluation
- Scalable Training Data
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.