PhoneWorld: Scaling Phone-Use Agent Environments

2026-05-28 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Advanced, quick

Summary

PhoneWorld is a reusable pipeline that addresses a bottleneck in phone-use agent research: building environments that are both scalable and reproducible. It converts real GUI trajectories and screenshots into controllable environments, executable tasks, automatic verifiers, and training rollouts. The pipeline builds runnable mock Android apps from read-only content and mutable state, and derives tasks and verifiers for them. PhoneWorld currently covers 34 apps across 16 domains, including common consumer behaviors such as search, browsing, and shopping. When PhoneWorld supervision replaces 10K steps of an auxiliary AndroidWorld corpus, benchmark scores rise by 17.7 points on HYMobileBench, 6.0 points on AndroidControl, 14.7 points on AndroidWorld, and 52.5 points on PhoneWorld itself. Scaling studies show that more PhoneWorld supervision and broader app coverage yield further gains.
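One way to read "replace 10K steps" is as a budget-matched swap: the total number of training steps stays fixed while part of the auxiliary corpus is exchanged for PhoneWorld steps. The sketch below illustrates that reading only. The function name swap_steps, the tuple representation of a step, and the uniform random sampling are all assumptions, not details taken from the paper.

```python
import random


def swap_steps(auxiliary, replacement, n_swap, seed=0):
    """Return a mix of the same size as `auxiliary`, with `n_swap` of its
    steps dropped and `n_swap` steps from `replacement` added.

    Hypothetical helper: uniform sampling is an assumption, not the
    paper's documented procedure.
    """
    if n_swap > len(auxiliary) or n_swap > len(replacement):
        raise ValueError("n_swap exceeds the number of available steps")
    rng = random.Random(seed)
    kept = rng.sample(auxiliary, len(auxiliary) - n_swap)  # auxiliary steps that survive
    added = rng.sample(replacement, n_swap)                # steps swapped in
    mixed = kept + added
    rng.shuffle(mixed)                                     # avoid ordering effects
    return mixed


# Toy corpora: each step is a (source, index) pair.
auxiliary = [("aux", i) for i in range(50)]
phoneworld = [("pw", i) for i in range(20)]
mixed = swap_steps(auxiliary, phoneworld, n_swap=10)
```

Keeping the total fixed means any change in benchmark scores can be attributed to the content of the swapped steps rather than to a larger training set.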

Key takeaway

For AI engineers building mobile agents, PhoneWorld marks a shift from hand-built benchmarks to scalable environment generation. Consider using its pipeline to convert real GUI trajectories into diverse training rollouts and verifiable tasks. The reported results suggest this can improve agent performance across several mobile benchmarks, with larger gains as app coverage and supervision increase, which may in turn shorten development cycles.

Key insights

PhoneWorld scales phone-use agent environment creation by converting real GUI trajectories into runnable, verifiable tasks and training rollouts.

Principles

Real GUI trajectories inform environment construction.
Scaling supervision improves agent performance.
Broader app coverage yields larger gains.

Method

PhoneWorld converts real GUI trajectories and screenshots into controllable environments, executable tasks, automatic verifiers, and training rollouts. To do so, it builds mock Android apps from recovered screen connections and interaction states.
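To make the pieces concrete, here is a minimal sketch of a mock app that separates read-only content from mutable state, paired with a task whose verifier inspects the final state. Every name in it (MockShopApp, Task, rollout, the action strings, the catalog entries) is hypothetical and chosen for illustration. PhoneWorld's real apps are runnable Android mocks, and its actual interfaces are not described in this summary.

```python
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Callable, Mapping


@dataclass
class MockShopApp:
    """Hypothetical mock app: a fixed catalog plus a mutable cart."""
    catalog: Mapping[str, float]                        # read-only content
    cart: dict[str, int] = field(default_factory=dict)  # mutable state
    screen: str = "home"                                # current screen name

    def step(self, action: str, arg: str = "") -> str:
        """Apply one agent action and return the resulting screen."""
        if action == "search":
            self.screen = "results" if arg in self.catalog else "empty"
        elif action == "add_to_cart" and self.screen == "results" and arg in self.catalog:
            self.cart[arg] = self.cart.get(arg, 0) + 1
            self.screen = "cart"
        elif action == "back":
            self.screen = "home"
        return self.screen  # unrecognized or invalid actions leave the app unchanged


@dataclass(frozen=True)
class Task:
    """An instruction plus an automatic verifier over the app's final state."""
    instruction: str
    verify: Callable[[MockShopApp], bool]


def make_app() -> MockShopApp:
    # MappingProxyType makes the catalog read-only at runtime.
    return MockShopApp(catalog=MappingProxyType({"usb cable": 4.99, "phone case": 12.50}))


def rollout(app: MockShopApp, task: Task, actions):
    """Execute actions, record (screen, action, arg) steps, then verify."""
    steps = []
    for action, arg in actions:
        steps.append({"screen": app.screen, "action": action, "arg": arg})
        app.step(action, arg)
    return steps, task.verify(app)


task = Task(
    instruction="Add one usb cable to the cart.",
    verify=lambda app: app.cart == {"usb cable": 1},
)
steps, success = rollout(
    make_app(), task, [("search", "usb cable"), ("add_to_cart", "usb cable")]
)
```

Because the verifier checks state rather than the exact action sequence, any trajectory that ends with the right cart contents counts as a success, and the recorded steps can double as training rollouts.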

In practice

Generate diverse mobile agent training data.
Evaluate agents across 34 apps in 16 domains.
Improve mobile agent benchmark scores.

Topics

Phone-Use Agents
Mobile Agent Environments
GUI Trajectories
Android Apps
Benchmark Evaluation
Scalable Training Data

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.