PhoneHarness: Harnessing Phone-Use Agents through Mixed GUI, CLI, and Tool Actions
Summary
PhoneHarness is a mixed-action benchmark and execution harness for evaluating phone-use agents on verifiable mobile workflows. The mobile-agent literature often focuses on GUI control, but real phone tasks require agents to choose among app GUIs, device-side commands, and structured tools, and to produce verifiable side effects. PhoneHarness addresses this by running a device-side agent loop that integrates GUI, CLI, and host-side tool actions, with deterministic action routing, bounded GUI delegation, and auditable execution traces. Its accompanying benchmark, PhoneHarness Bench, judges task completion by observable side effects rather than by plausible final answers. On the annotated evaluation split, PhoneHarness achieves a 75.0% pass rate, 12.9 percentage points above the strongest non-PhoneHarness settings. This suggests that reliable phone automation depends on action-surface routing and verifiable execution, not on visual GUI control alone.
Key takeaway
AI engineers building mobile automation agents should prioritize systems that integrate mixed action types (GUI, CLI, and structured tools) rather than relying solely on visual GUI control. Evaluation should measure verifiable task completion through observable side effects, not just plausible final states. PhoneHarness's 75.0% pass rate on its annotated evaluation split supports this approach as a route to reliable and safe phone automation in real-world workflows.
Key insights
Reliable phone automation requires agents to integrate GUI, CLI, and tool actions with verifiable execution, moving beyond mere GUI control.
Principles
- Mobile agents need mixed-action capabilities.
- Verifiable side effects are crucial for evaluation.
- Action-surface routing enhances reliability.
Method
PhoneHarness runs a device-side agent loop combining GUI, CLI, and host-side tool actions. It uses deterministic action routing, bounded GUI delegation, and auditable execution traces to verify task completion.
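The three mechanisms named above (deterministic action routing, bounded GUI delegation, and an auditable trace) can be illustrated with a minimal sketch. This is not the PhoneHarness implementation: the names `Action`, `Harness`, `dispatch`, and `max_gui_steps` are hypothetical, and the handlers are stand-ins for real GUI, CLI, and tool backends.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Action:
    surface: str  # hypothetical labels: "gui", "cli", or "tool"
    payload: str


@dataclass
class Harness:
    # Maps each action surface to a handler; the handlers are assumptions
    # standing in for real GUI, CLI, and host-side tool backends.
    handlers: Dict[str, Callable[[str], str]]
    max_gui_steps: int = 3  # hypothetical budget for GUI delegation
    trace: List[dict] = field(default_factory=list)
    gui_steps: int = 0

    def dispatch(self, action: Action) -> str:
        # Deterministic routing: the surface label alone selects the handler.
        if action.surface not in self.handlers:
            result = "rejected: unknown surface"
        elif action.surface == "gui" and self.gui_steps >= self.max_gui_steps:
            # Bounded delegation: refuse GUI actions once the budget is spent.
            result = "rejected: gui budget exhausted"
        else:
            if action.surface == "gui":
                self.gui_steps += 1
            result = self.handlers[action.surface](action.payload)
        # Auditable trace: every attempt is recorded, including rejections.
        self.trace.append(
            {
                "step": len(self.trace),
                "surface": action.surface,
                "payload": action.payload,
                "result": result,
            }
        )
        return result
```

Recording rejected actions alongside successful ones is a design choice of this sketch; it lets a reviewer see from the trace alone where the agent exceeded its GUI budget or requested an unsupported surface.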
In practice
- Evaluate agents on observable side effects.
- Design agents for mixed GUI, CLI, and tool actions.
- Implement auditable execution traces for tasks.
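As a rough illustration of the first point, the sketch below scores a task by checking device state after the run and ignores what the agent claims in its final answer. The `DeviceState` dictionary, the `evaluate` and `pass_rate` functions, and the `alarm_set` key are all hypothetical; the benchmark's real checkers are not described in this summary.

```python
from typing import Callable, Dict, List

# Hypothetical snapshot of device state captured after the agent finishes.
DeviceState = Dict[str, object]
Check = Callable[[DeviceState], bool]


def evaluate(final_answer: str, state: DeviceState, checks: List[Check]) -> bool:
    # The agent's final answer is deliberately ignored: a task passes only
    # if every side-effect check holds on the observed device state.
    return all(check(state) for check in checks)


def pass_rate(results: List[bool]) -> float:
    # Percentage of tasks whose side-effect checks all passed.
    return 100.0 * sum(results) / len(results) if results else 0.0


# Usage: a confident answer fails when the side effect is absent.
alarm_checks: List[Check] = [lambda s: s.get("alarm_set") == "07:00"]
claimed_only = evaluate("Done, alarm set!", {"alarm_set": None}, alarm_checks)
actually_set = evaluate("", {"alarm_set": "07:00"}, alarm_checks)
```

Here `claimed_only` is `False` and `actually_set` is `True`, which is the distinction between a plausible final answer and a verified outcome.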
Topics
- Mobile Agents
- PhoneHarness
- GUI Automation
- CLI Automation
- Workflow Automation
- Verifiable Execution
Best for: Research Scientist, Machine Learning Engineer, AI Scientist, AI Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.