PhoneHarness: Harnessing Phone-Use Agents through Mixed GUI, CLI, and Tool Actions

Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

PhoneHarness is a mixed-action benchmark and execution harness for evaluating phone-use agents on verifiable mobile workflows. Mobile-agent research has largely focused on GUI control, but real phone tasks require an agent to choose among app GUIs, device-side commands, and structured tools, and to produce side effects that can be verified. PhoneHarness runs a device-side agent loop that integrates GUI, CLI, and host-side tool actions, with deterministic action routing, bounded GUI delegation, and auditable execution traces. Its accompanying benchmark, PhoneHarness Bench, judges task completion by observable side effects rather than by plausible final answers. On the annotated evaluation split, PhoneHarness achieves a 75.0% pass rate, surpassing the strongest non-PhoneHarness settings by 12.9 percentage points. The result suggests that reliable phone automation depends on action-surface routing and verifiable execution, not on visual GUI control alone.

Key takeaway

AI engineers building mobile automation agents should design systems that integrate mixed action types (GUI, CLI, and structured tools) rather than relying solely on visual GUI control. Evaluation should measure verifiable task completion through observable side effects, not just plausible final answers. This approach, which reaches a 75.0% pass rate with PhoneHarness, is presented as key to reliable and safe phone automation in real-world workflows.
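To make the evaluation point concrete, here is a minimal sketch of side-effect-based scoring. It is not PhoneHarness Bench's actual checker: `DeviceState`, `TaskCheck`, `verify`, and `pass_rate` are hypothetical names, and device state is modeled as a plain dictionary purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: device state is reduced to a flat key/value snapshot.
DeviceState = Dict[str, str]


@dataclass
class TaskCheck:
    """Hypothetical task checker: a description plus a predicate over device state."""
    description: str
    predicate: Callable[[DeviceState], bool]


def verify(final_answer: str, state: DeviceState, check: TaskCheck) -> bool:
    """Pass only if the observable state satisfies the check.

    The agent's final answer is deliberately ignored, so a confident
    "Done" with no matching side effect still fails.
    """
    return check.predicate(state)


def pass_rate(results: List[bool]) -> float:
    """Percentage of tasks whose side effects were verified."""
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(results) / len(results)
```

Under this scheme an agent that replies "Alarm set" while the alarm entry in the state snapshot is unchanged is scored as a failure, which is the distinction the takeaway draws between side effects and plausible answers.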

Key insights

Reliable phone automation requires agents to integrate GUI, CLI, and tool actions with verifiable execution, moving beyond mere GUI control.

Principles

Method

PhoneHarness runs a device-side agent loop combining GUI, CLI, and host-side tool actions. It uses deterministic action routing, bounded GUI delegation, and auditable execution traces to verify task completion.
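The three mechanisms named above can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the PhoneHarness implementation: the `Action`, `TraceEntry`, and `Harness` names, the `max_gui_steps` budget, and the string-based handlers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Action:
    surface: str  # assumed labels: "gui", "cli", or "tool"
    payload: str


@dataclass
class TraceEntry:
    step: int
    surface: str
    payload: str
    result: str


@dataclass
class Harness:
    """Toy mixed-action executor (hypothetical, for illustration only)."""
    handlers: Dict[str, Callable[[str], str]]
    max_gui_steps: int = 3  # assumed cap standing in for bounded GUI delegation
    trace: List[TraceEntry] = field(default_factory=list)
    gui_steps_used: int = 0

    def execute(self, action: Action) -> str:
        # Deterministic routing: the surface label alone selects the handler.
        if action.surface not in self.handlers:
            result = f"rejected: unknown surface {action.surface!r}"
        elif action.surface == "gui" and self.gui_steps_used >= self.max_gui_steps:
            # Bounded delegation: refuse GUI work once the budget is spent.
            result = "rejected: GUI budget exhausted"
        else:
            if action.surface == "gui":
                self.gui_steps_used += 1
            result = self.handlers[action.surface](action.payload)
        # Auditable trace: every attempt is recorded, including rejections.
        self.trace.append(
            TraceEntry(len(self.trace) + 1, action.surface, action.payload, result)
        )
        return result
```

Recording rejected actions alongside successful ones keeps the trace complete, so a reviewer can see where the agent tried to exceed its GUI budget rather than only what succeeded.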

Best for: Research Scientist, Machine Learning Engineer, AI Scientist, AI Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.