KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation

Source: Artificial Intelligence · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Expert, quick

Summary

KnowU-Bench is a new online benchmark for evaluating personalized mobile agents in a reproducible Android emulation environment. It addresses limitations of prior benchmarks by focusing on interactive preference elicitation, proactive assistance, and consent handling, rather than static preference recovery or fixed intent prediction. The benchmark includes 42 general GUI tasks, 86 personalized tasks, and 64 proactive tasks. Crucially, KnowU-Bench hides user profiles from agents, so they must infer preferences from behavioral logs and engage in multi-turn clarification dialogues with an LLM-driven user simulator. It evaluates the complete proactive decision chain, from GUI execution to consent negotiation and post-rejection restraint, using a hybrid rule-based and LLM-as-a-Judge scoring protocol. Initial experiments show that even frontier models such as Claude Sonnet 4.6 score below 50% on tasks requiring preference inference or intervention calibration, which points to a significant gap in current agent capabilities.
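
To make "consent negotiation and post-rejection restraint" concrete, here is a minimal Python sketch of how such checks could be expressed as rules over an interaction trace. It is an illustration only, not KnowU-Bench's actual implementation: the `Event` type, the event kinds ("propose", "accept", "reject", "act"), and the function name `check_proactive_trace` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Event:
    """One step of a (hypothetical) proactive-task trace."""
    actor: str  # "agent" or "user"
    kind: str   # "propose", "accept", "reject", or "act"


def check_proactive_trace(events: List[Event]) -> Dict[str, bool]:
    """Flag two failure modes: acting without consent, and acting after a rejection."""
    consent_granted = False
    rejected = False
    acted_without_consent = False
    acted_after_rejection = False

    for e in events:
        if e.actor == "user" and e.kind == "accept":
            consent_granted, rejected = True, False
        elif e.actor == "user" and e.kind == "reject":
            consent_granted, rejected = False, True
        elif e.actor == "agent" and e.kind == "act":
            if rejected:
                acted_after_rejection = True
            elif not consent_granted:
                acted_without_consent = True

    return {
        "acted_without_consent": acted_without_consent,
        "acted_after_rejection": acted_after_rejection,
    }


# Assumed example trace: the agent proposes, the user declines, the agent acts anyway.
trace = [Event("agent", "propose"), Event("user", "reject"), Event("agent", "act")]
result = check_proactive_trace(trace)
```

In this sketch the example trace is flagged for acting after rejection. A real benchmark would presumably derive such events from emulator state and dialogue logs rather than from hand-written labels.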

Key takeaway

Research scientists developing personalized mobile agents should prioritize systems capable of genuine preference inference through interactive dialogue and robust proactive decision-making. Evaluation metrics must extend beyond GUI navigation to include multi-turn preference elicitation, consent negotiation, and appropriate restraint after rejection, because current frontier models show significant weaknesses in these areas. This shift is critical for developing trustworthy and effective personal assistants.

Key insights

Evaluating personalized mobile agents requires dynamic preference inference and proactive interaction, not just static context lookup.

Principles

Method

KnowU-Bench combines an Android emulation environment, an LLM-driven user simulator for multi-turn preference elicitation, and a hybrid rule-based/LLM-as-a-Judge scoring protocol to evaluate personalized and proactive mobile agents.
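
One way a hybrid rule-based/LLM-as-a-Judge score could be assembled is sketched below in Python. The summary does not say how KnowU-Bench combines the two signals, so the weighted average, the `rule_weight` parameter, the `hybrid_score` name, and the stubbed judge are all assumptions made for illustration. In a real setup the judge callable would wrap a call to an LLM.

```python
from typing import Callable, Dict, List

State = Dict[str, object]


def hybrid_score(
    state: State,
    rule_checks: List[Callable[[State], bool]],
    judge: Callable[[str], float],
    transcript: str,
    rule_weight: float = 0.5,  # assumed weighting, not taken from the benchmark
) -> float:
    """Blend deterministic state checks with a judge score, both in [0, 1]."""
    if rule_checks:
        passed = sum(1 for check in rule_checks if check(state))
        rule_part = passed / len(rule_checks)
    else:
        rule_part = 0.0
    # Clamp the judge output so a misbehaving judge cannot leave [0, 1].
    judge_part = min(1.0, max(0.0, judge(transcript)))
    return rule_weight * rule_part + (1.0 - rule_weight) * judge_part


# Hypothetical final device state and rule: the requested alarm exists.
final_state: State = {"alarm_set": True}
rules = [lambda s: s.get("alarm_set") is True]


def stub_judge(transcript: str) -> float:
    """Stand-in for an LLM judge; returns a fixed score for demonstration."""
    return 0.5


score = hybrid_score(final_state, rules, stub_judge, "agent: Shall I set the alarm?")
```

The split reflects a common design choice: rules verify outcomes that can be read directly from device state, while the judge scores qualities such as whether a clarification question matched the user's preferences, which are hard to capture with fixed checks.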

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.