KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation
Summary
KnowU-Bench is a new online benchmark designed to evaluate personalized mobile agents in a reproducible Android emulation environment. It addresses limitations of prior benchmarks by focusing on interactive preference elicitation, proactive assistance, and consent handling, rather than static preference recovery or fixed intent prediction. The benchmark includes 42 general GUI tasks, 86 personalized tasks, and 64 proactive tasks. Crucially, KnowU-Bench hides user profiles from agents, requiring them to infer preferences from behavioral logs and engage in multi-turn clarification dialogues via an LLM-driven user simulator. It evaluates the complete proactive decision chain, from GUI execution to consent negotiation and post-rejection restraint, using a hybrid rule-based and LLM-as-a-Judge scoring protocol. Initial experiments show that even frontier models like Claude Sonnet 4.6 perform below 50% on tasks requiring preference inference or intervention calibration, highlighting a significant gap in current agent capabilities.
Key takeaway
If you are a research scientist developing personalized mobile agents, prioritize building systems capable of genuine preference inference through interactive dialogue and robust proactive decision-making. Your evaluation metrics must extend beyond GUI navigation to cover multi-turn preference elicitation, consent negotiation, and appropriate restraint after rejection, because current frontier models are markedly weak in these areas. This shift is critical for developing trustworthy and effective personal assistants.
Key insights
Evaluating personalized mobile agents requires dynamic preference inference and proactive interaction, not just static context lookup.
Principles
- User profiles should be hidden for genuine preference inference.
- Proactive agents need to negotiate consent and respect rejections.
Method
KnowU-Bench evaluates personalized and proactive mobile agents with three components: a reproducible Android emulation environment, an LLM-driven user simulator for multi-turn preference elicitation, and a hybrid scoring protocol that combines rule-based checks with an LLM-as-a-Judge.
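To make the idea of hybrid scoring concrete, here is a minimal Python sketch. It is not KnowU-Bench's actual scorer: the `Episode` fields, the `rule_score` and `hybrid_score` functions, the equal weighting, and the keyword-based `stub_judge` (which stands in for a real LLM judge) are all assumptions made for illustration. The summary does not say how the benchmark combines its two signals.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Episode:
    """Hypothetical record of one agent run (field names are assumptions)."""
    final_state: Dict[str, str]  # snapshot of app state after the agent acts
    transcript: str              # agent/user dialogue text


def rule_score(episode: Episode, expected: Dict[str, str]) -> float:
    """Rule-based part: fraction of expected key/value pairs found in the final state."""
    if not expected:
        return 1.0
    hits = sum(1 for key, value in expected.items()
               if episode.final_state.get(key) == value)
    return hits / len(expected)


def hybrid_score(episode: Episode,
                 expected: Dict[str, str],
                 judge: Callable[[str], float],
                 rule_weight: float = 0.5) -> float:
    """Weighted mix of the rule score and a judge score clamped to [0, 1].

    The linear weighting is an illustrative assumption, not the paper's formula.
    """
    judged = min(1.0, max(0.0, judge(episode.transcript)))
    return rule_weight * rule_score(episode, expected) + (1.0 - rule_weight) * judged


def stub_judge(transcript: str) -> float:
    """Placeholder for an LLM judge: rewards transcripts that ask about preferences."""
    return 1.0 if "prefer" in transcript.lower() else 0.0


episode = Episode(
    final_state={"alarm": "07:00", "volume": "low"},
    transcript="Agent: Do you prefer a quiet alarm? User: Yes.",
)
score = hybrid_score(episode, {"alarm": "07:00", "volume": "high"}, stub_judge)
print(score)  # 0.75: rule score 0.5 and judge score 1.0, equally weighted
```

Separating the two functions keeps the deterministic state checks reproducible while isolating the subjective judgement in a single replaceable callable.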
In practice
- Test agents on dynamic preference acquisition.
- Implement consent negotiation in proactive systems.
- Focus on intervention calibration for agents.
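The consent and restraint points above can be sketched as a check over an agent's event trace. This is a hedged illustration, not the benchmark's evaluation code: the `Event` names and the `respects_consent` function are hypothetical, and the policy it encodes (no execution without an accepted proposal, and no further proposal once the user has rejected one) is one assumed reading of "post-rejection restraint".

```python
from enum import Enum
from typing import List


class Event(Enum):
    """Hypothetical event types in a proactive episode."""
    PROPOSE = "propose"  # agent offers proactive help
    ACCEPT = "accept"    # user agrees to the pending proposal
    REJECT = "reject"    # user declines the pending proposal
    EXECUTE = "execute"  # agent performs a GUI action on the user's behalf


def respects_consent(trace: List[Event]) -> bool:
    """Return True if the trace never executes without consent and never
    re-proposes after a rejection (assumed policy, see lead-in)."""
    pending = False   # a proposal is awaiting the user's answer
    consent = False   # the user has accepted the latest proposal
    rejected = False  # the user has rejected a proposal in this episode
    for event in trace:
        if event is Event.PROPOSE:
            if rejected:
                return False  # pushing again after a rejection
            pending, consent = True, False
        elif event is Event.ACCEPT:
            if pending:
                pending, consent = False, True
        elif event is Event.REJECT:
            if pending:
                pending, consent, rejected = False, False, True
        elif event is Event.EXECUTE:
            if not consent:
                return False  # acting without an accepted proposal
    return True


print(respects_consent([Event.PROPOSE, Event.ACCEPT, Event.EXECUTE]))  # True
print(respects_consent([Event.PROPOSE, Event.REJECT, Event.EXECUTE]))  # False
```

A trace-level check like this only covers the restraint half of intervention calibration; deciding whether a proposal was warranted in the first place would still need a task-specific label or a judge.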
Topics
- KnowU-Bench
- Mobile Agent Evaluation
- Personalized Agents
- Proactive Assistance
- LLM-driven User Simulator
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.