Xiaomi-GUI-0 Technical Report
Summary
Xiaomi-GUI-0 is a native multimodal graphical user interface (GUI) agent designed for real mobile environments, aiming to close the gap between benchmark scores and real-world usability. It completes user tasks end-to-end through interface actions like tapping and text entry. The agent is trained and evaluated within a real-device closed loop, utilizing a real-device-dominant hybrid infrastructure where physical devices are the primary execution environment. Its training data is multi-source, covering high-frequency tasks, long-tail intents, and capability-enhancement data for reflection and memory, augmented by an error-driven data flywheel. The model undergoes a three-stage progressive pipeline: supervised fine-tuning, step-level reinforcement learning, and agentic reinforcement learning. Xiaomi-GUI-0 achieves 72.0% success on RealMobile and 78.9% on AndroidWorld, significantly improving execution stability and abnormal-state recognition in real tasks.
Key takeaway
AI engineers building GUI agents for mobile applications should recognize that traditional benchmarks often misrepresent real-world performance, because live interfaces have dynamic layouts and abnormal states. Prioritize real-device closed-loop training and evaluation, coupled with an error-driven data flywheel, to improve execution stability and abnormal-state recognition. This approach moves beyond simulated environments and brings agents closer to deployment readiness for complex, real-world mobile interactions.
Key insights
Real-device closed-loop training and error-driven data are crucial for robust GUI agent performance in dynamic mobile environments.
Principles
- Offline trajectories and simulated environments misrepresent real-world GUI agent performance.
- Real-device execution environments are essential for characterizing execution stability.
- An error-driven data flywheel improves agent performance by converting failures into corrected actions.
Method
Xiaomi-GUI-0 employs a progressive three-stage training pipeline: supervised fine-tuning, step-level reinforcement learning, and agentic reinforcement learning, integrated with an error-driven data flywheel.
In practice
- Develop GUI agents using a real-device-dominant hybrid infrastructure for data collection and evaluation.
- Implement an error-driven data flywheel to generate corrected actions and recovery demonstrations from agent failures.
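The flywheel idea above can be illustrated with a minimal Python sketch. It is not the report's implementation: the summary does not say how corrected actions are produced, so this sketch assumes a reference action sequence per task (for example, from human annotation) is available. All names here (`Step`, `Episode`, `CorrectionSample`, `harvest_corrections`, `success_rate`) are hypothetical, and observations are plain strings standing in for screenshots.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Step:
    observation: str  # stand-in for a screenshot or UI state
    action: str       # action the agent chose, e.g. "tap:send"


@dataclass
class Episode:
    task: str
    steps: List[Step]
    success: bool     # outcome reported by the real-device run


@dataclass
class CorrectionSample:
    task: str
    observation: str
    wrong_action: str
    corrected_action: str


def first_error_index(episode: Episode, reference: List[str]) -> Optional[int]:
    """Return the index of the first step that deviates from the reference."""
    for i, (step, ref_action) in enumerate(zip(episode.steps, reference)):
        if step.action != ref_action:
            return i
    return None


def harvest_corrections(
    episodes: List[Episode], references: Dict[str, List[str]]
) -> List[CorrectionSample]:
    """Turn failed episodes into (observation, corrected action) samples.

    Assumption: `references` maps a task to a known-good action sequence.
    Failed episodes without a reference or without a deviation are skipped.
    """
    samples = []
    for ep in episodes:
        if ep.success or ep.task not in references:
            continue
        ref = references[ep.task]
        idx = first_error_index(ep, ref)
        if idx is None:
            continue
        bad = ep.steps[idx]
        samples.append(
            CorrectionSample(ep.task, bad.observation, bad.action, ref[idx])
        )
    return samples


def success_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes that succeeded (0.0 for an empty list)."""
    if not episodes:
        return 0.0
    return sum(ep.success for ep in episodes) / len(episodes)


# Toy data: one success, one failure caused by an unhandled popup.
episodes = [
    Episode("open settings", [Step("home", "tap:settings")], success=True),
    Episode(
        "send message",
        [Step("home", "tap:messages"), Step("popup", "tap:send")],
        success=False,
    ),
]
references = {"send message": ["tap:messages", "tap:dismiss_popup", "tap:send"]}

samples = harvest_corrections(episodes, references)
print(success_rate(episodes))        # 0.5
print(samples[0].corrected_action)   # tap:dismiss_popup
```

In a real pipeline, the harvested samples would be fed back into the next training round, and the success rate would be measured again on devices to close the loop.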
Topics
- GUI Agents
- Mobile AI
- Reinforcement Learning
- Real-Device Training
- Multimodal AI
- Error-Driven Learning
Best for: AI Scientist, Research Scientist, Machine Learning Engineer, AI Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.