Xiaomi-GUI-0 Technical Report

2026-06-30 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Internet of Things (IoT) & Connected Devices · Depth: Expert, quick

Summary

Xiaomi-GUI-0 is a native multimodal graphical user interface (GUI) agent built for real mobile environments, with the goal of closing the gap between benchmark scores and real-world usability. It completes user tasks end-to-end through interface actions such as tapping and text entry. The agent is trained and evaluated in a real-device closed loop, on a real-device-dominant hybrid infrastructure in which physical devices are the primary execution environment. Its training data is multi-source: high-frequency tasks, long-tail intents, and capability-enhancement data for reflection and memory, augmented by an error-driven data flywheel. Training follows a three-stage progressive pipeline: supervised fine-tuning, step-level reinforcement learning, and agentic reinforcement learning. Xiaomi-GUI-0 achieves 72.0% success on RealMobile and 78.9% on AndroidWorld, and is reported to significantly improve execution stability and abnormal-state recognition in real tasks.

Key takeaway

For AI engineers building GUI agents for mobile applications: traditional benchmarks often misrepresent real-world performance because they do not capture dynamic interface layouts and abnormal states. Prioritize real-device closed-loop training and evaluation, coupled with an error-driven data flywheel, to improve execution stability and abnormal-state recognition. This moves development beyond simulated environments and toward agents that hold up in complex, real-world mobile interactions.

Key insights

Real-device closed-loop training and error-driven data are crucial for robust GUI agent performance in dynamic mobile environments.

Principles

Offline trajectories and simulated environments misrepresent real-world GUI agent performance.
Real-device execution environments are essential for characterizing execution stability.
An error-driven data flywheel improves agent performance by converting failures into corrected actions.
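
To make the closed-loop idea in these principles concrete, here is a minimal Python sketch of an observe, act, execute loop. Everything in it is an illustrative assumption rather than the report's implementation: the `Action` vocabulary, the `FakeDevice` stand-in (a real setup would drive a physical phone), and the `scripted_policy` that plays the role of the model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Action:
    kind: str          # "tap", "type", or "done" (hypothetical action vocabulary)
    target: str = ""   # UI element id, a stand-in for screen coordinates
    text: str = ""     # payload for "type" actions


@dataclass
class FakeDevice:
    """In-memory stand-in for a physical phone (assumption for illustration)."""
    screen: str = "home"
    typed: List[str] = field(default_factory=list)

    def observe(self) -> str:
        # A real agent would receive a screenshot; here it is just a screen name.
        return self.screen

    def execute(self, action: Action) -> None:
        if action.kind == "tap" and self.screen == "home" and action.target == "search_box":
            self.screen = "search"
        elif action.kind == "type" and self.screen == "search":
            self.typed.append(action.text)
            self.screen = "results"


def run_episode(
    device: FakeDevice,
    policy: Callable[[str], Action],
    max_steps: int = 10,
) -> List[Tuple[str, Action]]:
    """Closed loop: each action is chosen from the state the device is actually in."""
    trajectory = []
    for _ in range(max_steps):
        observation = device.observe()
        action = policy(observation)
        trajectory.append((observation, action))
        if action.kind == "done":
            break
        device.execute(action)
    return trajectory


def scripted_policy(observation: str) -> Action:
    """Hypothetical policy standing in for the model."""
    if observation == "home":
        return Action("tap", target="search_box")
    if observation == "search":
        return Action("type", text="weather")
    return Action("done")
```

The point of the loop is that the next observation comes from executing the previous action, so layout changes or unexpected dialogs would surface during the rollout instead of being hidden by a fixed offline trajectory.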

Method

Xiaomi-GUI-0 employs a progressive three-stage training pipeline: supervised fine-tuning, step-level reinforcement learning, and agentic reinforcement learning, integrated with an error-driven data flywheel.
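
The stage ordering can be sketched as follows. This summary does not define the reward signals, so the sketch assumes one common reading of the terms: step-level RL scores each action against a reference action, while agentic RL scores the outcome of a whole rollout. The function names, the string-encoded actions, and the checkpoint naming are all hypothetical.

```python
from typing import List, Sequence

# Stage names taken from the summary; the chaining logic is an assumption.
STAGES = ("supervised_fine_tuning", "step_level_rl", "agentic_rl")


def step_level_rewards(predicted: Sequence[str], reference: Sequence[str]) -> List[float]:
    """Assumed step-level signal: one reward per step, 1.0 on an exact action match."""
    return [1.0 if p == r else 0.0 for p, r in zip(predicted, reference)]


def episode_level_reward(task_succeeded: bool) -> float:
    """Assumed agentic signal: a single outcome reward for the whole rollout."""
    return 1.0 if task_succeeded else 0.0


def run_pipeline(initial_checkpoint: str, stages: Sequence[str] = STAGES) -> List[str]:
    """Chain stages so each one starts from the previous stage's checkpoint."""
    checkpoints = [initial_checkpoint]
    for stage in stages:
        # Placeholder for actual training: only the checkpoint lineage is recorded.
        checkpoints.append(f"{checkpoints[-1]}+{stage}")
    return checkpoints
```

Under this reading, the step-level stage gives dense feedback on individual actions, and the agentic stage optimizes for end-to-end task completion.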

In practice

Develop GUI agents using a real-device-dominant hybrid infrastructure for data collection and evaluation.
Implement an error-driven data flywheel to generate corrected actions and recovery demonstrations from agent failures.
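
One round of such a flywheel might look like the sketch below. It is a simplified illustration under stated assumptions, not the report's pipeline: `FailedStep`, `TrainingExample`, and `flywheel_round` are hypothetical names, and the `corrector` callable stands in for whatever produces the corrected action (for example a human annotator or a stronger model).

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class FailedStep:
    observation: str    # state in which the agent went wrong
    wrong_action: str   # what the agent actually did


@dataclass(frozen=True)
class TrainingExample:
    observation: str
    action: str
    source: str         # provenance tag, e.g. "seed" or "error_correction"


def flywheel_round(
    failures: Iterable[FailedStep],
    corrector: Callable[[FailedStep], Optional[str]],
    dataset: List[TrainingExample],
) -> int:
    """Convert failed steps into corrected examples; return how many were added."""
    seen = {(ex.observation, ex.action) for ex in dataset}
    added = 0
    for failure in failures:
        fixed = corrector(failure)
        # Skip failures with no known fix, or where the "fix" repeats the mistake.
        if fixed is None or fixed == failure.wrong_action:
            continue
        key = (failure.observation, fixed)
        if key in seen:
            continue  # avoid flooding the dataset with duplicate corrections
        seen.add(key)
        dataset.append(TrainingExample(failure.observation, fixed, "error_correction"))
        added += 1
    return added
```

A usage example: with `corrections = {"permission_dialog": "tap:allow"}` and `corrector=lambda f: corrections.get(f.observation)`, two failures on the same permission dialog yield a single new example, and failures without a known correction are left out until one is available.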

Topics

GUI Agents
Mobile AI
Reinforcement Learning
Real-Device Training
Multimodal AI
Error-Driven Learning

Best for: AI Scientist, Research Scientist, Machine Learning Engineer, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.