Teach-and-Repeat: Accurately Extracting Operational Knowledge from Mobile Screen Demonstrations to Empower GUI Agents
Summary
Teach VLM and the Teach-and-Repeat paradigm address the challenge of accurately extracting operational knowledge from mobile screen demonstrations. Teach VLM, the core model, is built on Qwen3-VL-8B-Instruct and translates mobile screen trajectories into step-wise natural-language operational knowledge by analyzing keyframes. It is trained with a systematic Data Flywheel for scalable data acquisition and evaluated on the new Chinese Mobile Screen Teach Benchmark. In operation semantics prediction, Teach VLM outperforms existing vision-language models, reaching 58.84% Operation Semantic Accuracy on Android-In-The-Zoo and 42.99% on GUIOdyssey and surpassing the strongest baseline by up to 26.44 percentage points. The Teach-and-Repeat paradigm then uses the extracted knowledge as an interpretable procedural reference, raising Task Success Rates for downstream execution agents in Android World by 7.33 to 11.21 percentage points across various backbones.
Key takeaway
For Machine Learning Engineers developing GUI agents, integrating demonstration-derived operational knowledge can raise task success rates: the reported gains in Android World are 7.33 to 11.21 percentage points. Consider adding a "teach" phase that extracts natural-language procedural references from user demonstrations and then uses them to guide your execution agents. This approach reduces planning difficulty and mitigates common issues such as ineffective exploration loops, letting you focus on strengthening low-level execution capabilities.
Key insights
Teach VLM accurately extracts step-wise operational knowledge from mobile screen demonstrations, enhancing GUI agent performance.
Principles
- Operational knowledge improves task planning.
- Decouple knowledge extraction from execution.
- Iterative data flywheel enhances VLM accuracy.
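The decoupling principle above can be sketched as two stages that only share a plain-text artifact. This is a minimal illustration under stated assumptions, not the authors' implementation: `OperationalKnowledge`, `build_agent_prompt`, `teach_and_repeat`, and the prompt wording are all hypothetical names chosen for this example, and the extractor and executor are passed in as callables so either side can be swapped without touching the other.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class OperationalKnowledge:
    """Step-wise natural-language knowledge extracted from one demonstration."""
    task: str
    steps: List[str]


def build_agent_prompt(goal: str, knowledge: OperationalKnowledge) -> str:
    """Format the extracted steps as a procedural reference placed after the goal."""
    lines = [
        f"Goal: {goal}",
        "Reference procedure from a demonstration (a guide, not a script):",
    ]
    lines += [f"{i}. {step}" for i, step in enumerate(knowledge.steps, start=1)]
    lines.append("Decide the next action from the current screen.")
    return "\n".join(lines)


def teach_and_repeat(
    keyframes: List[str],
    goal: str,
    extract: Callable[[List[str]], OperationalKnowledge],
    execute: Callable[[str], str],
) -> str:
    """Teach: extract knowledge once. Repeat: hand it to any execution agent."""
    knowledge = extract(keyframes)  # hypothetical stand-in for the extraction model
    return execute(build_agent_prompt(goal, knowledge))


# Toy stand-ins so the sketch runs without any model (assumptions, not real APIs).
def fake_extract(keyframes: List[str]) -> OperationalKnowledge:
    return OperationalKnowledge(
        task="set an alarm",
        steps=[f"Tap {name}" for name in keyframes],
    )


prompt = teach_and_repeat(
    ["Clock", "Alarm", "Add"],
    "Set an alarm for 7:00",
    fake_extract,
    lambda p: p,  # identity "agent" that just returns the prompt it received
)
print(prompt)
```

Because the only interface between the two stages is the prompt text, the same extracted procedure could in principle be reused across different execution backbones.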
Method
Teach VLM is trained iteratively with a data flywheel that has four stages: keyframe extraction, VLM pre-annotation, auto-evaluation with manual feedback, and model retraining. The resulting model generates natural-language descriptions of each operation.
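The loop described above can be sketched as follows. This is a rough illustration only: the summary does not say how auto-evaluation and manual feedback interact, so routing low-scoring drafts to a reviewer via a `threshold` is an assumption, and `flywheel_round`, `run_flywheel`, and every callable passed in are hypothetical stand-ins for real models and annotators.

```python
from typing import Callable, Dict, List

Annotator = Callable[[List[str]], str]


def flywheel_round(
    trajectories: Dict[str, List[str]],
    pre_annotate: Annotator,
    auto_score: Callable[[List[str], str], float],
    manual_review: Callable[[List[str], str], str],
    threshold: float = 0.8,
) -> List[dict]:
    """Label each trajectory's keyframes, sending weak drafts to manual review."""
    labeled = []
    for traj_id, keyframes in trajectories.items():
        draft = pre_annotate(keyframes)        # VLM pre-annotation
        score = auto_score(keyframes, draft)   # auto-evaluation
        if score >= threshold:
            label, source = draft, "auto"
        else:
            # Assumed policy: only low-scoring drafts get manual feedback.
            label, source = manual_review(keyframes, draft), "manual"
        labeled.append({"id": traj_id, "label": label, "source": source})
    return labeled


def run_flywheel(
    trajectories: Dict[str, List[str]],
    pre_annotate: Annotator,
    auto_score: Callable[[List[str], str], float],
    manual_review: Callable[[List[str], str], str],
    retrain: Callable[[Annotator, List[dict]], Annotator],
    rounds: int = 2,
) -> List[dict]:
    """Repeat label -> retrain so each round starts from a better annotator."""
    dataset: List[dict] = []
    for _ in range(rounds):
        dataset.extend(
            flywheel_round(trajectories, pre_annotate, auto_score, manual_review)
        )
        pre_annotate = retrain(pre_annotate, dataset)  # model retraining
    return dataset


# Toy stand-ins (keyframes are represented by screen names for brevity).
demo = {"t1": ["home", "settings"], "t2": ["home"]}
batch = flywheel_round(
    demo,
    pre_annotate=lambda kf: "tap " + kf[-1],
    auto_score=lambda kf, draft: 1.0 if len(kf) > 1 else 0.0,
    manual_review=lambda kf, draft: draft + " (corrected)",
)
```

In this toy run the first trajectory is accepted automatically and the second is routed to the reviewer, which is the property that lets labeling scale while keeping a human in the loop for uncertain cases.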
In practice
- Use keyframe extraction to filter video noise.
- Inject operational knowledge as procedural reference.
- Employ auto-evaluation for scalable data labeling.
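As one way to picture the keyframe-filtering step in the list above, the sketch below keeps a frame only when it differs enough from the last kept frame. The summary does not describe the paper's actual keyframe method, so this frame-differencing heuristic, the function names, and the `threshold` value are all assumptions for illustration. Frames are small grayscale grids of integers so the example needs no image library.

```python
from typing import List, Sequence

Frame = Sequence[Sequence[int]]  # grayscale pixels, 0-255


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average absolute pixel difference between two same-sized frames."""
    total = 0
    count = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    return total / count if count else 0.0


def extract_keyframes(frames: List[Frame], threshold: float = 10.0) -> List[int]:
    """Return indices of frames that differ from the last kept frame by more than threshold."""
    if not frames:
        return []
    kept = [0]  # always keep the first frame as the starting screen
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[kept[-1]], frames[i]) > threshold:
            kept.append(i)
    return kept


# Four 2x2 frames: a near-duplicate, a screen change, then another near-duplicate.
frames = [
    [[0, 0], [0, 0]],
    [[0, 0], [0, 1]],
    [[255, 255], [255, 255]],
    [[255, 255], [255, 250]],
]
keyframe_indices = extract_keyframes(frames)
```

Comparing against the last kept frame rather than the immediately preceding one avoids keeping a long run of frames that each change only slightly, such as during a scroll animation.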
Topics
- GUI Agents
- Operational Knowledge Extraction
- Mobile Screen Understanding
- Vision-Language Models
- Data Flywheel
- Task Automation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.