Teach-and-Repeat: Accurately Extracting Operational Knowledge from Mobile Screen Demonstrations to Empower GUI Agents

2026-06-05 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

Teach VLM and the Teach-and-Repeat paradigm address the challenge of accurately extracting operational knowledge from mobile screen demonstrations. Teach VLM, the core model, is built on Qwen3-VL-8B-Instruct and translates mobile screen trajectories into step-wise natural-language operational knowledge by analyzing keyframes. It is trained using a systematic Data Flywheel for scalable data acquisition and evaluated with the new Chinese Mobile Screen Teach Benchmark. Teach VLM outperforms existing vision-language models in operation semantics prediction, achieving 58.84% Operation Semantic Accuracy on Android-In-The-Zoo and 42.99% on GUIOdyssey, surpassing the strongest baseline by up to 26.44 percentage points. The Teach-and-Repeat paradigm then uses the extracted knowledge as an interpretable procedural reference, raising Task Success Rates for downstream execution agents in Android World by 7.33 to 11.21 percentage points across various backbones.

Key takeaway

For Machine Learning Engineers developing GUI agents, integrating demonstration-derived operational knowledge can raise task success rates (by 7.33 to 11.21 percentage points in Android World, per the reported results). Consider adding a "teach" phase that extracts natural-language procedural references from user demonstrations and uses them to guide your execution agents. This approach reduces planning difficulty and mitigates common issues such as ineffective exploration loops, letting you focus on strengthening low-level execution capabilities.

Key insights

Teach VLM accurately extracts step-wise operational knowledge from mobile screen demonstrations, enhancing GUI agent performance.

Principles

Operational knowledge improves task planning.
Decouple knowledge extraction from execution.
Iterative data flywheel enhances VLM accuracy.
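
The decoupling principle can be illustrated with a minimal sketch, which is not the paper's implementation. A "teach" step turns keyframes into step-wise text, and a separate "repeat" step passes that text to an execution agent as a procedural reference. All names here (`OperationalKnowledge`, `build_agent_prompt`, `teach_and_repeat`), the prompt wording, and the `extractor` and `executor` callables are hypothetical stand-ins for a Teach-VLM-style model and a downstream GUI agent.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass
class OperationalKnowledge:
    """Step-wise natural-language description of a demonstrated task."""
    task: str
    steps: List[str]


def build_agent_prompt(task: str, knowledge: OperationalKnowledge) -> str:
    """Format extracted steps as a procedural reference for an execution agent.

    The prompt wording is an assumption made for illustration only.
    """
    numbered = "\n".join(
        f"{i}. {step}" for i, step in enumerate(knowledge.steps, start=1)
    )
    return (
        f"Task: {task}\n"
        f"Procedural reference (from a demonstration of: {knowledge.task}):\n"
        f"{numbered}\n"
        "Follow the reference where it matches the current screen; "
        "adapt if the interface differs."
    )


def teach_and_repeat(
    task: str,
    keyframes: Sequence[Any],
    extractor: Callable[[Sequence[Any]], List[str]],
    executor: Callable[[str], Any],
) -> Any:
    """Teach: extract steps from keyframes. Repeat: run the agent with them.

    `extractor` and `executor` are hypothetical callables, so either side
    can be swapped without touching the other.
    """
    knowledge = OperationalKnowledge(task=task, steps=extractor(keyframes))
    return executor(build_agent_prompt(task, knowledge))
```

Because the only thing crossing the boundary is plain text, the execution backbone can be replaced without retraining the extractor, which is consistent with the summary's report of gains across several backbones.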

Method

Teach VLM is trained iteratively with a data flywheel that cycles through keyframe extraction, VLM pre-annotation, auto-evaluation with manual feedback, and model retraining. The output is natural-language operation descriptions.
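
The flywheel loop can be sketched roughly as follows, under stated assumptions. The current model pre-annotates each batch, an automatic scorer accepts confident samples, low-scoring samples go to manual review, and the model is retrained on the grown dataset. The function names, the `threshold` value of 0.8, and the callable interfaces are all hypothetical; the summary does not specify how scoring or review is implemented.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Sample = Tuple[str, str]  # (demonstration id, operation description)
Annotator = Callable[[str], str]


def flywheel_round(
    unlabeled: Iterable[str],
    annotate: Annotator,
    score: Callable[[str, str], float],
    review: Callable[[str, str], Optional[str]],
    threshold: float = 0.8,  # assumed cut-off, not taken from the paper
) -> List[Sample]:
    """One pass: pre-annotate, auto-evaluate, fall back to manual review."""
    accepted: List[Sample] = []
    for item in unlabeled:
        description = annotate(item)
        if score(item, description) >= threshold:
            accepted.append((item, description))
            continue
        # Low auto-evaluation score: ask a human to correct or reject it.
        corrected = review(item, description)
        if corrected is not None:
            accepted.append((item, corrected))
    return accepted


def run_flywheel(
    batches: Iterable[Iterable[str]],
    annotate: Annotator,
    score: Callable[[str, str], float],
    review: Callable[[str, str], Optional[str]],
    retrain: Callable[[List[Sample]], Annotator],
    threshold: float = 0.8,
) -> List[Sample]:
    """Grow the dataset batch by batch, retraining the annotator each round."""
    dataset: List[Sample] = []
    for batch in batches:
        dataset.extend(flywheel_round(batch, annotate, score, review, threshold))
        # `retrain` is a hypothetical hook returning the updated annotator.
        annotate = retrain(dataset)
    return dataset
```

In this arrangement, manual effort is spent only on samples the automatic evaluator rejects, which is what lets the labeling scale as the annotator improves.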

In practice

Use keyframe extraction to filter video noise.
Inject operational knowledge as procedural reference.
Employ auto-evaluation for scalable data labeling.
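
The summary does not say how keyframes are selected, so the sketch below substitutes a simple, commonly used heuristic rather than the paper's method. It keeps a frame only when it differs enough from the last kept frame. Frames are assumed to be flattened grayscale pixel lists of equal length; `select_keyframes`, `mean_abs_diff`, and the default `threshold` are hypothetical.

```python
from typing import List, Sequence

Frame = Sequence[int]  # flattened grayscale pixels in the range 0-255


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Mean absolute per-pixel difference between two same-sized frames."""
    if len(a) != len(b):
        raise ValueError("frames must have the same size")
    if not a:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_keyframes(frames: List[Frame], threshold: float = 10.0) -> List[int]:
    """Return indices of frames that differ from the last kept frame.

    The first frame is always kept. `threshold` is an assumed tuning knob;
    a real pipeline would calibrate it on recorded demonstrations.
    """
    if not frames:
        return []
    kept = [0]
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], frames[kept[-1]]) > threshold:
            kept.append(i)
    return kept
```

Comparing against the last kept frame, rather than the immediately preceding one, prevents slow animations from slipping through as a series of small changes while still discarding near-duplicate frames.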

Topics

GUI Agents
Operational Knowledge Extraction
Mobile Screen Understanding
Vision-Language Models
Data Flywheel
Task Automation

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.