Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining
Summary
Video2GUI is an automated framework that extracts grounded GUI interaction trajectories from unlabeled internet videos, addressing the scarcity of large-scale training data for graphical user interface (GUI) agents. The framework uses a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos and convert them into structured agent trajectories. Applying this pipeline to 500 million video metadata entries, the authors constructed WildGUI, a large-scale dataset of 12 million interaction trajectories spanning more than 1,500 applications and websites. Pre-training models such as Qwen2.5-VL and MiMo-VL on WildGUI yielded consistent improvements of 5-20% on various GUI grounding and action benchmarks, matching or exceeding state-of-the-art results. The authors state that the WildGUI dataset and the Video2GUI pipeline will be released to support future research.
Key takeaway
For research scientists developing GUI agents, the WildGUI dataset and Video2GUI pipeline offer a way around data scarcity. Consider adding WildGUI to the pre-training mix for multimodal large language models: the reported gains are substantial and, in some cases, match or surpass current state-of-the-art results.
Key insights
Automatically extracting GUI interaction data from unlabeled videos improves GUI agent pre-training, with reported gains of 5-20% on GUI grounding and action benchmarks.
Principles
- Unlabeled video data is a rich source for GUI trajectories.
- Coarse-to-fine filtering enhances data quality.
Method
Video2GUI employs a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos from 500 million metadata entries and converts them into structured interaction trajectories for agent training.
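The coarse-to-fine idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the summary does not specify the actual filter stages, so the `VideoMeta` fields, the keyword list, the `toy_scorer` function, and the 0.5 threshold are all hypothetical stand-ins. The point is only the shape of the technique: a cheap pass over metadata discards most entries, and a more expensive scorer runs on what survives.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass
class VideoMeta:
    """Hypothetical metadata record; real entries would carry more fields."""
    video_id: str
    title: str
    description: str


# Assumed keyword list for the coarse stage (illustrative only).
GUI_KEYWORDS = ("tutorial", "how to", "walkthrough", "screen recording")


def coarse_filter(items: Iterable[VideoMeta]) -> Iterator[VideoMeta]:
    """Cheap pass: keep entries whose text mentions any GUI keyword."""
    for item in items:
        text = f"{item.title} {item.description}".lower()
        if any(keyword in text for keyword in GUI_KEYWORDS):
            yield item


def fine_filter(
    items: Iterable[VideoMeta],
    scorer: Callable[[VideoMeta], float],
    threshold: float,
) -> List[VideoMeta]:
    """Expensive pass: keep entries whose quality score meets the threshold."""
    return [item for item in items if scorer(item) >= threshold]


def toy_scorer(item: VideoMeta) -> float:
    """Stand-in for a costly model-based scorer: fraction of keywords present."""
    text = f"{item.title} {item.description}".lower()
    hits = sum(keyword in text for keyword in GUI_KEYWORDS)
    return hits / len(GUI_KEYWORDS)


def coarse_to_fine(
    items: Iterable[VideoMeta],
    scorer: Callable[[VideoMeta], float],
    threshold: float = 0.5,
) -> List[VideoMeta]:
    """Run the cheap filter first so the scorer only sees plausible candidates."""
    return fine_filter(coarse_filter(items), scorer, threshold)
```

In a real pipeline the scorer would presumably be a learned model inspecting video content rather than a keyword count; ordering the stages this way keeps that cost proportional to the number of coarse survivors rather than to the full metadata pool.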
In practice
- Pre-train multimodal LLMs on WildGUI for GUI tasks.
- Utilize Video2GUI for custom GUI data generation.
Topics
- GUI Agents
- Multimodal LLMs
- Video2GUI Framework
- WildGUI Dataset
- Interaction Trajectories
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.