Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining

Source: Computation and Language · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Data Science & Analytics) · Depth: Expert, quick

Summary

Video2GUI is an automated framework that extracts grounded GUI interaction trajectories from unlabeled internet videos, addressing the scarcity of large-scale training data for graphical user interface (GUI) agents. It uses a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos and convert them into structured agent trajectories. Applying this pipeline to 500 million video metadata entries, the authors built WildGUI, a dataset of 12 million interaction trajectories spanning more than 1,500 applications and websites. Pretraining models such as Qwen2.5-VL and MiMo-VL on WildGUI yielded consistent improvements of 5-20% across GUI grounding and action benchmarks, matching or exceeding state-of-the-art results. The authors state that the WildGUI dataset and the Video2GUI pipeline will be released to support future research.

Key takeaway

For research scientists developing GUI agents, the WildGUI dataset and Video2GUI pipeline target the main bottleneck in the field: the lack of large-scale interaction data. Once they are released, consider adding WildGUI to the pretraining mix for multimodal large language models; the reported gains of 5-20% on grounding and action benchmarks were enough to match or surpass prior state-of-the-art results.

Key insights

Automated extraction of GUI interaction data from unlabeled videos significantly improves GUI agent pre-training.

Principles

Method

Video2GUI employs a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos from 500 million metadata entries and converts them into structured interaction trajectories for agent training.
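The summary does not spell out the individual filtering stages, so the following is only a minimal sketch of what a coarse-to-fine pipeline of this kind could look like. Every name here (`VideoMeta`, `Step`, `Trajectory`, `GUI_HINTS`, `coarse_filter`, `fine_filter`, `build_trajectories`) is hypothetical and not taken from the paper; the keyword list, the scoring function, and the threshold are placeholder assumptions. The idea shown is that a cheap metadata check discards most candidates first, so the more expensive quality scorer and trajectory extractor only run on the survivors.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List


@dataclass
class VideoMeta:
    """Metadata for one candidate video (hypothetical schema)."""
    video_id: str
    title: str
    description: str = ""


@dataclass
class Step:
    """One grounded interaction step (hypothetical schema)."""
    action: str   # e.g. "click" or "type"
    target: str   # description of the UI element acted on


@dataclass
class Trajectory:
    """A structured interaction trajectory extracted from one video."""
    video_id: str
    steps: List[Step] = field(default_factory=list)


# Assumed keyword hints for the cheap first pass; not from the paper.
GUI_HINTS = ("tutorial", "how to", "walkthrough", "screen recording")


def coarse_filter(metas: Iterable[VideoMeta]) -> Iterator[VideoMeta]:
    """Coarse stage: keep videos whose metadata text mentions a GUI hint."""
    for meta in metas:
        text = f"{meta.title} {meta.description}".lower()
        if any(hint in text for hint in GUI_HINTS):
            yield meta


def fine_filter(
    metas: Iterable[VideoMeta],
    score: Callable[[VideoMeta], float],
    threshold: float = 0.5,
) -> Iterator[VideoMeta]:
    """Fine stage: keep videos that a costlier scorer rates at or above threshold."""
    for meta in metas:
        if score(meta) >= threshold:
            yield meta


def build_trajectories(
    metas: Iterable[VideoMeta],
    score: Callable[[VideoMeta], float],
    extract: Callable[[VideoMeta], List[Step]],
) -> List[Trajectory]:
    """Run coarse then fine filtering, then convert survivors to trajectories."""
    survivors = fine_filter(coarse_filter(metas), score)
    return [Trajectory(meta.video_id, extract(meta)) for meta in survivors]
```

In a real system the `score` and `extract` callables would presumably wrap learned models operating on video frames; here they are left as injected functions so the control flow can be exercised with stubs.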

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.