Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining

Source: cs.CL updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Data Science & Analytics) · Depth: Expert, quick

Summary

Video2GUI is an automated framework that synthesizes large-scale graphical user interface (GUI) interaction trajectories from unlabeled Internet videos. It addresses the scarcity of diverse, real-world training data for GUI agents: existing training data typically relies on costly manual annotation and is confined to narrow domains. Using a coarse-to-fine filtering strategy, Video2GUI identifies high-quality GUI tutorial videos and converts them into structured agent trajectories. Applying this pipeline to 500 million video metadata entries, the researchers constructed WildGUI, a dataset of 12 million interaction trajectories spanning more than 1,500 applications and websites. Pretraining models such as Qwen2.5-VL and MiMo-VL on WildGUI yielded consistent improvements of 5-20% on GUI grounding and action benchmarks, matching or exceeding state-of-the-art results.

Key takeaway

For research scientists developing generalized GUI agents, the WildGUI dataset and the Video2GUI pipeline offer a way around data scarcity. Consider pretraining your models on WildGUI, which yielded reported gains of 5-20% on GUI grounding and action benchmarks. The approach supports training more robust agents across a wider range of real-world applications without relying on expensive manual annotation.

Key insights

Automatically extracting GUI interaction data from unlabeled videos improves agent pretraining, with reported gains of 5-20% on GUI grounding and action benchmarks.

Method

Video2GUI extracts GUI interaction trajectories from unlabeled Internet videos in two steps: a coarse-to-fine filter first identifies high-quality tutorial content, and the selected videos are then converted into structured agent trajectories.
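
The coarse-to-fine idea can be illustrated with a minimal Python sketch. This is not the authors' implementation: the summary does not describe the actual filtering criteria, so the keyword list, the `VideoMeta` fields, the `toy_score` scorer, and the 0.5 threshold are all hypothetical stand-ins. The sketch only shows the general pattern, in which a cheap metadata check prunes most candidates before a more expensive quality scorer runs on the remainder. Conversion of the selected videos into trajectories is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical keywords; the source does not state the coarse filter's criteria.
COARSE_KEYWORDS = ("tutorial", "how to", "walkthrough")


@dataclass
class VideoMeta:
    """Hypothetical metadata record for one video."""
    video_id: str
    title: str
    description: str = ""


def coarse_filter(videos: Iterable[VideoMeta]) -> List[VideoMeta]:
    """Cheap pass over metadata only: keep entries that mention a keyword."""
    kept = []
    for v in videos:
        text = f"{v.title} {v.description}".lower()
        if any(keyword in text for keyword in COARSE_KEYWORDS):
            kept.append(v)
    return kept


def fine_filter(
    videos: Iterable[VideoMeta],
    score_fn: Callable[[VideoMeta], float],
    threshold: float = 0.5,
) -> List[VideoMeta]:
    """Expensive pass: keep candidates scored at or above the threshold."""
    return [v for v in videos if score_fn(v) >= threshold]


def coarse_to_fine(
    videos: Iterable[VideoMeta],
    score_fn: Callable[[VideoMeta], float],
    threshold: float = 0.5,
) -> List[VideoMeta]:
    """Run the cheap filter first so the scorer sees fewer candidates."""
    return fine_filter(coarse_filter(videos), score_fn, threshold)


def toy_score(v: VideoMeta) -> float:
    """Stand-in for a learned quality model (assumption, not from the source)."""
    return 0.9 if "screen recording" in v.description.lower() else 0.1
```

Calling `coarse_to_fine(videos, toy_score)` on a list of `VideoMeta` records returns only the entries that pass both stages. In a real pipeline the scorer would presumably inspect video content, which is why ordering the stages from cheap to expensive matters at the scale of hundreds of millions of metadata entries.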

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.