Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining

2026-05-15 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Data Science & Analytics · Depth: Expert, quick

Summary

Video2GUI is an automated framework that synthesizes large-scale graphical user interface (GUI) interaction trajectories from unlabeled Internet videos. It addresses the scarcity of diverse, real-world training data for GUI agents: existing datasets typically rely on costly manual annotation and are confined to narrow domains. Using a coarse-to-fine filtering strategy, Video2GUI identifies high-quality GUI tutorial videos and converts them into structured agent trajectories. Applying this pipeline to 500 million video metadata entries, the researchers constructed WildGUI, a dataset of 12 million interaction trajectories spanning more than 1,500 applications and websites. Pretraining models such as Qwen2.5-VL and Mimo-VL on WildGUI yielded consistent improvements of 5-20% on various GUI grounding and action benchmarks, matching or exceeding state-of-the-art results.

Key takeaway

For research scientists developing generalized GUI agents, the WildGUI dataset and the Video2GUI pipeline offer a way to overcome data scarcity. Consider pretraining your models on WildGUI, which produced consistent 5-20% improvements on GUI grounding and action benchmarks in the reported experiments. This approach supports training more robust agents across a wider range of real-world applications without relying on expensive manual annotation.

Key insights

Automatically extracting GUI interaction data from unlabeled videos improves agent pretraining, with reported gains of 5-20% on GUI grounding and action benchmarks.

Principles

Unlabeled video data is a scalable resource.
Coarse-to-fine filtering improves data quality.

Method

Video2GUI extracts GUI interaction trajectories from unlabeled Internet videos using a coarse-to-fine filtering strategy to identify and convert high-quality tutorial content into structured agent trajectories.
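
The coarse-to-fine idea can be illustrated with a minimal sketch. This is not the authors' implementation: the `VideoMeta` record, the keyword list, the `fine_score` callback, and the 0.5 threshold are all hypothetical. It only shows the general pattern of running a cheap metadata check first so that a more expensive quality scorer sees far fewer candidates.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class VideoMeta:
    """Hypothetical metadata record for one video."""
    video_id: str
    title: str
    description: str = ""


# Assumed keyword list for the coarse stage; the real criteria are not given in the summary.
TUTORIAL_KEYWORDS = ("tutorial", "how to", "walkthrough", "step by step")


def coarse_filter(meta: VideoMeta) -> bool:
    """Cheap stage: keep videos whose metadata looks like a GUI tutorial."""
    text = f"{meta.title} {meta.description}".lower()
    return any(keyword in text for keyword in TUTORIAL_KEYWORDS)


def coarse_to_fine(
    videos: Iterable[VideoMeta],
    fine_score: Callable[[VideoMeta], float],
    threshold: float = 0.5,
) -> Iterator[VideoMeta]:
    """Run the cheap filter first, then the expensive scorer on survivors only."""
    for meta in videos:
        if not coarse_filter(meta):
            continue  # rejected cheaply; the expensive scorer is never called
        if fine_score(meta) >= threshold:
            yield meta


if __name__ == "__main__":
    sample = [
        VideoMeta("a", "Excel tutorial for beginners"),
        VideoMeta("b", "Funny cat compilation"),
        VideoMeta("c", "How to edit photos", "blurry webcam"),
    ]
    # Stand-in for a real quality model (e.g. a classifier over sampled frames).
    stub_score = lambda m: 0.1 if "webcam" in m.description else 0.9
    print([m.video_id for m in coarse_to_fine(sample, stub_score)])
```

In a real pipeline the fine stage would presumably inspect video content rather than metadata; the stub scorer here only stands in for that step.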

In practice

Pretrain GUI agents on the WildGUI dataset.
Apply the Video2GUI pipeline for data synthesis.
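
To make "structured agent trajectories" concrete, here is a minimal sketch of how one trajectory might be represented and serialized for a pretraining corpus. The schema is an assumption made for illustration: the field names (`app`, `instruction`, `steps`, `action`, `target_xy`) and the JSON Lines output are hypothetical and do not reflect the actual WildGUI format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Step:
    """One hypothetical interaction step recovered from a video."""
    frame_time_s: float                        # timestamp of the source frame
    action: str                                # e.g. "click", "type", "scroll"
    target_xy: Optional[Tuple[int, int]] = None  # pixel coordinates, if any
    text: Optional[str] = None                 # typed text, if any


@dataclass
class Trajectory:
    """A hypothetical structured trajectory: a task plus its ordered steps."""
    app: str
    instruction: str
    steps: List[Step] = field(default_factory=list)


def to_jsonl_line(trajectory: Trajectory) -> str:
    """Serialize one trajectory as a single JSON Lines record."""
    return json.dumps(asdict(trajectory), ensure_ascii=False)


if __name__ == "__main__":
    demo = Trajectory(
        app="Spreadsheet",
        instruction="Bold the header row",
        steps=[
            Step(frame_time_s=3.2, action="click", target_xy=(120, 48)),
            Step(frame_time_s=4.0, action="click", target_xy=(310, 20)),
        ],
    )
    print(to_jsonl_line(demo))
```

One record per line keeps a large corpus streamable, which matters at the scale of millions of trajectories; any other container format would serve the same purpose.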

Topics

GUI Agents
Video2GUI Framework
WildGUI Dataset
Interaction Trajectories
Multimodal LLMs

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.