DragOn: A Benchmark and Dataset for Drag-Based GUI Interactions
Summary
DragOn is a new benchmark and training dataset for improving how GUI agents handle drag-based interactions. It addresses a gap in existing resources: drag-grounding data is an order of magnitude smaller than click-grounding data. DragOn covers four domains: text highlighting, cell selection, element resizing, and slider manipulation. The dataset comprises 286K training screenshots and 3.5M training tasks, alongside a 2000-example held-out evaluation suite. The researchers evaluated several leading models, including proprietary ones such as GPT and Claude and open-weight models such as Qwen, Kimi, and Holo. A Qwen vision-language model (VLM) fine-tuned on the DragOn training data showed improved performance, suggesting the dataset can make current models more effective on downstream computer-use tasks.
Key takeaway
For AI engineers building GUI automation agents: if your models struggle with drag-based interactions, consider fine-tuning your vision-language models on DragOn's 286K screenshots and 3.5M tasks, which may improve performance on tasks such as text highlighting and element resizing. Use the benchmark's held-out evaluation suite to validate your agent's drag capabilities.
Key insights
DragOn provides a crucial, large-scale dataset and benchmark to advance GUI agents' drag-based interaction capabilities.
Principles
- Large-scale data is critical for complex GUI agent tasks.
- Diverse drag interaction types improve model generalization.
Method
The authors collected 286K training screenshots and 3.5M tasks across four drag-based GUI domains, then evaluated proprietary and open-weight VLMs on a held-out 2000-example suite.
In practice
- Fine-tune VLMs using the DragOn training dataset.
- Benchmark GUI agents on DragOn's evaluation suite.
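To make the benchmarking step concrete, here is a minimal sketch of how a predicted drag could be scored against a reference drag. This summary does not describe DragOn's actual metric or data format, so everything here is an assumption for illustration: the `Drag` record, the `drag_hit` rule (both endpoints within a pixel tolerance), and the 10-pixel default are hypothetical.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Drag:
    """Hypothetical drag record in screen pixels: press at (x0, y0), release at (x1, y1)."""
    x0: float
    y0: float
    x1: float
    y1: float


def drag_hit(pred: Drag, target: Drag, tol_px: float = 10.0) -> bool:
    """Return True if both endpoints of pred lie within tol_px of the target's endpoints.

    This tolerance rule is an assumed stand-in, not the benchmark's official criterion.
    """
    start_err = hypot(pred.x0 - target.x0, pred.y0 - target.y0)
    end_err = hypot(pred.x1 - target.x1, pred.y1 - target.y1)
    return start_err <= tol_px and end_err <= tol_px


def success_rate(preds: list[Drag], targets: list[Drag], tol_px: float = 10.0) -> float:
    """Fraction of predictions that count as hits under drag_hit."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(drag_hit(p, t, tol_px) for p, t in zip(preds, targets))
    return hits / len(preds)


if __name__ == "__main__":
    # Two made-up examples: a slider drag and a vertical selection drag.
    targets = [Drag(100, 200, 300, 200), Drag(50, 50, 50, 150)]
    preds = [Drag(103, 204, 298, 201), Drag(50, 50, 90, 150)]
    print(success_rate(preds, targets))
```

A real evaluation would likely need domain-specific criteria, for example checking which text span or cell range a drag actually selects, rather than a single pixel tolerance.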
Topics
- GUI Agents
- Drag Grounding
- Datasets
- Benchmarking
- Vision-Language Models
- Human-Computer Interaction
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.