DragOn: A Benchmark and Dataset for Drag-Based GUI Interactions
Summary
DragOn is a benchmark and training dataset for drag-based interactions in vision-based GUI agents, an area where current models lag well behind their performance on click-grounding tasks. The dataset comprises 286K training screenshots and 3.5M training tasks, plus a held-out evaluation suite of 2,000 examples. It covers four drag grounding domains: text highlighting, cell selection, element resizing, and slider manipulation. Evaluations of proprietary models such as GPT-5.4 and Claude Opus 4.7, and of open-weight models such as Qwen and Kimi-K2.5, show that frontier models score below 30% overall. A Qwen VLM fine-tuned on DragOn reached a 35.3% success rate, a 33-point absolute improvement, and surpassed every frontier model evaluated, which indicates that the dataset is effective at improving drag grounding.
Key takeaway
AI engineers building GUI agents should treat drag grounding as a first-class capability. Current frontier models struggle with complex drag interactions and score below 30% on the DragOn benchmark. Fine-tuning a vision-language model on a specialized dataset such as DragOn can yield substantial gains, as shown by the 33-point absolute improvement reported for the fine-tuned Qwen model. "Rendering-as-supervision" is worth considering as an efficient way to generate pixel-accurate training data and make agents more usable in practice.
Key insights
DragOn addresses a critical gap in drag grounding data for GUI agents and significantly improves VLM performance on drag tasks.
Principles
- "Rendering-as-supervision" yields pixel-exact annotations.
- Drag grounding is crucial for real-world computer-use agents.
- Targeted data fine-tuning outperforms generalist frontier models.
Method
The data is constructed by using the layout geometry of document renderers (PDF, XLSX, PPTX, HTML) as a labeling function. Labels come from either analytic or probe-based label maps.
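To make the idea of an analytic label map concrete, here is a minimal sketch for the cell selection domain. It assumes the renderer exposes the grid origin, column widths, and row heights in screenshot pixels. The `Grid` class and the `cell_center` and `selection_drag` helpers are hypothetical names chosen for illustration and are not taken from the DragOn paper.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Grid:
    """Hypothetical layout of a rendered spreadsheet, in screenshot pixels."""
    origin_x: float                  # left edge of column 0
    origin_y: float                  # top edge of row 0
    col_widths: Tuple[float, ...]    # width of each column
    row_heights: Tuple[float, ...]   # height of each row


def cell_center(grid: Grid, row: int, col: int) -> Point:
    """Analytic label map: (row, col) -> pixel center of that cell."""
    x = grid.origin_x + sum(grid.col_widths[:col]) + grid.col_widths[col] / 2
    y = grid.origin_y + sum(grid.row_heights[:row]) + grid.row_heights[row] / 2
    return (x, y)


def selection_drag(grid: Grid, first: Tuple[int, int],
                   last: Tuple[int, int]) -> Dict[str, Point]:
    """Drag label for selecting the cell range first..last (row, col pairs)."""
    return {"start": cell_center(grid, *first), "end": cell_center(grid, *last)}


grid = Grid(origin_x=50, origin_y=100,
            col_widths=(80, 120, 80), row_heights=(20, 20, 20))
label = selection_drag(grid, first=(0, 0), last=(2, 1))
print(label)
```

Because the coordinates are computed directly from the layout rather than annotated by hand, the label is exact by construction, which is the property the summary attributes to rendering-as-supervision.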
In practice
- Fine-tune VLMs on drag-specific datasets.
- Prioritize drag grounding for GUI agent development.
- Use rendering-as-supervision for data generation.
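As an illustration of the last point, the sketch below shows how a single rendered screenshot could yield several drag tasks, here for a horizontal slider. It assumes a linear mapping from slider value to pixel position along the track. The function names, the task dictionary fields, and the instruction wording are all hypothetical and do not reflect the DragOn data format.

```python
from typing import Dict, List, Sequence


def slider_x(track_left: float, track_width: float,
             vmin: float, vmax: float, value: float) -> float:
    """Pixel x-coordinate of the handle for a given value (assumes a linear track)."""
    frac = (value - vmin) / (vmax - vmin)
    return track_left + frac * track_width


def slider_tasks(track_left: float, track_width: float, track_y: float,
                 vmin: float, vmax: float, current: float,
                 targets: Sequence[float]) -> List[Dict]:
    """Build one drag task per target value from a single rendered slider."""
    start = (slider_x(track_left, track_width, vmin, vmax, current), track_y)
    tasks = []
    for target in targets:
        if target == current:
            continue  # a drag to the current value is not a useful task
        end = (slider_x(track_left, track_width, vmin, vmax, target), track_y)
        tasks.append({
            "instruction": f"Drag the slider from {current} to {target}.",
            "start": start,
            "end": end,
        })
    return tasks


tasks = slider_tasks(track_left=100, track_width=200, track_y=300,
                     vmin=0, vmax=100, current=25, targets=[0, 25, 50, 100])
for task in tasks:
    print(task)
```

Enumerating targets this way is one plausible route to having many more tasks than screenshots, as in the 3.5M tasks over 286K screenshots reported above, though the source does not describe how its tasks were actually sampled.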
Topics
- GUI Agents
- Drag Grounding
- Vision-Language Models
- Dataset Benchmarking
- Rendering-as-Supervision
- Computer-Use Automation
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.