DragOn: A Benchmark and Dataset for Drag-Based GUI Interactions

2026-04-07 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Data Science & Analytics · Depth: Expert, extended

Summary

DragOn is a new benchmark and training dataset designed to improve vision-based GUI agents' performance on complex drag-based interactions, an area where current models significantly underperform compared to click-grounding tasks. The dataset comprises 286K training screenshots and 3.5M training tasks, alongside a 2,000-example held-out evaluation suite. It covers four distinct drag grounding domains: text highlighting, cell selection, element resizing, and slider manipulation. Evaluations on proprietary models like GPT-5.4 and Claude Opus 4.7, and open-weight models such as Qwen and Kimi-K2.5, show that frontier models score below 30% overall. However, a Qwen VLM fine-tuned on DragOn achieved a 35.3% success rate, a 33-point absolute improvement, surpassing all evaluated frontier models and demonstrating the dataset's effectiveness in enhancing drag grounding capabilities.

Key takeaway

AI engineers building GUI agents should treat drag grounding as a first-class capability. Current frontier models struggle with complex drag interactions, scoring below 30% overall on the DragOn benchmark. Fine-tuning a vision-language model on a specialized dataset such as DragOn can yield substantial gains: the fine-tuned Qwen VLM reached a 35.3% success rate, a 33-point absolute improvement, and surpassed every evaluated frontier model. Consider adopting "rendering-as-supervision" for efficient, pixel-accurate data generation to improve your agent's real-world usability.

Key insights

DragOn fills a gap in drag grounding data for GUI agents: a Qwen VLM fine-tuned on it reached a 35.3% success rate, ahead of every evaluated frontier model.

Principles

"Rendering-as-supervision" yields pixel-exact annotations.
Drag grounding is crucial for real-world computer-use agents.
Targeted data fine-tuning outperforms generalist frontier models.

Method

Data is constructed by exploiting renderer geometry (PDF, XLSX, PPTX, HTML) as a labeling function, using analytic or probe-based label maps.
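
The idea of using renderer geometry as a labeling function can be illustrated with a minimal sketch for the text-highlighting domain. This is not DragOn's pipeline: the toy monospace layout below stands in for a real renderer (PDF, HTML, and so on), and every name in it (CharBox, layout_monospace, drag_label_for_span, make_task) as well as the task dictionary format is a hypothetical chosen for illustration. The point is only that once the renderer knows where each glyph sits, the drag start and end points follow analytically, with no human annotation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharBox:
    """Bounding box of one rendered character, in pixels (hypothetical type)."""
    char: str
    x0: float
    y0: float
    x1: float
    y1: float


def layout_monospace(text, origin=(10.0, 20.0), char_w=8.0, line_h=16.0, cols=40):
    """Toy renderer: place characters on a fixed grid, wrapping every `cols` chars.

    Assumption: a real pipeline would read these boxes from the actual
    renderer's layout instead of computing them on a grid.
    """
    ox, oy = origin
    boxes = []
    for i, ch in enumerate(text):
        row, col = divmod(i, cols)
        x0 = ox + col * char_w
        y0 = oy + row * line_h
        boxes.append(CharBox(ch, x0, y0, x0 + char_w, y0 + line_h))
    return boxes


def drag_label_for_span(boxes, start, end):
    """Return (press, release) points that would highlight characters [start, end)."""
    if not 0 <= start < end <= len(boxes):
        raise ValueError("span out of range")
    first, last = boxes[start], boxes[end - 1]
    # Press at the left edge of the first character, release at the right edge
    # of the last one, both at the vertical midpoint of their line.
    press = (first.x0, (first.y0 + first.y1) / 2)
    release = (last.x1, (last.y0 + last.y1) / 2)
    return press, release


def make_task(text, phrase):
    """Build one (instruction, drag label) pair from known layout geometry."""
    start = text.index(phrase)
    boxes = layout_monospace(text)
    press, release = drag_label_for_span(boxes, start, start + len(phrase))
    return {"instruction": f'Highlight "{phrase}"', "start": press, "end": release}


if __name__ == "__main__":
    print(make_task("Drag to select this phrase", "select"))
```

Because the label is computed from the same geometry that produced the screenshot, it is exact by construction; the same pattern would extend to cells, resize handles, or sliders wherever the renderer exposes element coordinates.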

In practice

Fine-tune VLMs on drag-specific datasets.
Prioritize drag grounding for GUI agent development.
Use rendering-as-supervision for data generation.

Topics

GUI Agents
Drag Grounding
Vision-Language Models
Dataset Benchmarking
Rendering-as-Supervision
Computer-Use Automation

Code references

hcompai/DragOn

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.