DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding

2026-05-18 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

DRS-GUI is a training-free framework that improves GUI grounding for Multimodal Large Language Models (MLLMs) by dynamically searching for instruction-relevant regions in high-resolution screenshots. It introduces a lightweight UI Perceptor that executes three human-like perceptual actions (Focus, Shift, and Scatter) to progressively explore the interface and generate region proposals. An Action Planner based on Monte Carlo Tree Search (MCTS) schedules these actions dynamically. It is guided by a region quality reward that scores candidate regions, keeps the most instruction-relevant ones, and prunes redundant UI elements. Experiments on the ScreenSpot-Pro, ScreenSpot-V1, and ScreenSpot-V2 benchmarks show significant improvements, including a 14% accuracy gain on ScreenSpot-Pro for MLLMs such as Qwen2.5-VL-7B and UGround-V1-7B, with no additional model training or fine-tuning.

Key takeaway

For research scientists developing GUI agents, DRS-GUI offers a training-free way to improve grounding accuracy, especially on complex, high-resolution interfaces. Consider integrating its dynamic region search into your MLLM pipeline to reduce visual clutter and stabilize localization without retraining or fine-tuning, which can make your agents more reliable and interpretable.

Key insights

DRS-GUI enhances MLLM GUI grounding via a training-free dynamic region search and MCTS-based action planning.

Principles

Adaptive perceptual scope improves grounding.
Step-wise region evaluation prevents error accumulation.
Interactive elements are better grounding targets.

Method

DRS-GUI uses a UI Perceptor to execute Focus, Shift, and Scatter actions. An MCTS Action Planner schedules these actions, guided by a composite region quality reward (relevance, coverage, concentration), to identify the best regions for grounding.
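
To make the composite reward concrete, here is a minimal Python sketch of one way a region could be scored on relevance, coverage, and concentration. The summary names the three terms but not their formulas, so every definition below is an assumption for illustration: the `Box` and `Element` types, the per-element `relevance` score, the weights, and the function name `region_quality` are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)

    def contains(self, x: float, y: float) -> bool:
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


@dataclass(frozen=True)
class Element:
    cx: float          # element center, x
    cy: float          # element center, y
    relevance: float   # hypothetical instruction-relevance score in [0, 1]


def region_quality(region, elements, screen, weights=(0.5, 0.3, 0.2)):
    """Toy composite reward; the three term definitions are assumptions."""
    inside = [e for e in elements if region.contains(e.cx, e.cy)]
    if not inside or screen.area <= 0:
        return 0.0  # an empty region carries no grounding evidence
    total = sum(e.relevance for e in elements)
    mass = sum(e.relevance for e in inside)
    # Relevance: average relevance of the elements the region keeps.
    relevance = mass / len(inside)
    # Coverage: share of all relevance on screen that the region captures.
    coverage = mass / total if total > 0 else 0.0
    # Concentration: smaller regions score higher (less visual clutter).
    concentration = 1.0 - region.area / screen.area
    w_rel, w_cov, w_con = weights
    return w_rel * relevance + w_cov * coverage + w_con * concentration


screen = Box(0, 0, 100, 100)
elements = [Element(10, 10, 0.9), Element(90, 90, 0.1)]
tight = region_quality(Box(0, 0, 20, 20), elements, screen)
full = region_quality(screen, elements, screen)
print(tight > full)  # the tight crop around the relevant element wins
```

Under these assumed definitions, a tight crop around the relevant element outscores the full screenshot because it drops the low-relevance element and covers far less area, which is the pruning behavior the reward is described as encouraging.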

In practice

Integrate DRS-GUI as a plug-and-play MLLM enhancement.
Use MCTS with a budget of N=8 iterations for efficiency.
Prioritize interactive elements in region relevance scoring.
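
The practical points above can be sketched as a small search loop in Python. This is a toy under stated assumptions, not the authors' planner: the geometric definitions of Focus, Shift, and Scatter, the `interactive_bonus` weighting, the depth limit, and the choice to score each node directly instead of running rollouts are all hypothetical. Only the three action names and the budget of N=8 iterations come from the summary.

```python
import math
import random

N_ITERATIONS = 8  # search budget quoted in the summary
ACTIONS = ("focus", "shift", "scatter")


def apply_action(box, action, screen, rng):
    """Hypothetical geometry for the three perceptual actions."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    if action == "focus":    # zoom into the central half of the region
        return (x0 + w / 4, y0 + h / 4, x1 - w / 4, y1 - h / 4)
    if action == "shift":    # slide right by half a width, clamped to the screen
        dx = min(w / 2, screen[2] - x1)
        return (x0 + dx, y0, x1 + dx, y1)
    nx = rng.uniform(screen[0], screen[2] - w)  # scatter: jump to a random window
    ny = rng.uniform(screen[1], screen[3] - h)
    return (nx, ny, nx + w, ny + h)


def make_reward(elements, interactive_bonus=2.0):
    """elements: (cx, cy, relevance, is_interactive). The bonus is an assumption."""
    def reward(box):
        x0, y0, x1, y1 = box
        scores = [rel * (interactive_bonus if inter else 1.0)
                  for cx, cy, rel, inter in elements
                  if x0 <= cx <= x1 and y0 <= cy <= y1]
        return sum(scores) / len(scores) if scores else 0.0
    return reward


class Node:
    def __init__(self, box, parent=None):
        self.box, self.parent = box, parent
        self.children = {}  # action name -> Node
        self.visits, self.value = 0, 0.0


def ucb(child, parent_visits, c=1.4):
    mean = child.value / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def search(root_box, reward, screen, iterations=N_ITERATIONS, max_depth=3, seed=0):
    rng = random.Random(seed)
    root = Node(root_box)
    best_box, best_r = root_box, reward(root_box)
    for _ in range(iterations):
        node, depth = root, 0
        # Selection: follow UCB while every action has already been tried.
        while len(node.children) == len(ACTIONS) and depth < max_depth:
            node = max(node.children.values(), key=lambda ch: ucb(ch, node.visits))
            depth += 1
        # Expansion: try the first untested action from this node.
        if depth < max_depth:
            action = next(a for a in ACTIONS if a not in node.children)
            child = Node(apply_action(node.box, action, screen, rng), node)
            node.children[action] = child
            node = child
        # Step-wise evaluation: score the region itself, with no rollout.
        r = reward(node.box)
        if r > best_r:
            best_box, best_r = node.box, r
        # Backpropagation.
        while node is not None:
            node.visits += 1
            node.value += r
            node = node.parent
    return best_box, best_r


screen = (0.0, 0.0, 100.0, 100.0)
elements = [(50, 50, 0.9, True), (5, 5, 0.2, False), (95, 95, 0.2, False)]
best_box, best_r = search(screen, make_reward(elements), screen)
print(best_box, best_r)
```

In this toy setup the first Focus step already isolates the interactive element, so the small budget is enough. On real screenshots the reward would come from the MLLM or a detector, and the budget trades accuracy against inference cost.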

Topics

GUI Grounding
Dynamic Region Search
Multimodal Large Language Models
Monte Carlo Tree Search
UI Perceptor

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.