AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Advanced, extended

Summary

AQuaUI is a training-free, inference-time token reduction method for Graphical User Interface (GUI) agent models. It targets the excessive number of visual tokens that high-resolution screenshots produce. The method exploits the non-uniform spatial information density of GUI screenshots by constructing an adaptive quadtree and retaining one representative merged token per leaf. To improve temporal consistency across multi-step interactions, AQuaUI also uses a conditional quadtree algorithm that refines partitions using previous screenshots. In experiments on state-of-the-art GUI agent models, including GUI-Owl-1.5-32B-Instruct, it achieves up to 13.22% speedup and 29.52% fewer visual tokens while preserving 99.06% of full-token performance. On standard grounding and navigation benchmarks, the method consistently shows better accuracy-efficiency trade-offs than prior baselines. Latency gains are more pronounced on larger models such as Qwen3-VL-8B and Qwen3-VL-32B.

Key takeaway

AI engineers developing or deploying GUI agent models should consider integrating AQuaUI to reduce inference cost and improve throughput. This training-free method offers up to 13.22% speedup and 29.52% fewer visual tokens on models like GUI-Owl-1.5-32B-Instruct, while preserving 99.06% of full-token performance. It is especially beneficial for larger backbones, where language model computation dominates the overhead. Because no retraining is required, it can be evaluated directly on an existing agent.

Key insights

AQuaUI reduces visual tokens for GUI agents by exploiting screenshot spatial redundancy with adaptive quadtrees.

Principles

Method

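AQuaUI builds an adaptive quadtree over each GUI screenshot and retains one representative merged token per leaf, so uniform regions collapse into a few tokens while detailed regions stay finely partitioned. For multi-step interactions, a conditional quadtree refines the partition using previous screenshots to keep the token layout temporally consistent.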
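The single-frame idea can be illustrated with a minimal sketch. It is not the paper's implementation: the variance-based split rule, the `threshold` parameter, mean-pooling as the merge step, and the use of one scalar per patch in place of a real token embedding are all assumptions made for illustration. The conditional (multi-frame) variant is not covered here.

```python
from statistics import pvariance


def region_values(grid, r0, c0, h, w):
    """Collect the patch values inside a rectangular region of the grid."""
    return [grid[r][c] for r in range(r0, r0 + h) for c in range(c0, c0 + w)]


def build_quadtree(grid, threshold, r0=0, c0=0, h=None, w=None):
    """Return leaf regions (row, col, height, width) that tile the patch grid.

    Assumed split rule: a region is subdivided while the variance of its
    patch values exceeds `threshold` (a hypothetical criterion).
    """
    if h is None:
        h, w = len(grid), len(grid[0])
    values = region_values(grid, r0, c0, h, w)
    # Stop at a single patch or when the region is (nearly) uniform.
    if (h == 1 and w == 1) or pvariance(values) <= threshold:
        return [(r0, c0, h, w)]
    # Split into up to four children; halves are uneven for odd sizes.
    h1, w1 = max(h // 2, 1), max(w // 2, 1)
    leaves = []
    for rr, hh in ((r0, h1), (r0 + h1, h - h1)):
        for cc, ww in ((c0, w1), (c0 + w1, w - w1)):
            if hh > 0 and ww > 0:
                leaves += build_quadtree(grid, threshold, rr, cc, hh, ww)
    return leaves


def merge_tokens(grid, leaves):
    """Keep one merged token per leaf (here: the mean of its patches)."""
    tokens = []
    for r0, c0, h, w in leaves:
        values = region_values(grid, r0, c0, h, w)
        tokens.append(sum(values) / len(values))
    return tokens


# Toy 4x4 "screenshot": mostly blank, with one detailed corner.
grid = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 9],
    [0, 0, 5, 3],
]
leaves = build_quadtree(grid, threshold=0.5)
tokens = merge_tokens(grid, leaves)
print(len(tokens), "tokens instead of", len(grid) * len(grid[0]))
```

In this toy grid the three blank quadrants each collapse to a single token, while the detailed quadrant is split down to individual patches, which is the kind of non-uniform budget the method relies on.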

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.