AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees

2026-05-21 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Advanced, extended

Summary

AQuaUI is a training-free, inference-time token reduction method for Graphical User Interface (GUI) agent models. It targets the excess visual tokens that high-resolution screenshots produce. Because information density in GUI screenshots is spatially non-uniform, the method builds an adaptive quadtree over the screenshot and keeps one representative merged token per leaf. For temporal consistency across multi-step interactions, AQuaUI also uses a conditional quadtree algorithm that refines partitions using previous screenshots. In experiments on state-of-the-art GUI agent models, including GUI-Owl-1.5-32B-Instruct, it achieves up to 13.22% speedup and 29.52% fewer visual tokens while preserving 99.06% of full-token performance. On standard grounding and navigation benchmarks it shows consistently better accuracy-efficiency trade-offs than prior baselines, with more pronounced latency gains on larger models such as Qwen3-VL-8B and Qwen3-VL-32B.

Key takeaway

AI engineers developing or deploying GUI agent models should consider evaluating AQuaUI to reduce inference cost and improve throughput. The method is training-free and reports up to 13.22% speedup and 29.52% fewer visual tokens on models like GUI-Owl-1.5-32B-Instruct while preserving 99.06% of full-token performance. It is especially beneficial for larger backbones, where language model computation dominates overhead, and it requires no retraining.

Key insights

AQuaUI reduces visual tokens for GUI agents by exploiting screenshot spatial redundancy with adaptive quadtrees.

Principles

GUI screenshots have non-uniform information density.
Spatial structure can be exploited for token reduction.
Temporal consistency improves multi-step GUI interactions.

Method

AQuaUI builds an adaptive quadtree on GUI screenshots, selecting one representative token per leaf. A conditional quadtree refines partitions using previous frames for temporal consistency.
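The adaptive quadtree idea can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes each visual patch is reduced to a single scalar, that a region is split when its population variance exceeds a threshold, and that the merged token for a leaf is the mean of its patches. The names build_quadtree, threshold, and min_size are hypothetical.

```python
from statistics import fmean, pvariance


def build_quadtree(grid, threshold, min_size=1):
    """Partition a 2D grid of patch values into low-variance leaves.

    Assumed for illustration: scalar patches, a variance split rule, and
    mean merging. min_size must be at least 1.
    Returns a list of (row, col, height, width, merged_value) tuples.
    """
    leaves = []

    def visit(r, c, h, w):
        cells = [grid[i][j] for i in range(r, r + h) for j in range(c, c + w)]
        # Stop when the region is uniform enough or too small to split.
        if pvariance(cells) <= threshold or max(h, w) <= min_size:
            leaves.append((r, c, h, w, fmean(cells)))
            return
        # Otherwise split into up to four children and recurse.
        h2, w2 = max(h // 2, 1), max(w // 2, 1)
        for rr, hh in ((r, h2), (r + h2, h - h2)):
            for cc, ww in ((c, w2), (c + w2, w - w2)):
                if hh > 0 and ww > 0:
                    visit(rr, cc, hh, ww)

    visit(0, 0, len(grid), len(grid[0]))
    return leaves


# Toy 4x4 "screenshot": detail in the top-left corner, blank elsewhere.
screen = [
    [0, 9, 0, 0],
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
leaves = build_quadtree(screen, threshold=0.5)
print(len(leaves), "merged tokens instead of", len(screen) * len(screen[0]))
```

In this toy case the busy corner is kept at full resolution while each blank quadrant collapses to a single leaf, which is the kind of non-uniform density the method is described as exploiting.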

In practice

Apply adaptive quadtrees for GUI token reduction.
Use conditional quadtrees for multi-step GUI tasks.
Prioritize variance-based splitting for stability.
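One way to picture the conditional quadtree for multi-step tasks is sketched below. The summary does not specify how previous screenshots are used, so this is only one possible reading: the previous step's leaves are reused as the starting partition, and a leaf is subdivided further only if its region in the current frame no longer passes the variance test. The functions split and conditional_quadtree, the scalar patch grid, and the threshold value are all hypothetical.

```python
from statistics import pvariance


def split(grid, r, c, h, w, threshold, out):
    """Recursively split a region until its variance is at most threshold."""
    cells = [grid[i][j] for i in range(r, r + h) for j in range(c, c + w)]
    if pvariance(cells) <= threshold or max(h, w) <= 1:
        out.append((r, c, h, w))
        return
    h2, w2 = max(h // 2, 1), max(w // 2, 1)
    for rr, hh in ((r, h2), (r + h2, h - h2)):
        for cc, ww in ((c, w2), (c + w2, w - w2)):
            if hh > 0 and ww > 0:
                split(grid, rr, cc, hh, ww, threshold, out)


def conditional_quadtree(prev_leaves, curr_grid, threshold):
    """Assumed refinement rule: keep each previous leaf that is still
    uniform in the current frame, and re-split the ones that are not."""
    leaves = []
    for r, c, h, w in prev_leaves:
        split(curr_grid, r, c, h, w, threshold, leaves)
    return leaves


# Step 1: detail only in the top-left corner.
frame1 = [
    [0, 9, 0, 0],
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
prev_leaves = []
split(frame1, 0, 0, 4, 4, 0.5, prev_leaves)

# Step 2: a new element appears in the bottom-right corner.
frame2 = [row[:] for row in frame1]
frame2[3][3] = 9
curr_leaves = conditional_quadtree(prev_leaves, frame2, threshold=0.5)
print(len(prev_leaves), "leaves at step 1,", len(curr_leaves), "at step 2")
```

Under this assumed rule, leaves over unchanged regions keep the same boundaries from one step to the next, so only the region that changed is re-partitioned. The sketch never merges leaves back together, which a real implementation might also need to do.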

Topics

GUI Agents
Visual Token Reduction
Adaptive Quadtrees
Large Multimodal Models
Inference Optimization
Temporal Consistency

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.