AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees
Summary
AQuaUI is a training-free, inference-time token reduction method for Graphical User Interface (GUI) agent models. It targets the excessive number of visual tokens that high-resolution screenshots produce. The method exploits the non-uniform spatial information density of GUI screenshots by constructing an adaptive quadtree and retaining one representative merged token per leaf. To improve temporal consistency across multi-step interactions, AQuaUI also uses a conditional quadtree algorithm that refines partitions using previous screenshots. In experiments on state-of-the-art GUI agent models, including GUI-Owl-1.5-32B-Instruct, it achieves up to 13.22% speedup and 29.52% fewer visual tokens while preserving 99.06% of full-token performance. On standard grounding and navigation benchmarks it consistently shows better accuracy-efficiency trade-offs than prior baselines, and latency gains are more pronounced on larger models such as Qwen3-VL-8B and Qwen3-VL-32B.
Key takeaway
AI engineers developing or deploying GUI agent models should consider integrating AQuaUI to reduce inference cost and improve throughput. This training-free method offers up to 13.22% speedup and 29.52% fewer visual tokens on models such as GUI-Owl-1.5-32B-Instruct while preserving 99.06% of full-token performance. It is especially beneficial for larger backbones, where language model computation dominates the overhead. Because no retraining is required, it can be evaluated directly on an existing agent.
Key insights
AQuaUI reduces visual tokens for GUI agents by exploiting screenshot spatial redundancy with adaptive quadtrees.
Principles
- GUI screenshots have non-uniform information density.
- Spatial structure can be exploited for token reduction.
- Temporal consistency improves multi-step GUI interactions.
Method
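AQuaUI builds an adaptive quadtree over each GUI screenshot and keeps one representative merged token per leaf, so uniform regions collapse into few tokens while detailed regions keep more. For multi-step tasks, a conditional quadtree refines the partition using previous screenshots to keep it temporally consistent.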
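The quadtree idea can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes the visual tokens are available as an (H, W, D) NumPy grid, that a region is split when its feature variance exceeds a threshold, and that the representative token of a leaf is the mean of its tokens. The names build_quadtree, reduce_tokens, and var_thresh are hypothetical.

```python
import numpy as np

def build_quadtree(feats, r0, c0, h, w, var_thresh):
    """Return quadtree leaves as (row, col, height, width) tuples.

    Assumption: a region is split when the variance of its token
    features (averaged over channels) exceeds var_thresh.
    """
    block = feats[r0:r0 + h, c0:c0 + w]
    variance = block.reshape(-1, block.shape[-1]).var(axis=0).mean()
    if variance <= var_thresh or (h == 1 and w == 1):
        return [(r0, c0, h, w)]
    # Halve each side that is longer than one token.
    rows = [(r0, h)] if h == 1 else [(r0, h // 2), (r0 + h // 2, h - h // 2)]
    cols = [(c0, w)] if w == 1 else [(c0, w // 2), (c0 + w // 2, w - w // 2)]
    leaves = []
    for rs, rh in rows:
        for cs, cw in cols:
            leaves.extend(build_quadtree(feats, rs, cs, rh, cw, var_thresh))
    return leaves

def reduce_tokens(feats, var_thresh):
    """Merge each leaf into one token (assumption: mean pooling)."""
    height, width, _ = feats.shape
    leaves = build_quadtree(feats, 0, 0, height, width, var_thresh)
    merged = np.stack(
        [feats[r:r + h, c:c + w].mean(axis=(0, 1)) for r, c, h, w in leaves]
    )
    return merged, leaves

# Toy 4x4 token grid: flat everywhere except one token in the top-left corner.
grid = np.zeros((4, 4, 2))
grid[0, 0] = 1.0
merged, leaves = reduce_tokens(grid, var_thresh=0.01)
print(len(leaves), merged.shape)
```

In this toy grid only the top-left quadrant is subdivided down to single tokens, while the three flat quadrants each collapse into one merged token, so the 16 input tokens shrink to 7.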
In practice
- Apply adaptive quadtrees for GUI token reduction.
- Use conditional quadtrees for multi-step GUI tasks.
- Prioritize variance-based splitting for stability.
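One way to picture the conditional quadtree for multi-step tasks is sketched below in Python. The summary does not specify how previous screenshots enter the algorithm, so the rule used here is purely an assumption: a region is split if its feature variance is high in either the current or the previous token grid, which keeps the partition aligned across consecutive steps. The names block_variance, conditional_quadtree, and var_thresh are hypothetical.

```python
import numpy as np

def block_variance(feats, r0, c0, h, w):
    """Variance of the token features in a region, averaged over channels."""
    block = feats[r0:r0 + h, c0:c0 + w]
    return float(block.reshape(-1, block.shape[-1]).var(axis=0).mean())

def conditional_quadtree(curr, prev, r0, c0, h, w, var_thresh):
    """Return leaves (row, col, height, width) conditioned on the previous frame.

    Assumption (not taken from the paper): split when either the current
    or the previous frame has high variance inside the region.
    """
    score = max(
        block_variance(curr, r0, c0, h, w),
        block_variance(prev, r0, c0, h, w),
    )
    if score <= var_thresh or (h == 1 and w == 1):
        return [(r0, c0, h, w)]
    # Halve each side that is longer than one token.
    rows = [(r0, h)] if h == 1 else [(r0, h // 2), (r0 + h // 2, h - h // 2)]
    cols = [(c0, w)] if w == 1 else [(c0, w // 2), (c0 + w // 2, w - w // 2)]
    leaves = []
    for rs, rh in rows:
        for cs, cw in cols:
            leaves.extend(
                conditional_quadtree(curr, prev, rs, cs, rh, cw, var_thresh)
            )
    return leaves

# Two consecutive 4x4 token grids with detail in different corners.
prev_grid = np.zeros((4, 4, 2))
prev_grid[0, 0] = 1.0
curr_grid = np.zeros((4, 4, 2))
curr_grid[3, 3] = 1.0

leaves = conditional_quadtree(curr_grid, prev_grid, 0, 0, 4, 4, var_thresh=0.01)
print(len(leaves))
```

Under this assumed rule, the corner that was detailed in the previous step stays finely partitioned even though it is flat in the current step, so the example yields 10 leaves rather than the 7 a single-frame quadtree would produce.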
Topics
- GUI Agents
- Visual Token Reduction
- Adaptive Quadtrees
- Large Multimodal Models
- Inference Optimization
- Temporal Consistency
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.