ScreenSearch: Uncertainty-Aware OS Exploration

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

ScreenSearch is a system for uncertainty-aware exploration of desktop operating systems (OS). It targets partial observability in GUI agents, where visually similar screens can correspond to different underlying states. The system combines structural screen retrieval and deduplication with an ambiguity-aware PUCT graph-bandit to explore large desktop environments. Its retrieval layer converts UI Automation (UIA) trees into location-aware structural features, indexes related screens, and maintains a shared, deduplicated state graph across multiple virtual machine workers. ScreenSearch defines a scalable ambiguity signal based on matched-action outcome dispersion: when similar screens lead to different next states under the same action, the state is probed further. Across 11 desktop applications, ScreenSearch collected over 1 million screenshots and more than 30,000 deduplicated states. The evaluation shows a trade-off between novelty and ambiguity reduction, indicating that effective exploration needs both, and that stronger proposal priors significantly improve unique-state discovery.
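
The ambiguity signal described above can be illustrated with a minimal sketch. It assumes dispersion is measured as the normalized Shannon entropy of the next states observed for one action across a group of similar screens; the summary does not give the paper's exact formula, and the function names `outcome_dispersion` and `cluster_ambiguity` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def outcome_dispersion(next_states):
    """Normalized Shannon entropy of observed next-state IDs, in [0, 1].

    Assumption: entropy stands in for "dispersion"; the paper may use
    a different statistic.
    """
    counts = Counter(next_states)
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0  # a single outcome (or no data) is unambiguous
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))


def cluster_ambiguity(transitions):
    """Mean dispersion over actions tried at least twice.

    transitions: iterable of (action, next_state_id) pairs observed from
    screens that the retrieval layer considers similar.
    """
    by_action = defaultdict(list)
    for action, next_state in transitions:
        by_action[action].append(next_state)
    scores = [outcome_dispersion(v) for v in by_action.values() if len(v) >= 2]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    observed = [
        ("click:Save", "dialog"),
        ("click:Save", "dialog"),
        ("click:Open", "picker"),
        ("click:Open", "error"),
    ]
    print(cluster_ambiguity(observed))
```

In this toy input, "click:Save" always leads to the same state while "click:Open" splits between two, so only the second action contributes ambiguity and the group would be a candidate for further probing.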

Key takeaway

Researchers building GUI agents should combine frontier expansion with ambiguity reduction in their exploration strategies. Optimizing for novelty alone tends to reward superficial UI changes, while optimizing only for ambiguity reduction keeps exploration overly local. The ability to distinguish visually similar but functionally different states, for example with matched-action outcome dispersion, is critical for robust and efficient desktop navigation.

Key insights

Effective GUI exploration requires balancing screen novelty with ambiguity reduction to navigate partial observability.
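
One way to picture this balance is a PUCT-style selection rule with an added ambiguity bonus. The sketch below uses the standard PUCT form (mean value plus a prior-weighted exploration term) and adds a weighted ambiguity term. The additive bonus, the `beta` weight, the `c_puct` default, and the `Edge` fields are all assumptions for illustration, not the scoring rule from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class Edge:
    action: str
    prior: float             # proposal prior P(s, a)
    visits: int = 0          # N(s, a)
    value_sum: float = 0.0   # accumulated novelty reward (assumed)
    ambiguity: float = 0.0   # e.g. outcome dispersion in [0, 1] (assumed)


def puct_score(edge, parent_visits, c_puct=1.5, beta=0.5):
    """Standard PUCT terms plus a hypothetical ambiguity bonus."""
    q = edge.value_sum / edge.visits if edge.visits else 0.0
    u = c_puct * edge.prior * math.sqrt(parent_visits) / (1 + edge.visits)
    return q + u + beta * edge.ambiguity


def select_action(edges, c_puct=1.5, beta=0.5):
    """Pick the outgoing edge with the highest score at one graph node."""
    parent_visits = sum(e.visits for e in edges) + 1
    return max(edges, key=lambda e: puct_score(e, parent_visits, c_puct, beta))


if __name__ == "__main__":
    edges = [
        Edge("click:File", prior=0.5, visits=10, value_sum=1.0, ambiguity=0.0),
        Edge("click:Edit", prior=0.5, visits=10, value_sum=1.0, ambiguity=0.9),
    ]
    print(select_action(edges).action)
```

With `beta` set to zero the rule reduces to plain PUCT and favors unvisited, high-prior actions; raising `beta` shifts visits toward actions whose outcomes are still inconsistent.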

Method

ScreenSearch combines structural screen retrieval, deduplication, and an ambiguity-aware PUCT graph-bandit. It converts UIA trees into location-aware structural features, indexes related screens, and maintains a shared, deduplicated state graph across virtual machine workers. Novelty and ambiguity signals then guide which states to explore next.
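
The retrieval and deduplication step can be sketched as follows. The sketch assumes a UIA-like tree represented as nested dicts with hypothetical `control_type`, `rect`, and `children` keys, quantizes each element's center onto a coarse grid to get location-aware features, and deduplicates by exact hash. The paper's actual featurization is not given in this summary, and exact hashing is a simplification of similarity-based retrieval.

```python
import hashlib


def structural_features(node, screen_w, screen_h, grid=8, depth=0, out=None):
    """Flatten a UIA-like tree into (depth, control_type, grid_x, grid_y) tuples."""
    if out is None:
        out = []
    left, top, right, bottom = node["rect"]
    cx = (left + right) / 2
    cy = (top + bottom) / 2
    # Quantize the element center so small pixel shifts map to the same cell.
    gx = min(grid - 1, max(0, int(cx / screen_w * grid)))
    gy = min(grid - 1, max(0, int(cy / screen_h * grid)))
    out.append((depth, node["control_type"], gx, gy))
    for child in node.get("children", []):
        structural_features(child, screen_w, screen_h, grid, depth + 1, out)
    return out


def fingerprint(node, screen_w, screen_h, grid=8):
    """Order-independent hash of the structural features (text is ignored)."""
    feats = sorted(structural_features(node, screen_w, screen_h, grid))
    return hashlib.sha256(repr(feats).encode("utf-8")).hexdigest()


class StateGraph:
    """Toy shared state table: one ID per distinct structural fingerprint."""

    def __init__(self):
        self.states = {}  # fingerprint -> state id

    def add_screen(self, tree, screen_w, screen_h):
        key = fingerprint(tree, screen_w, screen_h)
        return self.states.setdefault(key, len(self.states))


if __name__ == "__main__":
    win = {"control_type": "Window", "rect": (0, 0, 800, 600), "children": [
        {"control_type": "Button", "rect": (10, 10, 90, 40), "name": "Save"},
    ]}
    graph = StateGraph()
    print(graph.add_screen(win, 800, 600))
```

Because element text is ignored and positions are quantized, two screens that differ only in labels or by a few pixels collapse to one state. That is exactly the situation in which a separate ambiguity signal is needed to tell functionally different states apart.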

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.