ScreenSearch: Uncertainty-Aware OS Exploration
Summary
ScreenSearch is a system for uncertainty-aware exploration of desktop operating systems (OS). It addresses partial observability in GUI agents, where visually similar screens can represent different underlying states. The system combines structural screen retrieval and deduplication with an ambiguity-aware PUCT graph-bandit to explore large-scale desktop environments. Its retrieval layer converts UIA trees into location-aware structural features, indexes related screens, and maintains a shared deduplicated state graph across multiple virtual machine workers. ScreenSearch defines a scalable ambiguity signal based on matched-action outcome dispersion: when similar screens yield different next states under the same action, those states are probed further. Across 11 desktop applications, ScreenSearch collected over 1 million screenshots and more than 30,000 deduplicated states, generating extensive exploration corpora. The evaluation shows a trade-off between novelty and ambiguity reduction, indicating that both matter for effective exploration, and that stronger proposal priors significantly improve unique-state discovery.
Key takeaway
Research scientists building GUI agents should combine frontier expansion with ambiguity reduction in their exploration strategies. Exploration driven by novelty alone tends to chase superficial UI changes, while exploration driven by ambiguity reduction alone can become overly localized. An agent's ability to distinguish visually similar but functionally different states, using techniques such as matched-action outcome dispersion, is critical for robust and efficient navigation of desktop environments.
Key insights
Effective GUI exploration requires balancing screen novelty with ambiguity reduction to navigate partial observability.
Principles
- Visually similar screens can hide different workflow states.
- Ambiguity arises from inconsistent outcomes for identical actions.
- Stronger priors improve state discovery efficiency.
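The second principle can be made concrete with a small sketch. The summary does not give the formula ScreenSearch uses for matched-action outcome dispersion, so the version below is an assumption: it scores a (screen cluster, action) pair by the normalized Shannon entropy of the next states observed after that action. The names `OutcomeTracker`, `record`, and `dispersion` are hypothetical.

```python
import math
from collections import Counter, defaultdict


class OutcomeTracker:
    """Tracks which next states follow each (screen cluster, action) pair.

    Hypothetical helper: the entropy-based score is an assumed stand-in
    for the paper's matched-action outcome dispersion.
    """

    def __init__(self):
        # (cluster_id, action) -> Counter of observed next-state ids
        self._outcomes = defaultdict(Counter)

    def record(self, cluster_id, action, next_state_id):
        self._outcomes[(cluster_id, action)][next_state_id] += 1

    def dispersion(self, cluster_id, action):
        """Return a score in [0, 1]; 0 means the action always led to the same state."""
        counts = self._outcomes.get((cluster_id, action))
        if not counts or len(counts) < 2:
            return 0.0
        total = sum(counts.values())
        entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
        # Normalize by the maximum entropy for this number of distinct outcomes.
        return entropy / math.log(len(counts))


tracker = OutcomeTracker()
tracker.record("save_dialog", "click_ok", "editor_saved")
tracker.record("save_dialog", "click_ok", "overwrite_prompt")
print(tracker.dispersion("save_dialog", "click_ok"))
```

A score near zero suggests the cluster behaves like a single state under that action; a high score suggests that visually similar screens are hiding different workflow states and may deserve further probing.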
Method
ScreenSearch uses structural screen retrieval, deduplication, and an ambiguity-aware PUCT graph-bandit. It converts UIA trees to structural features, indexes screens, and maintains a shared state graph, guiding exploration with novelty and ambiguity signals.
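As a rough illustration of how an ambiguity-aware PUCT rule might rank actions, the sketch below uses the standard PUCT form (mean value plus a prior-weighted exploration term) and adds an ambiguity bonus. The additive bonus, the constants `c_puct` and `ambiguity_weight`, and the names `EdgeStats`, `puct_score`, and `select_action` are assumptions for illustration, not the paper's definition. The mean value is assumed to be an average novelty reward.

```python
import math
from dataclasses import dataclass


@dataclass
class EdgeStats:
    """Per-action statistics at one node of the state graph (hypothetical layout)."""
    prior: float             # proposal prior for this action
    visits: int = 0          # times this action was taken from the node
    value_sum: float = 0.0   # accumulated novelty reward (assumed reward signal)
    ambiguity: float = 0.0   # outcome dispersion in [0, 1]

    @property
    def mean_value(self):
        return self.value_sum / self.visits if self.visits else 0.0


def puct_score(edge, parent_visits, c_puct=1.5, ambiguity_weight=0.5):
    """Standard PUCT score plus an assumed additive ambiguity bonus."""
    exploration = c_puct * edge.prior * math.sqrt(parent_visits) / (1 + edge.visits)
    return edge.mean_value + exploration + ambiguity_weight * edge.ambiguity


def select_action(edges, **kwargs):
    """Pick the action with the highest score among a node's outgoing edges."""
    # Use at least 1 so priors still matter before any visit is recorded.
    parent_visits = max(sum(e.visits for e in edges.values()), 1)
    return max(edges, key=lambda a: puct_score(edges[a], parent_visits, **kwargs))


edges = {
    "open_menu": EdgeStats(prior=0.5, visits=10, value_sum=2.0),
    "click_save": EdgeStats(prior=0.5, visits=10, value_sum=2.0, ambiguity=0.8),
}
print(select_action(edges))
```

With equal priors, visit counts, and values, the ambiguity term breaks the tie toward the action whose outcomes have been inconsistent. Setting `ambiguity_weight=0.0` reduces the rule to plain PUCT.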
In practice
- Use UIA trees for structural screen representation.
- Implement Jaccard overlap for near-duplicate verification.
- Combine novelty and ambiguity for exploration objectives.
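The first two items can be sketched together. The example below assumes the UIA tree has already been flattened into a list of dictionaries with `control_type`, `name`, and bounding-box fields; that input format, the grid-cell location feature, and the 0.9 threshold are all illustrative assumptions rather than details from the paper.

```python
def structural_features(elements, screen_w, screen_h, grid=4):
    """Turn flattened UI elements into a set of location-aware features.

    Each feature is (control type, name, grid row, grid column), where the
    grid cell comes from the element's center point. Hypothetical scheme.
    """
    feats = set()
    for el in elements:
        cx = el["left"] + el["width"] / 2
        cy = el["top"] + el["height"] / 2
        col = min(int(cx / screen_w * grid), grid - 1)
        row = min(int(cy / screen_h * grid), grid - 1)
        feats.add((el["control_type"], el["name"], row, col))
    return feats


def jaccard(a, b):
    """Jaccard overlap |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(a, b, threshold=0.9):
    """Treat two screens as the same state when their overlap is high enough."""
    return jaccard(a, b) >= threshold


edit_box = {"control_type": "Edit", "name": "File name",
            "left": 500, "top": 500, "width": 200, "height": 30}
ok_button = {"control_type": "Button", "name": "OK",
             "left": 0, "top": 0, "width": 100, "height": 40}
cancel_button = dict(ok_button, name="Cancel")

screen_a = structural_features([ok_button, edit_box], 1000, 1000)
screen_b = structural_features([cancel_button, edit_box], 1000, 1000)
print(jaccard(screen_a, screen_b), is_near_duplicate(screen_a, screen_b))
```

In a setup like this, an index would typically retrieve candidate screens first, and the Jaccard check would then verify whether a candidate is a true near-duplicate before the state graph merges it.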
Topics
- Desktop GUI Agents
- OS State Exploration
- Partial Observability
- Structural Screen Retrieval
- Ambiguity-Aware Search
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.