Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
Summary
GUI-SD is an on-policy self-distillation (OPSD) framework for Graphical User Interface (GUI) grounding, the task of mapping a natural language instruction to coordinates on a screen. It addresses two limitations of reinforcement learning methods such as GRPO: they need multiple expensive rollouts, and their reward signal is sparse. GUI-SD gives its teacher model a visually enriched privileged context, built from a target bounding box and a Gaussian soft mask, that provides informative guidance without revealing the exact coordinates. It also uses entropy-guided distillation, which weights each token by its digit significance and by teacher confidence so that optimization concentrates on the most impactful and reliable positions. Experiments on six GUI grounding benchmarks, including ScreenSpot-v2, UI-Vision, and OSWorld-G, show that GUI-SD consistently surpasses GRPO-based methods and naive OPSD in both accuracy and training efficiency.
Key takeaway
For research scientists developing autonomous GUI agents, GUI-SD offers a more efficient and accurate alternative to GRPO-based methods. Consider on-policy self-distillation with a visually enriched teacher context and an entropy-guided loss: it replaces sparse rewards with dense token-level supervision, which improves training efficiency and yields more precise coordinate predictions in GUI grounding tasks.
Key insights
GUI-SD improves GUI grounding via on-policy self-distillation with visual context and entropy-guided token weighting.
Principles
- Dense token-level supervision improves training efficiency.
- Privileged visual context guides without leaking exact coordinates.
- Adaptive weighting focuses optimization on significant tokens.
Method
GUI-SD constructs a visually enriched privileged context for the teacher by combining a target bounding box, a Gaussian soft mask, and an instructional hint. It then applies an entropy-guided distillation loss that weights each token by its position-based digit significance and by teacher confidence.
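As a rough illustration of the Gaussian soft mask idea, the sketch below builds a mask that peaks at the center of a target bounding box and fades smoothly outward, then uses it to dim a grayscale screenshot. This is only one plausible reading of the description, not the paper's implementation: the function names, the `sigma_scale` and `floor` parameters, and the choice to dim pixels are all assumptions made for the example.

```python
import math


def gaussian_soft_mask(width, height, box, sigma_scale=0.5):
    """Return a height x width grid of weights in [0, 1], peaking at the box center.

    box is (x1, y1, x2, y2) in pixel coordinates. sigma_scale is a hypothetical
    knob: the standard deviation on each axis is sigma_scale * box side length.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    # Guard against degenerate boxes so sigma is never zero.
    sx = max((x2 - x1) * sigma_scale, 1e-6)
    sy = max((y2 - y1) * sigma_scale, 1e-6)
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            d2 = ((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2
            row.append(math.exp(-0.5 * d2))
        mask.append(row)
    return mask


def apply_mask(gray, mask, floor=0.3):
    """Dim a grayscale image (rows of floats) away from the target region.

    floor (an assumed parameter) keeps the rest of the screen partly visible,
    so the hint highlights a region instead of marking a single pixel.
    """
    return [
        [p * (floor + (1.0 - floor) * m) for p, m in zip(prow, mrow)]
        for prow, mrow in zip(gray, mask)
    ]


if __name__ == "__main__":
    mask = gaussian_soft_mask(8, 6, (2, 1, 6, 5))
    for row in mask:
        print(" ".join(f"{v:.2f}" for v in row))
```

Because the mask decays gradually, it points the teacher toward a region without encoding a single exact click point, which matches the stated goal of guiding without leaking coordinates.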
In practice
- Use bounding boxes and soft masks for visual guidance.
- Prioritize higher-order digits in coordinate prediction.
- Modulate supervision strength by teacher confidence.
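The last two points can be sketched as a per-token weight that multiplies a digit-significance term by a teacher-confidence term. The code below is a minimal sketch under stated assumptions, not the paper's loss: the geometric `decay` over digit rank, the confidence defined as one minus normalized entropy, and the use of KL(teacher || student) are all illustrative choices, and every function name is hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def token_weights(teacher_dists, digit_ranks, decay=0.5):
    """Weight each coordinate token by digit significance and teacher confidence.

    teacher_dists: one probability distribution per coordinate token.
    digit_ranks: 0 for the most significant digit, 1 for the next, and so on.
    decay: assumed factor by which each lower-order digit counts less.
    """
    weights = []
    for probs, rank in zip(teacher_dists, digit_ranks):
        max_entropy = math.log(len(probs))
        # Confidence is 1 for a one-hot teacher and 0 for a uniform teacher.
        confidence = 1.0 - entropy(probs) / max_entropy if max_entropy > 0 else 1.0
        significance = decay ** rank
        weights.append(significance * confidence)
    return weights


def weighted_kl(teacher_dists, student_dists, weights, eps=1e-12):
    """Weighted average of per-token KL(teacher || student)."""
    total = 0.0
    for t, s, w in zip(teacher_dists, student_dists, weights):
        kl = sum(tp * math.log(tp / max(sp, eps)) for tp, sp in zip(t, s) if tp > 0.0)
        total += w * kl
    norm = sum(weights)
    return total / norm if norm > 0 else 0.0


if __name__ == "__main__":
    confident = [0.97, 0.01, 0.01, 0.01]
    unsure = [0.25, 0.25, 0.25, 0.25]
    teacher = [confident, confident, unsure]
    student = [unsure, unsure, unsure]
    w = token_weights(teacher, digit_ranks=[0, 1, 2])
    print("weights:", w)
    print("loss:", weighted_kl(teacher, student, w))
```

With this weighting, a hundreds digit the teacher is sure about dominates the loss, while a low-order digit the teacher is unsure about contributes little or nothing.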
Topics
- GUI Grounding
- On-Policy Self-Distillation
- GRPO Methods
- Visually Privileged Context
- Entropy-Guided Distillation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.