Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding

2026-05-04 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, medium

Summary

GUI-SD is an on-policy self-distillation (OPSD) framework for Graphical User Interface (GUI) grounding, the task of mapping natural language instructions to coordinates on the screen. It targets two limitations of reinforcement learning methods such as GRPO: they require multiple expensive rollouts and learn from sparse reward signals. GUI-SD gives its teacher model a visually enriched privileged context, built from a target bounding box and a Gaussian soft mask, that provides informative guidance without revealing the exact coordinates. It also uses entropy-guided distillation, which weights each token by digit significance and teacher confidence so that optimization concentrates on the most impactful and reliable positions. Experiments on six GUI grounding benchmarks, including ScreenSpot-v2, UI-Vision, and OSWorld-G, show that GUI-SD consistently surpasses GRPO-based methods and naive OPSD in both accuracy and training efficiency.

Key takeaway

For research scientists developing autonomous GUI agents, GUI-SD offers a more efficient and accurate alternative to GRPO-based methods. Consider on-policy self-distillation with a visually enriched teacher context and an entropy-guided loss when sparse rewards and rollout cost limit GUI grounding training. The reported gains are in both the accuracy of predicted coordinates and training efficiency.

Key insights

GUI-SD improves GUI grounding via on-policy self-distillation with visual context and entropy-guided token weighting.

Principles

Dense token-level supervision improves training efficiency.
Privileged visual context guides without leaking exact coordinates.
Adaptive weighting focuses optimization on significant tokens.

Method

GUI-SD constructs a visually enriched privileged context using bounding boxes and Gaussian soft masks, combined with an instructional hint. It then applies an entropy-guided distillation loss, weighting tokens by position-based significance and teacher confidence.
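The summary does not give the exact form of the Gaussian soft mask, so the following is only a minimal sketch of one plausible construction. It assumes the mask is centered on the target bounding box with a spread proportional to the box size, and that it is applied by dimming pixels far from the target. The function names and the parameters sigma_scale and floor are hypothetical, not taken from the paper.

```python
import numpy as np

def gaussian_soft_mask(height, width, bbox, sigma_scale=0.5):
    """Return an (height, width) mask in (0, 1] peaking at the bbox center.

    bbox is (x1, y1, x2, y2) in pixels. sigma_scale is an assumed
    hyperparameter tying the Gaussian spread to the box size.
    """
    x1, y1, x2, y2 = bbox
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    # Keep sigma at least one pixel so very small boxes stay well defined.
    sx = max((x2 - x1) * sigma_scale, 1.0)
    sy = max((y2 - y1) * sigma_scale, 1.0)
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-0.5 * (((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2))

def apply_soft_mask(image, mask, floor=0.3):
    """Dim an (H, W, 3) float image away from the target region.

    floor is an assumed minimum brightness so the rest of the
    screenshot stays visible to the teacher.
    """
    weight = floor + (1.0 - floor) * mask
    return image * weight[..., None]
```

Because the mask falls off smoothly, the highlighted region suggests where the target lies without printing its coordinates, which matches the stated goal of guiding the teacher without leaking the answer.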

In practice

Use bounding boxes and soft masks for visual guidance.
Prioritize higher-order digits in coordinate prediction.
Modulate supervision strength by teacher confidence.
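The last two points can be sketched as a weighted token-level distillation loss. This is an illustrative sketch rather than the paper's formulation: it assumes a KL divergence between teacher and student distributions, a geometric decay over digit rank (0 for the most significant digit of a coordinate), and a confidence term defined as one minus the teacher's normalized entropy. The function name, the decay parameter, and the way the two weights are multiplied are all assumptions.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy_guided_loss(student_logits, teacher_logits, digit_rank, decay=0.5):
    """Weighted per-token KL(teacher || student).

    student_logits, teacher_logits: arrays of shape (T, V).
    digit_rank: int array of shape (T,), 0 = most significant digit.
    decay: assumed factor by which each lower-order digit is down-weighted.
    Returns (scalar loss, per-token weights).
    """
    eps = 1e-12
    p_t = softmax(teacher_logits)
    p_s = softmax(student_logits)
    kl = np.sum(p_t * (np.log(p_t + eps) - np.log(p_s + eps)), axis=-1)
    # Teacher confidence: 1 for a one-hot teacher, about 0 for a uniform one.
    entropy = -np.sum(p_t * np.log(p_t + eps), axis=-1)
    confidence = np.clip(1.0 - entropy / np.log(p_t.shape[-1]), 0.0, 1.0)
    # Digit significance: higher-order digits receive larger weights.
    significance = decay ** np.asarray(digit_rank)
    weights = significance * confidence
    loss = np.sum(weights * kl) / (np.sum(weights) + eps)
    return loss, weights
```

The motivation for the significance term is arithmetic: in a coordinate such as 742, a wrong hundreds digit moves the predicted click by 100 pixels or more, while a wrong units digit moves it by at most 9, so errors in early digit tokens cost far more than errors in late ones.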

Topics

GUI Grounding
On-Policy Self-Distillation
GRPO Methods
Visually Privileged Context
Entropy-Guided Distillation

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.