GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models
Summary
GUI grounding models report over 85% accuracy on standard benchmarks, yet their accuracy drops by 27-56 percentage points when an instruction requires spatial reasoning rather than identifying the target element directly. Existing benchmarks miss this weakness because they pair each screenshot with a single, fixed instruction. The researchers introduce GUI-Perturbed, a controlled perturbation framework that varies visual scenes and instructions independently to measure grounding robustness. In an evaluation of three 7B models from the same architectural family, relational instructions caused a systematic accuracy collapse in every model. A 70% browser zoom produced a statistically significant performance drop, and rank-8 LoRA fine-tuning on augmented data degraded performance rather than improving it.
Key takeaway
If you develop or deploy GUI grounding models, prioritize robustness testing beyond standard benchmarks. Your models likely struggle with spatial reasoning and with visual perturbations such as browser zoom, which single-instruction benchmarks do not measure. Use a framework like GUI-Perturbed to diagnose specific weaknesses, and check any fine-tuning strategy on perturbed inputs, since it can inadvertently degrade performance.
Key insights
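GUI grounding models that score well on standard benchmarks are brittle: accuracy falls sharply under relational (spatial) instructions and degrades under visual perturbations such as a 70% browser zoom.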
Principles
- Benchmarks need varied instructions.
- Spatial reasoning is a key weakness.
- Augmentation can degrade performance.
Method
GUI-Perturbed varies visual scenes and instructions independently to measure grounding robustness, so that a drop in accuracy can be attributed to a specific capability axis, such as spatial reasoning or visual robustness.
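A minimal sketch of this kind of per-condition breakdown, not the paper's implementation. It assumes a common GUI grounding convention in which a prediction counts as correct when the predicted click point falls inside the target bounding box. The `Sample` fields, the variant names (`"original"`, `"zoom_70"`, `"direct"`, `"relational"`), and both function names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    scene: str        # visual variant label (hypothetical), e.g. "original" or "zoom_70"
    instruction: str  # instruction variant label (hypothetical), e.g. "direct" or "relational"
    bbox: tuple       # target box as (left, top, right, bottom) in pixels
    pred: tuple       # model's predicted click point as (x, y)


def hit(sample):
    """Assumed metric: the predicted point lies inside the target box."""
    left, top, right, bottom = sample.bbox
    x, y = sample.pred
    return left <= x <= right and top <= y <= bottom


def accuracy_by_condition(samples):
    """Return accuracy for each (scene, instruction) cell of the grid."""
    totals = defaultdict(lambda: [0, 0])  # cell -> [correct, total]
    for s in samples:
        cell = totals[(s.scene, s.instruction)]
        cell[0] += int(hit(s))
        cell[1] += 1
    return {key: correct / total for key, (correct, total) in totals.items()}


if __name__ == "__main__":
    toy = [
        Sample("original", "direct", (0, 0, 10, 10), (5, 5)),
        Sample("original", "relational", (0, 0, 10, 10), (20, 5)),
        Sample("zoom_70", "direct", (0, 0, 7, 7), (3, 3)),
        Sample("zoom_70", "direct", (0, 0, 7, 7), (9, 9)),
    ]
    for condition, acc in sorted(accuracy_by_condition(toy).items()):
        print(condition, f"{acc:.2f}")
```

Keeping the scene and instruction labels as separate keys is what lets one axis be compared while the other is held fixed.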
In practice
- Test models with relational instructions.
- Evaluate performance under UI zoom.
- Re-evaluate LoRA fine-tuning strategies.
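The checks above can be sketched as a paired comparison between a baseline run and a perturbed run over the same items (for example, direct vs. relational instructions, 100% vs. 70% zoom, or before vs. after fine-tuning). This is an illustrative sketch: the summary does not say which significance test the study used, so the exact McNemar test here is an assumption, and the function names are hypothetical.

```python
from math import comb


def pp_drop(baseline_hits, perturbed_hits):
    """Accuracy drop in percentage points (baseline minus perturbed)."""
    if len(baseline_hits) != len(perturbed_hits) or not baseline_hits:
        raise ValueError("need two non-empty, equal-length result lists")
    n = len(baseline_hits)
    return 100.0 * (sum(baseline_hits) - sum(perturbed_hits)) / n


def mcnemar_exact(baseline_hits, perturbed_hits):
    """Two-sided exact McNemar p-value on paired correct/incorrect outcomes.

    Only discordant pairs matter: items correct in exactly one condition.
    """
    if len(baseline_hits) != len(perturbed_hits):
        raise ValueError("result lists must be paired item by item")
    pairs = list(zip(baseline_hits, perturbed_hits))
    b = sum(1 for base, pert in pairs if base and not pert)  # broke under perturbation
    c = sum(1 for base, pert in pairs if not base and pert)  # fixed under perturbation
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    baseline = [1] * 10             # toy data: all ten items correct
    perturbed = [1] * 4 + [0] * 6   # toy data: six of them fail when perturbed
    print(f"drop: {pp_drop(baseline, perturbed):.1f} pp")
    print(f"p-value: {mcnemar_exact(baseline, perturbed):.5f}")
```

Reporting the drop in percentage points (absolute difference) rather than as a relative change matches how the 27-56 point figure is stated in the summary.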
Topics
- GUI Grounding Models
- Domain Randomization
- Spatial Reasoning
- Visual Robustness
- LoRA Fine-tuning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.