GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models

2026-04-15 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

GUI grounding models report over 85% accuracy on standard benchmarks, yet their accuracy drops by 27 to 56 percentage points when an instruction requires spatial reasoning rather than direct element identification. Existing benchmarks miss this weakness because they pair each screenshot with a single, fixed instruction. The authors introduce GUI-Perturbed, a controlled perturbation framework that varies the visual scene and the instruction independently to measure grounding robustness. Across three 7B models from the same architectural family, relational instructions caused a systematic accuracy collapse in every model. A 70% browser zoom also produced statistically significant degradation, and rank-8 LoRA fine-tuning on augmented data made performance worse rather than better.

Key takeaway

If you develop or deploy GUI grounding models, prioritize robustness testing beyond standard benchmarks. Your models likely struggle with spatial reasoning and with visual perturbations such as browser zoom, which single-instruction benchmarks miss. Use frameworks like GUI-Perturbed to diagnose specific weaknesses, and validate any fine-tuning on augmented data, since it can degrade performance instead of improving it.

Key insights

GUI grounding models are brittle: accuracy falls 27 to 56 percentage points under relational (spatial) instructions and degrades significantly under a 70% browser zoom.

Principles

Benchmarks need varied instructions.
Spatial reasoning is a key weakness.
Fine-tuning on augmented data can degrade performance.

Method

GUI-Perturbed independently varies visual scenes and instructions to measure grounding robustness, isolating affected capability axes like spatial reasoning and visual robustness.
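
The idea of crossing scene variants with instruction variants can be sketched roughly as follows. This is a minimal illustration, not the GUI-Perturbed implementation: the `Box`, `Sample`, and `evaluate_grid` names, the scene labels such as "zoom_70", and the `predict` callback (which stands in for a grounding model returning a click point) are all hypothetical. It also assumes each sample already carries the target's bounding box for every scene variant.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, Sequence, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of the target element, in pixels."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


@dataclass(frozen=True)
class Sample:
    """One screenshot with several scene and instruction variants (hypothetical schema)."""
    screenshot_id: str
    instructions: Dict[str, str]  # instruction variant -> instruction text
    targets: Dict[str, Box]       # scene variant -> target box in that scene


# Hypothetical model interface: (screenshot_id, scene, instruction) -> (x, y) click point.
Predict = Callable[[str, str, str], Tuple[float, float]]


def evaluate_grid(
    samples: Sequence[Sample],
    predict: Predict,
    scenes: Sequence[str],
    variants: Sequence[str],
) -> Dict[Tuple[str, str], float]:
    """Return accuracy for every (scene, instruction variant) cell.

    Scene and instruction are varied independently, so a drop along one
    axis can be attributed to that axis alone.
    """
    hits: Dict[Tuple[str, str], int] = defaultdict(int)
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for sample, scene, variant in product(samples, scenes, variants):
        x, y = predict(sample.screenshot_id, scene, sample.instructions[variant])
        cell = (scene, variant)
        totals[cell] += 1
        hits[cell] += int(sample.targets[scene].contains(x, y))
    return {cell: hits[cell] / totals[cell] for cell in totals}
```

Comparing cells that share a scene but differ in instruction variant isolates the instruction axis, and comparing cells that share an instruction variant but differ in scene isolates the visual axis.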

In practice

Test models with relational instructions.
Evaluate performance under UI zoom.
Re-evaluate LoRA fine-tuning strategies.
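
One way to act on these checks is to score the same samples under a baseline and a perturbed condition, then report the accuracy drop in percentage points together with a paired significance test. The sketch below uses an exact McNemar test on the discordant pairs. That choice is an assumption for illustration: the summary says the zoom degradation was statistically significant but does not name the test used. The function names are hypothetical.

```python
from math import comb
from typing import Sequence


def accuracy_drop_pp(baseline: Sequence[bool], perturbed: Sequence[bool]) -> float:
    """Accuracy drop from baseline to perturbed, in percentage points."""
    if len(baseline) != len(perturbed) or not baseline:
        raise ValueError("need two non-empty, equally long result lists")
    return 100.0 * (sum(baseline) - sum(perturbed)) / len(baseline)


def mcnemar_exact(baseline: Sequence[bool], perturbed: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for paired correct/incorrect outcomes.

    Only discordant pairs matter: samples the model got right in exactly
    one of the two conditions. Under the null hypothesis, each discordant
    pair is equally likely to favor either condition.
    """
    if len(baseline) != len(perturbed):
        raise ValueError("result lists must be paired")
    only_baseline = sum(1 for b, p in zip(baseline, perturbed) if b and not p)
    only_perturbed = sum(1 for b, p in zip(baseline, perturbed) if p and not b)
    n = only_baseline + only_perturbed
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(only_baseline, only_perturbed)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

The same pair of functions applies to a comparison between a base model and its LoRA fine-tuned variant: pass the base model's per-sample results as `baseline` and the fine-tuned model's as `perturbed`.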

Topics

GUI Grounding Models
Domain Randomization
Spatial Reasoning
Visual Robustness
LoRA Fine-tuning

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.