Training Computer Use Agents to Assess the Usability of Graphical User Interfaces

2026-02-06 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Software Development & Engineering · Depth: Expert, extended

Summary

A new machine learning method trains computer use agents (CUAs) to assess the usability of graphical user interfaces (GUIs), targeting the high cost and long turnaround of traditional usability testing. The method, detailed in a 2026 paper, operationalizes a computational definition of usability in three steps: prioritizing interaction flows, executing human-like interactions, and predicting a numerical usability score. The researchers developed uxCUA, a CUA trained on uxWeb, a large-scale dataset of 2,586 interactive UIs with usability labels and human preferences. uxWeb was created by synthetically augmenting clones of popular websites with known usability defects derived from Shneiderman's Eight Golden Rules of Interface Design. In usability assessment accuracy, uxCUA significantly outperforms larger models such as GPT-5-mini and Claude-3-Opus, reaching 85% agreement with synthetic ground truths and 72% agreement with human designer preferences. It also generates realistic, specific critiques, often referencing its "thought history" to explain the issues it identifies.

Key takeaway

For AI scientists and machine learning engineers building UI evaluation tools, this research indicates that fine-tuning a specialized CUA on a large, defect-augmented dataset significantly improves automated usability assessment over general-purpose vision-language models (VLMs). Consider building custom datasets with controlled defect injection and human preference labels, rather than relying solely on off-the-shelf foundation models, to obtain higher accuracy and more actionable critiques on complex UI evaluation tasks.

Key insights

A new CUA, uxCUA, accurately assesses GUI usability and generates critiques by learning from a large, defect-augmented UI dataset.

Principles

Automated usability assessment requires data-driven methods.
Synthetic defect injection scales dataset creation.
Human preferences align models with subjective usability.
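One way to make "agreement with human preferences" concrete is a pairwise check: for each pair of UI variants, does the model score the preferred variant higher? The sketch below assumes that pairwise setup; the summary does not say how the paper computes its agreement figures, and the function name and the "a"/"b" label format are hypothetical.

```python
from typing import Sequence


def pairwise_agreement(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    preferred: Sequence[str],
) -> float:
    """Fraction of UI pairs where the higher predicted score matches the label.

    `preferred` holds "a" or "b" per pair (a hypothetical label format).
    A tie in predicted scores counts as a disagreement.
    """
    if not (len(scores_a) == len(scores_b) == len(preferred)):
        raise ValueError("all inputs must have the same length")
    if not preferred:
        raise ValueError("at least one pair is required")

    hits = 0
    for a, b, label in zip(scores_a, scores_b, preferred):
        if a > b:
            predicted = "a"
        elif b > a:
            predicted = "b"
        else:
            predicted = "tie"
        hits += predicted == label
    return hits / len(preferred)
```

The same function can be run against synthetic labels (the clean variant is always preferred) and against designer labels, which gives two separate agreement figures for one model.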

Method

The training algorithm has three stages: rollout generation; reward assignment based on navigation quality and usability assessment accuracy (using a contrastive learning objective); and policy updates via Proximal Policy Optimization (PPO).
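A minimal sketch of how the reward and the policy update could fit together follows. Only the PPO clipped surrogate is the standard published form. The pairwise reward is a stand-in for the contrastive objective, which the summary does not specify, and the linear weighting, the function names, and the parameters `margin`, `nav_weight`, and `clip_eps` are assumptions for illustration.

```python
import math


def contrastive_pair_reward(
    score_clean: float, score_defective: float, margin: float = 0.0
) -> float:
    """Assumed stand-in for the contrastive objective: reward 1.0 when the
    clean UI is scored above its defect-injected variant by more than
    `margin`, otherwise 0.0."""
    return 1.0 if score_clean - score_defective > margin else 0.0


def rollout_reward(
    navigation_quality: float, assessment_reward: float, nav_weight: float = 0.5
) -> float:
    """Assumed linear blend of the two reward terms named in the text."""
    if not 0.0 <= nav_weight <= 1.0:
        raise ValueError("nav_weight must be in [0, 1]")
    return nav_weight * navigation_quality + (1.0 - nav_weight) * assessment_reward


def ppo_clipped_objective(
    logp_new: float, logp_old: float, advantage: float, clip_eps: float = 0.2
) -> float:
    """Standard PPO clipped surrogate for a single action:
    min(r * A, clip(r, 1 - eps, 1 + eps) * A), with r = pi_new / pi_old."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

The clipping caps how much a single update can gain from moving the policy away from the one that generated the rollout, which matters here because each rollout is a long, costly interaction with a live UI.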

In practice

Use synthetic augmentation to create large-scale UI datasets.
Incorporate human feedback for subjective alignment.
Prioritize interaction flows for efficient usability testing.
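The first recommendation above, synthetic augmentation, can be sketched as controlled defect injection: take a clean UI, apply one mutation tied to a named usability rule, and keep both variants as a labeled pair. The dictionary-based UI representation, the two mutation functions, and all names below are hypothetical; uxWeb's actual pipeline operates on interactive website clones and is not described in this summary. The two rule names are taken from Shneiderman's Eight Golden Rules.

```python
import copy
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class LabeledVariant:
    ui: dict
    violated_rule: Optional[str]  # None marks the clean variant


def remove_feedback(ui: dict) -> None:
    """Drop success messages so form submissions give no confirmation."""
    for form in ui.get("forms", []):
        form["success_message"] = None


def remove_undo(ui: dict) -> None:
    """Drop the undo action so changes cannot be reversed."""
    ui["actions"] = [a for a in ui.get("actions", []) if a != "undo"]


# Hypothetical mapping from golden rule to the mutation that violates it.
DEFECTS: Dict[str, Callable[[dict], None]] = {
    "Offer informative feedback": remove_feedback,
    "Permit easy reversal of actions": remove_undo,
}


def build_pair(clean_ui: dict, rule: str) -> Tuple[LabeledVariant, LabeledVariant]:
    """Return (clean, defective) variants; the input dict is left untouched."""
    defective_ui = copy.deepcopy(clean_ui)
    DEFECTS[rule](defective_ui)
    return (
        LabeledVariant(copy.deepcopy(clean_ui), None),
        LabeledVariant(defective_ui, rule),
    )
```

Because each pair differs by exactly one known defect, the usability label comes for free, and the same pairs can later be shown to designers to collect preference labels.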

Topics

Computer Use Agents
GUI Usability Assessment
Machine Learning Training
uxWeb Dataset
Usability Defect Injection

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.