UXBench: Measuring the Actionability of LLM-Generated UX Critiques

2026-05-26 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

UXBench is a new benchmark that measures the actionability and reliability of User Experience (UX) critiques generated by Large Language Models (LLMs) across diverse product surfaces. It comprises local-first, runnable web fixtures spanning ten product-surface families, including landing pages, checkout flows, and dashboards. The benchmark requires LLM judges to collect interaction evidence through coverage-gated browser exploration before they generate a structured UX report across seven rubric dimensions. Report quality is quantified by whether a fixed downstream repair agent can improve the interface based on the critique. An evaluation of eight frontier models, including GPT-5.4, Claude-Sonnet-4.6, and Gemini-3.1-Pro, showed that UX judging is neither saturated nor one-dimensional. GPT-5.4 achieved the largest repair lift (+0.22) and Gemini-3.1-Pro the smallest (+0.14), a 0.08-point spread on the 1–5 rubric scale. Models also exhibited distinct rubric-level repair signatures and varied in fixture-level reliability and competence across surface categories.
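As a rough illustration of the repair-lift arithmetic, the sketch below assumes that lift is the post-repair rubric score minus the pre-repair score, averaged over dimensions. That definition, the function name `repair_lift`, and the placeholder dimension keys are assumptions made for illustration; they are not taken from the UXBench code. Only the two aggregate lifts (+0.22 and +0.14) come from the summary.

```python
from statistics import mean


def repair_lift(before: dict[str, float], after: dict[str, float]) -> dict[str, float]:
    """Per-dimension lift (after - before) plus the mean lift under 'overall'.

    Assumption: scores are on the 1-5 rubric scale and both dicts use the
    same (hypothetical) dimension keys.
    """
    if before.keys() != after.keys():
        raise ValueError("before and after must score the same dimensions")
    lifts = {dim: after[dim] - before[dim] for dim in before}
    lifts["overall"] = mean(lifts[dim] for dim in before)
    return lifts


# Aggregate lifts reported in the summary; the spread is their difference.
reported = {"GPT-5.4": 0.22, "Gemini-3.1-Pro": 0.14}
spread = round(max(reported.values()) - min(reported.values()), 2)
```

Rounding to two decimals keeps the spread readable despite floating-point subtraction.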

Key takeaway

AI engineers developing or integrating LLMs for UX evaluation should prioritize models capable of interaction-grounded critique. Select a model based on its demonstrated actionability across specific UX dimensions and product-surface types, since aggregate scores can mask uneven performance. Validate the chosen model's output with human expert review, especially for critical interfaces, to confirm that perceived quality aligns with automated repair lift.

Key insights

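The actionability of LLM-generated UX critiques varies across models (repair lift ranges from +0.14 to +0.22 on the 1–5 rubric scale), so evaluation needs interaction-grounded evidence rather than static analysis alone.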

Principles

UX judging is multi-dimensional, not saturated.
Model strengths vary across UX rubric dimensions.
Reliability differs across interface types.

Method

UXBench has each LLM judge explore web fixtures under coverage-gated browsing and generate an evidence-grounded report. A fixed repair agent then acts on that report, and the resulting interface improvement serves as the measure of report quality.
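One way to picture the coverage gate is as a check that blocks report generation until the judge has exercised enough of a fixture's required interactions. The sketch below is a minimal illustration under that assumption; the class `CoverageGate`, its fields, the interaction IDs, and the default threshold are all hypothetical and do not describe the actual UXBench implementation.

```python
from dataclasses import dataclass, field


@dataclass
class CoverageGate:
    """Hypothetical gate: reporting is allowed only once coverage meets the threshold."""

    required: frozenset[str]          # assumed IDs of interactions the judge must exercise
    threshold: float = 1.0            # assumed fraction of required interactions needed
    seen: set[str] = field(default_factory=set)

    def record(self, interaction_id: str) -> None:
        # Only interactions on the required list count toward coverage.
        if interaction_id in self.required:
            self.seen.add(interaction_id)

    def coverage(self) -> float:
        # An empty requirement list is treated as fully covered.
        return len(self.seen) / len(self.required) if self.required else 1.0

    def may_report(self) -> bool:
        return self.coverage() >= self.threshold


gate = CoverageGate(required=frozenset({"open_cart", "apply_coupon", "submit_order"}))
gate.record("open_cart")
can_report_early = gate.may_report()   # still False: two required interactions remain
```

Keeping the gate separate from the browsing logic means the same check could wrap any exploration strategy.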

In practice

Use interaction-grounded LLM evaluation.
Compare models on specific UX dimensions.
Validate automated scores with human review.
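To make the point about per-dimension comparison concrete, the sketch below picks the strongest model for each rubric dimension from a table of lifts. The numbers, model names, and dimension keys are invented for illustration only and are not results from the paper; they are chosen so that both models have the same mean lift while differing on every dimension.

```python
from statistics import mean


def best_per_dimension(lifts: dict[str, dict[str, float]]) -> dict[str, str]:
    """Map each rubric dimension to the model with the highest lift on it."""
    dims = next(iter(lifts.values())).keys()
    return {dim: max(lifts, key=lambda model: lifts[model][dim]) for dim in dims}


# Made-up lifts (NOT from UXBench): identical averages, opposite strengths.
example = {
    "model_a": {"dim_1": 0.30, "dim_2": 0.10},
    "model_b": {"dim_1": 0.10, "dim_2": 0.30},
}

winners = best_per_dimension(example)
averages = {model: round(mean(scores.values()), 2) for model, scores in example.items()}
```

Here `averages` cannot separate the two models, while `winners` shows which one to prefer for each dimension.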

Topics

LLM Evaluation
User Experience (UX) Critique
Web Agents
Benchmark Development
Interface Repair
Usability Testing

Code references

Jackwwj619/UXBench

Best for: AI Scientist, Research Scientist, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.