UXBench: Measuring the Actionability of LLM-Generated UX Critiques

Source: cs.SE updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering, Robotics & Autonomous Systems) · Depth: Expert, extended

Summary

UXBench is a new benchmark that measures the actionability and reliability of User Experience (UX) critiques generated by Large Language Models (LLMs) across diverse product surfaces. It comprises local-first, runnable web fixtures spanning ten product-surface families, including landing pages, checkout flows, and dashboards. The benchmark requires LLM judges to collect interaction evidence through coverage-gated browser exploration before they generate a structured UX report across seven rubric dimensions. Report quality is quantified by whether a fixed downstream repair agent can improve the interface based on the critique. An evaluation of eight frontier models, including GPT-5.4, Claude-Sonnet-4.6, and Gemini-3.1-Pro, showed that UX judging is neither saturated nor one-dimensional. GPT-5.4 achieved the largest repair lift (+0.22) and Gemini-3.1-Pro the smallest (+0.14), a 0.08-point spread on the 1–5 rubric scale. Models also exhibited distinct rubric-level repair signatures and varied in fixture-level reliability and in competence across surface categories.
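
The repair-lift arithmetic can be illustrated with a minimal Python sketch. It assumes that lift is the mean rubric score after repair minus the mean score before repair; the summary does not give the paper's exact formula, so `repair_lift` and the dimension names in the usage note are hypothetical. Only the two reported lifts (+0.22 and +0.14) come from the summary.

```python
from statistics import mean


def repair_lift(before: dict[str, float], after: dict[str, float]) -> float:
    """Mean rubric score after repair minus mean score before repair.

    Assumption: both dicts map the same rubric dimensions to 1-5 scores.
    The aggregation used by the benchmark itself may differ.
    """
    if before.keys() != after.keys():
        raise ValueError("before and after must score the same rubric dimensions")
    return mean(after.values()) - mean(before.values())


# Lifts reported in the summary for the best and worst of the eight models.
reported_lift = {"GPT-5.4": 0.22, "Gemini-3.1-Pro": 0.14}

# Spread between the largest and smallest lift, rounded to two decimals.
spread = round(max(reported_lift.values()) - min(reported_lift.values()), 2)
```

For example, with two made-up dimensions, `repair_lift({"clarity": 3.0, "feedback": 4.0}, {"clarity": 3.5, "feedback": 4.0})` gives 0.25, and `spread` reproduces the 0.08-point gap quoted above.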

Key takeaway

AI engineers developing or integrating LLMs for UX evaluation should prioritize models capable of interaction-grounded critique. Select a model based on its demonstrated actionability for the specific UX dimensions and product surface types you care about, because aggregate scores can mask uneven performance. Validate the chosen model's output with human expert review, especially for critical interfaces, to confirm that perceived quality aligns with automated repair lift.

Key insights

The actionability of LLM-generated UX critiques varies across models, rubric dimensions, and surface categories, so evaluation needs to be grounded in interaction evidence rather than static analysis alone.

Method

UXBench has each LLM judge explore web fixtures under coverage-gated browsing and generate an evidence-grounded report; it then measures how much a fixed repair agent improves the interface when working from that report.
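
The idea of coverage gating can be sketched as follows. This is an illustrative Python sketch only, not the benchmark's implementation: the `ExplorationLog` class, the `can_report` function, the element identifiers, and the 0.8 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ExplorationLog:
    """Tracks which required interactive elements a judge has exercised."""

    required: frozenset[str]                        # hypothetical element ids
    visited: set[str] = field(default_factory=set)

    def record(self, element_id: str) -> None:
        # Only interactions with required elements count toward coverage.
        if element_id in self.required:
            self.visited.add(element_id)

    def coverage(self) -> float:
        # An empty requirement set is treated as fully covered.
        if not self.required:
            return 1.0
        return len(self.visited) / len(self.required)


def can_report(log: ExplorationLog, threshold: float = 0.8) -> bool:
    """Gate: allow report generation only once coverage reaches the threshold.

    The 0.8 default is an assumed value, not one taken from the source.
    """
    return log.coverage() >= threshold
```

Under these assumptions, a judge that has exercised three of five required elements (coverage 0.6) is blocked from reporting, while a fourth interaction (coverage 0.8) opens the gate.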

Best for: AI Scientist, Research Scientist, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.