UXBench: Measuring the Actionability of LLM-Generated UX Critiques
Summary
UXBench is a new benchmark that measures the actionability and reliability of User Experience (UX) critiques generated by Large Language Models (LLMs) across diverse product surfaces. It comprises local-first, runnable web fixtures spanning ten product-surface families, including landing pages, checkout flows, and dashboards. The benchmark forces LLM judges to collect interaction evidence through coverage-gated browser exploration before generating a structured UX report across seven rubric dimensions. Report quality is quantified by whether a fixed downstream repair agent can improve the interface based on the critique. An evaluation of eight frontier models, including GPT-5.4, Claude-Sonnet-4.6, and Gemini-3.1-Pro, showed that UX judging is neither saturated nor one-dimensional. GPT-5.4 achieved the largest repair lift (+0.22) and Gemini-3.1-Pro the smallest (+0.14), a 0.08-point spread on the 1–5 rubric scale. Models also exhibited distinct rubric-level repair signatures and varied in fixture-level reliability and in competence across surface categories.
Key takeaway
AI engineers developing or integrating LLMs for UX evaluation should prioritize models capable of interaction-grounded critique. Select a model based on its demonstrated actionability on the specific UX dimensions and product-surface types you care about, because aggregate scores can mask uneven performance. Validate the chosen LLM's output with human expert review, especially for critical interfaces, to confirm that perceived quality aligns with automated repair lift.
Key insights
The actionability of LLM-generated UX critiques varies by model, rubric dimension, and surface type, so evaluation needs interaction-grounded evidence rather than static analysis alone.
Principles
- UX judging is multi-dimensional, not saturated.
- Model strengths vary across UX rubric dimensions.
- Reliability differs across interface types.
Method
UXBench evaluates LLMs in three steps: each model explores web fixtures through coverage-gated browsing, generates an evidence-grounded report, and a fixed repair agent then applies that report so the resulting interface improvement can be measured.
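As a rough illustration of the two measurable pieces in this pipeline, the sketch below shows a coverage gate and a repair-lift calculation. It is not UXBench's implementation: the function names, the coverage threshold, and the rule that lift is the mean per-dimension change in 1–5 rubric scores are all assumptions made for this example.

```python
from statistics import mean


def coverage_met(visited, required, threshold=1.0):
    """Return True when the visited share of required interaction targets
    reaches the threshold. Both the target lists and the threshold are
    hypothetical; the benchmark's actual gate may be defined differently."""
    required_set = set(required)
    if not required_set:
        return True
    covered = len(set(visited) & required_set)
    return covered / len(required_set) >= threshold


def repair_lift(before, after):
    """Return (per-dimension change, mean change) in rubric scores.

    `before` and `after` map rubric dimension names to 1-5 scores for the
    same fixture, scored before and after the repair agent applied the
    critique. Averaging over dimensions is an assumption of this sketch."""
    if before.keys() != after.keys():
        raise ValueError("before and after must score the same dimensions")
    per_dim = {dim: after[dim] - before[dim] for dim in before}
    return per_dim, mean(list(per_dim.values()))
```

Under these assumptions, a report would only be scored once `coverage_met` returns True for the exploration trace, and a positive mean from `repair_lift` would indicate that the critique helped the repair agent improve the interface.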
In practice
- Use interaction-grounded LLM evaluation.
- Compare models on specific UX dimensions.
- Validate automated scores with human review.
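The second point above can be made concrete with a small comparison helper. This is a minimal sketch, assuming per-dimension repair lifts are already available as nested dictionaries; the model names, dimension names, and numbers are placeholders for illustration, not results from the benchmark.

```python
def aggregate_lift(lifts):
    """Mean lift per model across whatever dimensions it was scored on."""
    return {model: sum(d.values()) / len(d) for model, d in lifts.items()}


def best_per_dimension(lifts):
    """Map each rubric dimension to the model with the largest lift on it."""
    dimensions = sorted({dim for d in lifts.values() for dim in d})
    return {
        dim: max(
            (model for model in lifts if dim in lifts[model]),
            key=lambda model: lifts[model][dim],
        )
        for dim in dimensions
    }


# Placeholder data: hypothetical models and dimensions, made-up lifts.
example_lifts = {
    "model_a": {"dim_1": 0.30, "dim_2": 0.10},
    "model_b": {"dim_1": 0.15, "dim_2": 0.25},
}

if __name__ == "__main__":
    print(aggregate_lift(example_lifts))
    print(best_per_dimension(example_lifts))
```

In this made-up example the two models have nearly identical aggregate lift yet lead on different dimensions, which is the kind of unevenness an aggregate score can hide.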
Topics
- LLM Evaluation
- User Experience (UX) Critique
- Web Agents
- Benchmark Development
- Interface Repair
- Usability Testing
Best for: AI Scientist, Research Scientist, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.