Avenir-UX: Automated UX Evaluation via Simulated Human Web Interaction with GUI Grounding

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering, Data Science & Analytics) · Depth: Expert, extended

Summary

OpenFlo is a user-experience (UX) evaluation agent that automates web usability testing by simulating human interaction with websites. Built on the Avenir-Web framework (Li et al., 2026), OpenFlo uses GUI grounding to interact with real web pages, where traditional tools rely solely on DOM parsing. The system combines simulated user behavior profiles with a structured evaluation protocol: the System Usability Scale (SUS) (Brooke, 1996), step-wise Single Ease Questions (SEQ) (Sauro and Dumas, 2009), and concurrent Think Aloud reasoning. The output is a UX report that pairs quantitative scores with qualitative insights and identifies specific friction points. In a case study, a Recreation.gov task yielded a SUS score of 55.0/100.0 (Grade D), while a Discogs task yielded 87.5/100.0 (Grade A+), showing that the agent can handle complex web environments and produce actionable feedback.
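The SUS figures quoted above follow the standard scoring rule from Brooke (1996): ten items rated 1 to 5, with odd-numbered items contributing (response - 1), even-numbered items contributing (5 - response), and the sum scaled by 2.5. The following is a minimal sketch of that arithmetic, not OpenFlo's own code. The function names are hypothetical, and the letter-grade boundaries are assumed to follow the Sauro-Lewis curved grading scale, which is consistent with the grades reported here but is not named in this summary.

```python
def sus_score(responses):
    """Score ten SUS items (each rated 1-5) on the standard 0-100 scale."""
    if len(responses) != 10 or any(r not in range(1, 6) for r in responses):
        raise ValueError("SUS needs ten responses, each an integer from 1 to 5")
    # Index 0, 2, 4, ... are the odd-numbered (positively worded) items.
    # Index 1, 3, 5, ... are the even-numbered (negatively worded) items.
    total = sum((r - 1) if i % 2 == 0 else (5 - r)
                for i, r in enumerate(responses))
    return total * 2.5


# Assumption: Sauro-Lewis curved grading scale (lower bound of each grade).
GRADE_FLOORS = [
    (84.1, "A+"), (80.8, "A"), (78.9, "A-"),
    (77.2, "B+"), (74.1, "B"), (72.6, "B-"),
    (71.1, "C+"), (65.0, "C"), (62.7, "C-"),
    (51.7, "D"),
]


def sus_grade(score):
    """Map a 0-100 SUS score to a letter grade under the assumed scale."""
    for floor, grade in GRADE_FLOORS:
        if score >= floor:
            return grade
    return "F"
```

Under this rule, ten neutral responses of 3 score 50.0, and the reported scores of 55.0 and 87.5 map to D and A+ respectively.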

Key takeaway

For research scientists developing web applications, OpenFlo offers a way to build continuous UX evaluation into the software development lifecycle. Consider adopting it to automate usability testing: its combined quantitative and qualitative reports point to specific friction points, which supports faster iteration toward products that meet user needs.

Key insights

OpenFlo automates web UX evaluation by simulating human interaction with GUI grounding and combining quantitative and qualitative metrics.

Principles

Method

OpenFlo employs a multimodal grounding approach that combines DOM parsing with coordinate-based visual tagging. A central multimodal large language model (MLLM) reasons about the best next step, while an adaptive memory and a dynamic checklist maintain context across the task. The evaluation pipeline collects Think Aloud reasoning during the task, an SEQ rating after each step, and a SUS assessment after the task.
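One way to picture how step-wise SEQ ratings and Think Aloud notes could be turned into friction points is sketched below. This is an illustration under stated assumptions, not the authors' implementation: `StepRecord`, `friction_points`, `summarize`, and the threshold of 3 are all hypothetical. The only fixed convention used is the SEQ scale itself, which runs from 1 (very difficult) to 7 (very easy).

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class StepRecord:
    """One simulated interaction step (hypothetical structure)."""
    action: str       # what the simulated user did
    think_aloud: str  # concurrent reasoning recorded at this step
    seq: int          # SEQ rating: 1 (very difficult) to 7 (very easy)


def friction_points(steps, threshold=3):
    """Return steps rated at or below the (assumed) difficulty threshold."""
    return [s for s in steps if s.seq <= threshold]


def summarize(steps):
    """Aggregate step-wise ratings into a small report dictionary."""
    return {
        "mean_seq": mean(s.seq for s in steps),
        "friction": [(s.action, s.think_aloud) for s in friction_points(steps)],
    }


# Illustrative records only; these are not results from the paper.
steps = [
    StepRecord("open search page", "The search box is easy to find.", 6),
    StepRecord("pick a date", "The calendar widget does not respond.", 2),
    StepRecord("submit form", "The button is clearly labeled.", 7),
]
report = summarize(steps)
```

Keeping the Think Aloud text next to each low rating is what lets a report explain why a step was hard, not only that it was.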

Best for: Research Scientist, AI Scientist, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.