Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

The paper introduces UXBench, a multimodal benchmark of 2,000 visual question answering (VQA) samples designed to evaluate how well Multimodal Large Language Models (MLLMs) reason about user experience (UX) from UI screenshots. UXBench comprises 8 tasks derived from real-world UI screenshots, each requiring fine-grained diagnosis of UX issues across layout, visual hierarchy, and content consistency. The authors find that current mainstream MLLMs fall short on this kind of reasoning. To address this, they propose UI-UX, an MLLM built on Qwen3-VL-4B-Thinking and enhanced via reinforcement learning. UI-UX uses a reward routing mechanism and an asymmetric transition reward to balance perceptual understanding and logical reasoning while suppressing redundant steps. It achieves a state-of-the-art accuracy of 0.7963 on UXBench, ahead of Claude-4.5-Sonnet's 0.6550 (a gap of about 14 percentage points), and the authors also report strong generalization and low inference latency.

Key takeaway

For AI Scientists and Machine Learning Engineers developing MLLMs for UI/UX evaluation, this research shows that current models struggle with experience-based reasoning. Consider reinforcement learning with task-adaptive reward routing and asymmetric transition rewards, as demonstrated by UI-UX, to improve diagnostic accuracy and reasoning efficiency. Hard negative mining and multi-domain training are crucial for building robust, generalizable MLLMs that can identify complex UX issues rather than only simple visual defects.

Key insights

MLLMs can diagnose complex UI/UX issues more accurately when training combines task-adaptive rewards with penalties on redundant reasoning.

Method

UI-UX enhances Qwen3-VL-4B-Thinking via reinforcement learning. A reward routing mechanism scores each training sample with the metric suited to its task (accuracy, ROUGE-L, or hit reward), and an asymmetric transition reward penalizes redundant reasoning steps.
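The routing idea can be illustrated with a minimal sketch. It is not the authors' implementation: the task names, the `REWARD_ROUTES` table, the `routed_reward` helper, and the substring-based definition of the hit reward are all assumptions made for illustration, and ROUGE-L is computed here as a plain F1 over whitespace tokens. The asymmetric transition reward is left out because this summary does not specify its form.

```python
from typing import Callable, Dict, List


def exact_match(prediction: str, reference: str) -> float:
    """Accuracy-style reward: 1.0 if the normalized answers match, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def _lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction: str, reference: str) -> float:
    """ROUGE-L F1 over lowercase whitespace tokens."""
    p, r = prediction.lower().split(), reference.lower().split()
    if not p or not r:
        return 0.0
    lcs = _lcs_length(p, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def hit_reward(prediction: str, reference: str) -> float:
    """Assumed definition: 1.0 if the reference label occurs in the prediction."""
    ref = reference.strip().lower()
    if not ref:
        return 0.0
    return float(ref in prediction.lower())


# Hypothetical task types; the paper's 8 task names are not given in this summary.
REWARD_ROUTES: Dict[str, Callable[[str, str], float]] = {
    "multiple_choice": exact_match,
    "issue_description": rouge_l,
    "issue_detection": hit_reward,
}


def routed_reward(task_type: str, prediction: str, reference: str) -> float:
    """Look up the metric registered for the task type and apply it."""
    try:
        reward_fn = REWARD_ROUTES[task_type]
    except KeyError:
        raise ValueError(f"no reward registered for task type {task_type!r}") from None
    return reward_fn(prediction, reference)
```

Keeping the routes in a dictionary means a new task type only needs a new entry, and an unregistered task fails loudly instead of silently receiving a default score.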

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.