Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach

2025-02-25 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

The paper introduces UXBench, a multimodal benchmark of 2,000 VQA samples that evaluates how well Multimodal Large Language Models (MLLMs) reason about user experience (UX) from UI screenshots. UXBench comprises 8 tasks derived from real-world UI screenshots, each requiring fine-grained diagnosis of UX issues across layout, visual hierarchy, and content consistency. The authors find that current mainstream MLLMs show limitations on these tasks. To address this, they propose UI-UX, an MLLM built on Qwen3-VL-4B-Thinking and further trained with reinforcement learning. UI-UX uses a reward routing mechanism and an asymmetric transition reward to balance perceptual understanding and logical reasoning while suppressing redundant steps. It reaches a reported state-of-the-art accuracy of 0.7963 on UXBench, roughly 14 percentage points above Claude-4.5-Sonnet's 0.6550, and the authors also report strong generalization and low inference latency.

Key takeaway

For AI Scientists and Machine Learning Engineers developing MLLMs for UI/UX evaluation, this research highlights that current models struggle with experience-based reasoning. You should consider adopting reinforcement learning with task-adaptive reward routing and asymmetric transition rewards, as demonstrated by UI-UX, to improve diagnostic accuracy and reasoning efficiency. Integrating hard negative mining and multi-domain training is crucial for building robust, generalizable MLLMs capable of identifying complex UX issues beyond simple visual defects.

Key insights

MLLMs can be trained to diagnose complex UI/UX issues when reinforcement learning combines task-adaptive rewards with penalties on inefficient reasoning.

Principles

UX issues require causal reasoning beyond pixel perception.
Reward routing optimizes MLLMs for heterogeneous tasks.
Asymmetric rewards balance reasoning sufficiency and conciseness.

Method

UI-UX is Qwen3-VL-4B-Thinking further trained with reinforcement learning. A reward routing mechanism assigns each task a suitable metric (accuracy, ROUGE-L, or hit reward), and an asymmetric transition reward penalizes redundant reasoning steps.
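
As a rough illustration of reward routing, the sketch below dispatches each sample to one of the three metrics named above according to its task type. It is not the authors' implementation: the task-type names (`multiple_choice`, `open_ended`, `issue_localization`), the mapping from task type to metric, and the definition of the hit reward as "the reference label appears in the prediction" are all assumptions made for the example. ROUGE-L is computed here as a plain LCS-based F1 over whitespace tokens.

```python
from typing import Callable, Dict, List


def accuracy_reward(pred: str, ref: str) -> float:
    """1.0 on exact match after case/whitespace normalisation, else 0.0."""
    return float(pred.strip().lower() == ref.strip().lower())


def _lcs_len(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_reward(pred: str, ref: str) -> float:
    """ROUGE-L F1 over lowercase whitespace tokens."""
    p, r = pred.lower().split(), ref.lower().split()
    if not p or not r:
        return 0.0
    lcs = _lcs_len(p, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def hit_reward(pred: str, ref: str) -> float:
    """Assumed definition: 1.0 if the reference label occurs in the prediction."""
    return float(ref.strip().lower() in pred.lower())


# Hypothetical routing table: task-type names and assignments are illustrative.
REWARD_ROUTES: Dict[str, Callable[[str, str], float]] = {
    "multiple_choice": accuracy_reward,
    "open_ended": rouge_l_reward,
    "issue_localization": hit_reward,
}


def routed_reward(task_type: str, pred: str, ref: str) -> float:
    """Score a prediction with the metric registered for its task type."""
    return REWARD_ROUTES[task_type](pred, ref)
```

Keeping the routing in a dictionary means a new task type only needs one extra entry, and an unregistered task type fails loudly with a `KeyError` rather than silently receiving the wrong metric.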

In practice

Use UXBench to evaluate MLLM UI-UX reasoning.
Implement hard negative mining for imbalanced datasets.
Apply multi-domain training for MLLM generalization.
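
The hard negative mining item can be sketched as follows. The summary does not describe the authors' procedure, so this shows only the generic technique: among samples with no UX issue, keep the ones the current model most confidently mistakes for problematic screens. The `Sample` fields and the `mine_hard_negatives` helper are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    sample_id: str
    has_issue: bool      # ground-truth label (True = screenshot has a UX issue)
    issue_score: float   # current model's predicted probability of an issue


def mine_hard_negatives(samples: List[Sample], k: int) -> List[Sample]:
    """Return the k issue-free samples the model scores as most issue-like."""
    negatives = [s for s in samples if not s.has_issue]
    negatives.sort(key=lambda s: s.issue_score, reverse=True)
    return negatives[:k]


pool = [
    Sample("a", has_issue=False, issue_score=0.91),  # confident false alarm
    Sample("b", has_issue=True, issue_score=0.88),   # true positive, ignored
    Sample("c", has_issue=False, issue_score=0.12),  # easy negative
    Sample("d", has_issue=False, issue_score=0.67),  # moderately hard negative
]
hard = mine_hard_negatives(pool, k=2)
hard_ids = [s.sample_id for s in hard]
```

The mined samples would then be oversampled or re-weighted in the next training round, and the scores refreshed as the model changes.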

Topics

Multimodal LLMs
User Experience
UI Reasoning
Reinforcement Learning
UXBench
Qwen3-VL-4B-Thinking

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.