Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach
Summary
The paper introduces UXBench, a multimodal benchmark of 2,000 VQA samples that evaluates how well Multimodal Large Language Models (MLLMs) reason about user experience (UX) from UI screenshots. UXBench comprises 8 tasks derived from real-world UI screenshots, each requiring fine-grained diagnosis of UX issues across layout, visual hierarchy, and content consistency. Current mainstream MLLMs show limitations in this area. To address this, the authors propose UI-UX, an MLLM built on Qwen3-VL-4B-Thinking and enhanced via reinforcement learning. UI-UX incorporates a reward routing mechanism and an asymmetric transition reward to balance perceptual understanding and logical reasoning while suppressing redundant steps. UI-UX reaches a state-of-the-art accuracy of 0.7963 on UXBench, compared with 0.6550 for Claude-4.5-Sonnet (a gap of about 14 points), and the summary also reports strong generalization and low inference latency.
Key takeaway
For AI Scientists and Machine Learning Engineers developing MLLMs for UI/UX evaluation, this research highlights that current models struggle with experience-based reasoning. You should consider adopting reinforcement learning with task-adaptive reward routing and asymmetric transition rewards, as demonstrated by UI-UX, to improve diagnostic accuracy and reasoning efficiency. Integrating hard negative mining and multi-domain training is crucial for building robust, generalizable MLLMs capable of identifying complex UX issues beyond simple visual defects.
Key insights
MLLMs trained with task-adaptive rewards and penalties on redundant reasoning can diagnose complex UI/UX issues more accurately and more concisely.
Principles
- UX issues require causal reasoning beyond pixel perception.
- Reward routing optimizes MLLMs for heterogeneous tasks.
- Asymmetric rewards balance reasoning sufficiency and conciseness.
Method
UI-UX enhances Qwen3-VL-4B-Thinking via reinforcement learning. A reward routing mechanism scores each task with a metric suited to it (accuracy, ROUGE-L, or a hit reward), and an asymmetric transition reward penalizes redundant reasoning steps.
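A minimal sketch of how reward routing could be wired up, under stated assumptions. The task-type names (`choice`, `description`, `localization`), the definition of the hit reward (the prediction mentions any acceptable label), and the step budgets are all hypothetical; this summary does not give the paper's formulas. In particular, `length_shaping` is only a stand-in that illustrates an asymmetric penalty (excess reasoning steps cost more than missing ones) and should not be read as the paper's asymmetric transition reward.

```python
from typing import Callable, Dict, List


def exact_match(pred: str, gold: str) -> float:
    """Accuracy-style reward: 1.0 on a case-insensitive exact match."""
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0


def lcs_len(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(pred: str, gold: str) -> float:
    """ROUGE-L F-measure over whitespace tokens (no stemming)."""
    p, g = pred.lower().split(), gold.lower().split()
    if not p or not g:
        return 0.0
    lcs = lcs_len(p, g)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(g)
    return 2 * precision * recall / (precision + recall)


def hit_reward(pred: str, gold: str) -> float:
    """Assumed definition: gold is a comma-separated list of acceptable
    labels, and the prediction scores 1.0 if it mentions any of them."""
    labels = [s.strip().lower() for s in gold.split(",") if s.strip()]
    return 1.0 if any(label in pred.lower() for label in labels) else 0.0


# Hypothetical task types mapped to the metric that scores them.
ROUTES: Dict[str, Callable[[str, str], float]] = {
    "choice": exact_match,
    "description": rouge_l,
    "localization": hit_reward,
}


def routed_reward(task_type: str, pred: str, gold: str) -> float:
    """Dispatch to the metric registered for this task type."""
    return ROUTES[task_type](pred, gold)


def length_shaping(n_steps: int, low: int = 2, high: int = 6,
                   under: float = 0.05, over: float = 0.2) -> float:
    """Stand-in asymmetric penalty: zero inside [low, high], a mild penalty
    for too few reasoning steps, a steeper one for too many."""
    if n_steps < low:
        return -under * (low - n_steps)
    if n_steps > high:
        return -over * (n_steps - high)
    return 0.0


def total_reward(task_type: str, pred: str, gold: str, n_steps: int) -> float:
    """Task reward plus the reasoning-length shaping term."""
    return routed_reward(task_type, pred, gold) + length_shaping(n_steps)
```

A function like `total_reward` could serve as the scalar reward in a policy-gradient loop. Keeping the routing table separate from the shaping term makes it easy to swap either part once the actual definitions are known.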
In practice
- Use UXBench to evaluate an MLLM's UX reasoning over UI screenshots.
- Implement hard negative mining for imbalanced datasets.
- Apply multi-domain training for MLLM generalization.
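As an illustration of the hard negative mining point above, here is a minimal sketch that assumes a binary setup in which label 1 marks a screenshot with a UX issue, label 0 marks one without, and `scores` holds a model's predicted probability of an issue. The function name and this setup are hypothetical; the summary does not describe how the authors mine negatives.

```python
from typing import List, Sequence


def mine_hard_negatives(scores: Sequence[float],
                        labels: Sequence[int],
                        k: int) -> List[int]:
    """Return indices of the k negative samples (label 0) that the model
    scores highest, i.e. the negatives it is most confidently wrong about."""
    negatives = [(s, i) for i, (s, y) in enumerate(zip(scores, labels)) if y == 0]
    # Sort by score, highest first; ties keep their original order.
    negatives.sort(key=lambda pair: -pair[0])
    return [i for _, i in negatives[:k]]


# Example: sample 0 is a true positive; samples 3 and 2 are the hardest negatives.
scores = [0.9, 0.2, 0.7, 0.95, 0.1]
labels = [1, 0, 0, 0, 0]
hard = mine_hard_negatives(scores, labels, k=2)
```

The selected indices could then be oversampled or given extra weight in the next training round, so the model sees more of the clean screens it currently mistakes for defective ones.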
Topics
- Multimodal LLMs
- User Experience
- UI Reasoning
- Reinforcement Learning
- UXBench
- Qwen3-VL-4B-Thinking
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.