WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts
Summary
WebRISE is a new benchmark for evaluating Multimodal Large Language Model (MLLM)-generated web artifacts, addressing limitations of existing benchmarks that overlook requirement-induced states and transitions. It compiles task requirements into Interaction Contract Graphs (ICGs), defining observable states, user-intent transitions, and DOM/visual assertions for implementation-agnostic browser execution. WebRISE covers 442 tasks across five input modalities (Text, Markdown, Sketch, Image, Video), with 5,495 transitions and 5,271 requirement checks. Across 14 MLLMs, the strongest model achieved only 65.6% transition validity and 66.3% requirement coverage. Visual quality was not a proxy for behavior (e.g., Qwen3.6-35B-A3B on Markdown: V=80.8, T=15.5). Video input provided the strongest interaction signal (+10.6 pp implicit coverage over Text), and ICG-based scoring detected state errors at 2-16x the rate of checkpoint-style evaluation.
Key takeaway
AI engineers evaluating MLLM-generated web artifacts should move beyond visual checks and local evidence. WebRISE shows that even the strongest of 14 models reaches only 65.6% transition validity and 66.3% requirement coverage on requirement-induced states and transitions. Evaluation strategies should incorporate state-based Interaction Contract Graphs (ICGs) to assess functional correctness, since ICG-based scoring detected state errors at 2-16x the rate of checkpoint-style evaluation. Prioritize functional testing, and consider video inputs for richer task specifications: video gave the strongest interaction signal (+10.6 pp implicit coverage over Text).
Key insights
WebRISE introduces Interaction Contract Graphs (ICGs) to evaluate MLLM-generated web artifacts based on requirement-induced states and transitions, revealing current model limitations.
Principles
- Visual quality is not a proxy for functional behavior.
- State-based (ICG) scoring detects state errors at 2-16x the rate of checkpoint-style evaluation.
- Video input gives the strongest interaction signal (+10.6 pp implicit requirement coverage over Text).
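The first principle can be made concrete with the one data point the summary reports. The sketch below is illustrative only: it reads V as a visual score and T as transition validity (both on a 0-100 scale), and the `Scores` class, the `visually_misleading` helper, and the 30-point threshold are hypothetical names and values chosen for this example, not part of WebRISE.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    """One (model, modality) result; field names are hypothetical."""
    model: str
    modality: str
    visual: float      # V, assumed to be a 0-100 visual score
    transition: float  # T, assumed to be 0-100 transition validity


def visually_misleading(s: Scores, min_gap: float = 30.0) -> bool:
    """Flag results whose visual score exceeds the transition score
    by at least min_gap points (the threshold is an arbitrary assumption)."""
    return s.visual - s.transition >= min_gap


# The data point quoted in the summary.
reported = Scores("Qwen3.6-35B-A3B", "Markdown", visual=80.8, transition=15.5)
gap = round(reported.visual - reported.transition, 1)

print(f"{reported.model} on {reported.modality}: gap = {gap} points")
print("visually misleading:", visually_misleading(reported))
```

A check like this is only a triage aid: it surfaces artifacts that look right but may not behave correctly, so they can be prioritized for functional testing.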
Method
WebRISE compiles task requirements into Interaction Contract Graphs (ICGs) of observable states, user-intent transitions, and DOM/visual assertions for implementation-agnostic browser execution.
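To make the idea of an ICG more tangible, here is a minimal sketch of what such a graph and a transition-validity score might look like. It rests on several assumptions: the summary does not define the data model or the exact metric, so `State`, `Transition`, `ICG`, and `transition_validity` are hypothetical names, transition validity is taken to be the fraction of transitions whose target-state assertions hold, and a plain Python function stands in for real browser execution.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical snapshot of what a browser probe would observe (DOM/visual facts).
Observation = Dict[str, object]
Assertion = Callable[[Observation], bool]


@dataclass
class State:
    name: str
    assertions: List[Assertion] = field(default_factory=list)


@dataclass
class Transition:
    source: str  # state the user starts in
    intent: str  # user intent, e.g. "open cart"
    target: str  # state the requirement says should result


@dataclass
class ICG:
    states: Dict[str, State]
    transitions: List[Transition]


def state_holds(state: State, obs: Observation) -> bool:
    """A state is reached only if all of its assertions pass."""
    return all(check(obs) for check in state.assertions)


def transition_validity(icg: ICG,
                        perform: Callable[[str, str], Observation]) -> float:
    """Assumed metric: share of transitions that land in their target state.
    perform(source, intent) returns the observation after acting on the artifact."""
    if not icg.transitions:
        return 0.0
    valid = sum(
        state_holds(icg.states[t.target], perform(t.source, t.intent))
        for t in icg.transitions
    )
    return valid / len(icg.transitions)


# Toy contract: a cart panel that must open and close.
icg = ICG(
    states={
        "cart_closed": State("cart_closed", [lambda o: o["cart_visible"] is False]),
        "cart_open": State("cart_open", [lambda o: o["cart_visible"] is True]),
    },
    transitions=[
        Transition("cart_closed", "open cart", "cart_open"),
        Transition("cart_open", "close cart", "cart_closed"),
    ],
)


def buggy_artifact(source: str, intent: str) -> Observation:
    """Stand-in for a generated page: the cart opens but never closes."""
    if intent == "open cart":
        return {"cart_visible": True}
    return {"cart_visible": source == "cart_open"}


score = transition_validity(icg, buggy_artifact)
print(f"transition validity: {score:.2f}")
```

Because the assertions refer only to observable facts rather than to specific element IDs or frameworks, the same contract can be checked against any implementation, which is the sense in which the summary calls the evaluation implementation-agnostic. In a real harness, `perform` would drive a browser instead of returning a hand-written dictionary.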
In practice
- Adopt ICGs for comprehensive MLLM web artifact testing.
- Prioritize functional state evaluation over visual checks.
- Leverage video input for enhanced MLLM task specification.
Topics
- WebRISE
- MLLM Evaluation
- Web Artifact Generation
- Interaction Contract Graphs
- Multimodal LLMs
- Functional Testing
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.