ROSE: Benchmarking the Perception-to-Action Gap in Multimodal Models

2026-06-18 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

The ROSE (Reference-conditioned Oddity and Symbolic Execution) benchmark evaluates the "perception-to-action gap" in multimodal large language models (MLLMs). It is a controlled benchmark that measures how reliably MLLMs translate visual evidence into context-specific actions when the visual scene stays constant but the region constraints and required symbolic outputs vary. ROSE uses coupled counting and coordinate-action tasks to test whether models can infer implicit majority references and act on fine-grained visual details under changing contexts. Across nine recent MLLMs, performance drops by as much as 44.5 percentage points when moving from counting-oriented tasks to region-conditioned action, against human performance of 98.8%. The gap persists even on scenes where models count correctly, which points to a distinct, model-dependent bottleneck, beyond coordinate grounding alone, in converting shared visual evidence into context-specific actions.

Key takeaway

Machine learning engineers developing multimodal models should prioritize the "perception-to-action gap" identified by the ROSE benchmark. The evaluated models lose up to 44.5 percentage points when moving from counting to region-conditioned action, even on scenes where they count correctly. Focus development on contextual inference and symbolic execution from shared visual evidence, rather than solely on visual perception accuracy, to improve real-world applicability.

Key insights

MLLMs struggle to translate visual perception into context-specific actions, revealing a significant "perception-to-action gap."

Principles

Contextual action from fixed visual input is a key MLLM challenge.
Coordinate grounding alone does not explain MLLM action failures.
Isolating this gap requires varying constraints over fixed visual scenes.

Method

The ROSE benchmark fixes visual scenes while varying region constraints and symbolic outputs. It uses coupled counting and coordinate-action tasks to test contextual inference.
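
The fixed-scene, varied-constraint idea can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the character-grid scene, the `Region` class, and the function names are hypothetical and are not taken from the benchmark's actual data format or code. The scene stays the same while the region changes, and the same visual evidence feeds both a counting answer and a coordinate answer.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical inclusive rectangular region constraint (row/col bounds)."""
    row_min: int
    row_max: int
    col_min: int
    col_max: int

    def contains(self, row: int, col: int) -> bool:
        return (self.row_min <= row <= self.row_max
                and self.col_min <= col <= self.col_max)


def majority_symbol(grid):
    """Infer the implicit reference: the most frequent symbol in the scene."""
    counts = Counter(cell for row in grid for cell in row)
    return counts.most_common(1)[0][0]


def oddities_in_region(grid, region):
    """Coordinates of cells that differ from the majority and lie in the region."""
    reference = majority_symbol(grid)
    return [
        (r, c)
        for r, row in enumerate(grid)
        for c, cell in enumerate(row)
        if cell != reference and region.contains(r, c)
    ]


def counting_answer(grid, region):
    """Counting task: how many oddities fall inside the region."""
    return len(oddities_in_region(grid, region))


def action_answer(grid, region):
    """Coordinate-action task: which cells to act on inside the region."""
    return oddities_in_region(grid, region)


# One fixed toy scene, two different region constraints (both made up).
SCENE = ["AAAA", "ABAA", "AAAB"]
LEFT = Region(0, 2, 0, 1)
RIGHT = Region(0, 2, 2, 3)
```

With this toy scene, both regions have the same count (one oddity each), but the required action differs: `(1, 1)` for `LEFT` and `(2, 3)` for `RIGHT`. A model could therefore get the count right and still act on the wrong cell.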

In practice

Evaluate MLLMs on context-dependent action tasks.
Design benchmarks with fixed visual inputs and varied constraints.
Focus MLLM development on contextual action inference.
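
One way to apply the points above when evaluating your own model is to score counting and action on the same scenes and report both the raw gap and the action accuracy restricted to scenes where counting was correct. The sketch below is an assumption about how such a report could be computed, not the benchmark's published scoring code; the function name and the `(count_correct, action_correct)` record format are hypothetical, and the example records are made-up toy values.

```python
def perception_to_action_gap(results):
    """Summarize paired per-scene outcomes.

    `results` is a list of (count_correct, action_correct) booleans,
    one pair per scene (hypothetical record format).
    """
    if not results:
        raise ValueError("results must contain at least one scene")

    n = len(results)
    count_acc = sum(1 for count_ok, _ in results if count_ok) / n
    action_acc = sum(1 for _, action_ok in results if action_ok) / n

    # Action outcomes restricted to scenes where counting was correct.
    action_given_count = [action_ok for count_ok, action_ok in results if count_ok]
    conditional_acc = (
        sum(action_given_count) / len(action_given_count)
        if action_given_count
        else None  # undefined if the model never counted correctly
    )

    return {
        "counting_accuracy": count_acc,
        "action_accuracy": action_acc,
        "gap_percentage_points": 100.0 * (count_acc - action_acc),
        "action_accuracy_given_correct_count": conditional_acc,
    }


# Toy records, invented for illustration only.
toy_results = [(True, True), (True, False), (True, False), (False, False)]
report = perception_to_action_gap(toy_results)
```

Reporting the conditional accuracy alongside the raw gap separates perception failures from failures to turn correct perception into the right action.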

Topics

Multimodal Large Language Models
MLLM Benchmarking
Perception-to-Action Gap
Contextual Inference
Symbolic Execution
Computer Vision

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.