Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs

2026-05-13 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

A new benchmark, IMAVB, comprising 500 movie clips, evaluates whether omnimodal large language models can detect textual claims that contradict their visual or auditory input. The benchmark uses a 2x2 design, crossing target modality (vision, audio) with premise condition (standard, misleading), to isolate conflict detection from general multimodal comprehension. Evaluation of eight open-source omnimodal LLMs and Gemini 3.1 Pro revealed a "Representation-Action Gap": the models' internal states encode mismatches between premise and perception, but their outputs rarely reject the false claims. Models show both under-rejection (answering misleading questions as if their premises were true) and over-rejection (rejecting standard questions), and grounding is weaker for audio than for vision. The gap persists across seven prompt variations, which suggests a bottleneck in translating internal representations into output rather than a failure of perception. An intervention called probe-guided logit adjustment (PGLA) improved rejection behavior.

Key takeaway

For AI engineers developing omnimodal LLMs, the Representation-Action Gap is a practical reliability risk: a model may internally detect that a textual claim contradicts what it sees or hears, yet still answer as if the claim were true. Focus development effort on the translation from internal representations to generated output, potentially with techniques such as probe-guided logit adjustment, to improve grounding and prevent misleading responses.

Key insights

Omnimodal LLMs struggle to reject textual claims contradicting their sensory input, despite internal encoding of mismatches.

Principles

Grounding failures are often failures of translation from representation to output, not of perception.
Audio grounding is more challenging than vision grounding.

Method

The IMAVB benchmark uses a 2x2 design (modality x premise condition) with long-form movie clips to measure conflict detection in omnimodal LLMs separately from general multimodal comprehension.
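
The 2x2 design can be made concrete with a small tabulation. The following is a minimal sketch, not the benchmark's own scoring code: the record layout (modality, premise, rejected), the function names, and the toy records are all assumptions made for illustration. It reads under-rejection as the share of misleading questions the model did not reject, and over-rejection as the share of standard questions it did reject.

```python
from collections import defaultdict

# Toy records invented for illustration only; these are NOT IMAVB results.
# Each record: (target modality, premise condition, did the model reject the premise?)
TOY_RECORDS = [
    ("vision", "misleading", True),
    ("vision", "misleading", False),
    ("vision", "misleading", False),
    ("vision", "misleading", False),
    ("vision", "standard", True),
    ("vision", "standard", False),
    ("vision", "standard", False),
    ("vision", "standard", False),
    ("audio", "misleading", False),
    ("audio", "misleading", False),
    ("audio", "standard", True),
    ("audio", "standard", False),
]


def rejection_rates(records):
    """Return {(modality, premise): fraction of responses that rejected the premise}."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [rejections, total]
    for modality, premise, rejected in records:
        cell = counts[(modality, premise)]
        cell[0] += int(rejected)
        cell[1] += 1
    return {cell: rej / total for cell, (rej, total) in counts.items()}


def error_rates(records):
    """Map each 2x2 cell to its error type (hypothetical naming).

    misleading premise -> under-rejection = 1 - rejection rate
    standard premise   -> over-rejection  = rejection rate
    """
    errors = {}
    for (modality, premise), rate in rejection_rates(records).items():
        if premise == "misleading":
            errors[(modality, "under_rejection")] = 1.0 - rate
        else:
            errors[(modality, "over_rejection")] = rate
    return errors


if __name__ == "__main__":
    for cell, value in sorted(error_rates(TOY_RECORDS).items()):
        print(cell, value)
```

Reporting the two error types per modality keeps a model that rejects everything from looking well grounded: it would score zero under-rejection but maximal over-rejection.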

In practice

Use IMAVB to benchmark omnimodal LLM grounding.
Implement PGLA for improved rejection behavior.
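
This summary does not describe how PGLA works internally, so the following is only one plausible reading of the name, sketched under stated assumptions: a linear probe reads a hidden-state vector and estimates the probability that the premise conflicts with the perceptual input, and the log-odds of that estimate are added to the logits of tokens that begin a rejection. The function names, the `alpha` scale, and the choice of rejection token ids are hypothetical and should not be taken as the authors' method.

```python
import math


def probe_probability(hidden, weights, bias):
    """Linear probe (assumed form): sigmoid(w . h + b).

    Interpreted here as P(premise conflicts with what the model perceived).
    """
    z = sum(w * h for w, h in zip(weights, hidden)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def adjust_logits(logits, reject_ids, p_conflict, alpha=1.0, eps=1e-6):
    """Shift rejection-token logits by alpha * log-odds of the probe estimate.

    p_conflict > 0.5 raises the rejection logits (counters under-rejection);
    p_conflict < 0.5 lowers them (counters over-rejection);
    p_conflict = 0.5 leaves the logits unchanged.
    """
    p = min(max(p_conflict, eps), 1.0 - eps)  # keep the log-odds finite
    shift = alpha * math.log(p / (1.0 - p))
    adjusted = list(logits)  # do not mutate the caller's list
    for token_id in reject_ids:
        adjusted[token_id] += shift
    return adjusted


if __name__ == "__main__":
    # Hypothetical 3-token vocabulary where token 2 starts a rejection.
    p = probe_probability([0.5, -1.0], [2.0, -1.0], 0.0)
    print(adjust_logits([1.0, 2.0, 3.0], [2], p))
```

A symmetric log-odds shift is chosen here because the summary reports both under-rejection and over-rejection; a one-sided boost would address only the first.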

Topics

Omnimodal LLMs
Representation-Action Gap
IMAVB Benchmark
Multimodal Grounding
Conflict Detection

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.