Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs
Summary
IMAVB is a new benchmark of 500 movie clips that evaluates whether omnimodal large language models can detect textual claims that contradict their visual or auditory input. It uses a 2x2 design, crossing target modality (vision, audio) with premise condition (standard, misleading), to isolate conflict detection from general multimodal comprehension. Across eight open-source omnimodal LLMs and Gemini 3.1 Pro, the authors identify a "Representation-Action Gap": the models' internal states encode the mismatch between premise and perception, but their outputs rarely reject the false claim. Models both under-reject (answering misleading questions as if their premises were true) and over-reject (rejecting standard questions), and audio grounding is worse than vision grounding. The gap persists across seven prompt variations, which suggests the bottleneck lies in translating internal representations into output rather than in perception. A probe-guided logit adjustment (PGLA) intervention improved rejection behavior.
Key takeaway
AI engineers building omnimodal LLMs should account for the Representation-Action Gap: a model may internally detect a contradiction yet fail to act on it, which makes its output unreliable. Focus development effort on the step that turns internal representations into generated output, for example with techniques such as probe-guided logit adjustment, to improve grounding and reduce misleading responses.
Key insights
Omnimodal LLMs struggle to reject textual claims contradicting their sensory input, despite internal encoding of mismatches.
Principles
- Grounding failures are often translation, not perception.
- Audio grounding is more challenging than vision grounding.
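One way to make the first principle concrete is to compare, on misleading-premise questions only, how often a probe on the model's internal state flags the mismatch with how often the model's output actually rejects the premise. The sketch below is illustrative and is not the paper's metric; the field names `probe_detects` and `output_rejects` and the function `representation_action_gap` are hypothetical.

```python
def representation_action_gap(items):
    """Compare internal detection with behavioral rejection.

    items: list of dicts for misleading-premise questions, each with
      'probe_detects'  (bool): a probe on hidden states flags the mismatch
      'output_rejects' (bool): the generated answer rejects the false premise
    Both field names are assumptions made for this sketch.
    Returns (detection_rate, rejection_rate, gap).
    """
    n = len(items)
    if n == 0:
        raise ValueError("need at least one item")
    detection_rate = sum(bool(i["probe_detects"]) for i in items) / n
    rejection_rate = sum(bool(i["output_rejects"]) for i in items) / n
    # A large positive gap means the mismatch is encoded but not acted on.
    return detection_rate, rejection_rate, detection_rate - rejection_rate
```

A gap near zero with a low detection rate would point to a perception failure instead, which is the distinction the principle draws.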
Method
The IMAVB benchmark uses a 2x2 design (modality x premise condition) with long-form movie clips to measure conflict detection in omnimodal LLMs separately from general multimodal comprehension.
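The 2x2 design can be scored per cell. The following is a minimal sketch, assuming each evaluated question is logged with its modality, its premise condition, and whether the model rejected the premise. The record layout and the function `score_2x2` are hypothetical and are not part of IMAVB's released tooling.

```python
from collections import defaultdict

def score_2x2(records):
    """Per-modality under- and over-rejection rates.

    records: iterable of dicts with keys (names are assumptions):
      'modality': 'vision' or 'audio'
      'premise':  'standard' or 'misleading'
      'rejected': bool, True if the model rejected the question's premise
    """
    counts = defaultdict(lambda: [0, 0])  # (modality, premise) -> [rejections, total]
    for r in records:
        cell = counts[(r["modality"], r["premise"])]
        cell[0] += int(r["rejected"])
        cell[1] += 1

    report = {}
    for (modality, premise), (rejections, total) in counts.items():
        rate = rejections / total
        if premise == "misleading":
            # Under-rejection: a misleading premise was answered as if true.
            report[(modality, "under_rejection")] = 1.0 - rate
        else:
            # Over-rejection: an accurate premise was rejected.
            report[(modality, "over_rejection")] = rate
    return report
```

Reporting the two error types separately matters because a model that rejects everything would score perfectly on misleading questions while failing every standard one.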
In practice
- Use IMAVB to benchmark omnimodal LLM grounding.
- Implement PGLA for improved rejection behavior.
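The summary does not describe how PGLA works internally, so the following is only one plausible reading of "probe-guided logit adjustment", not the authors' implementation. It assumes a linear probe over a hidden-state vector estimates the probability of a premise-perception mismatch, and that this probability shifts the logits of tokens that begin a rejection. The names `probe_probability` and `adjust_logits`, and the parameters `alpha` and `threshold`, are hypothetical.

```python
import math

def probe_probability(hidden, weights, bias=0.0):
    """Logistic probe: estimated probability that the premise conflicts with the input."""
    z = sum(h * w for h, w in zip(hidden, weights)) + bias
    # Numerically stable sigmoid.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def adjust_logits(logits, reject_ids, p_mismatch, alpha=4.0, threshold=0.5):
    """Shift the logits of rejection tokens by alpha * (p_mismatch - threshold).

    The shift is positive when the probe is confident of a mismatch and
    negative when it is confident there is none (an assumed design choice).
    Returns a new list; the input logits are left untouched.
    """
    shift = alpha * (p_mismatch - threshold)
    adjusted = list(logits)
    for i in reject_ids:
        adjusted[i] += shift
    return adjusted
```

In this sketch a symmetric shift is used so that the same mechanism can discourage rejection on standard questions, addressing over-rejection as well as under-rejection. Whether the published method does this is not stated in the summary.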
Topics
- Omnimodal LLMs
- Representation-Action Gap
- IMAVB Benchmark
- Multimodal Grounding
- Conflict Detection
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.