Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs

Source: Artificial Intelligence · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Expert, quick

Summary

IMAVB, a new benchmark of 500 movie clips, tests whether omnimodal large language models can detect textual claims that contradict what they see or hear. It uses a 2x2 design that crosses target modality (vision, audio) with premise condition (standard, misleading), which isolates conflict detection from general multimodal comprehension. Across eight open-source omnimodal LLMs and Gemini 3.1 Pro, the study identifies a "Representation-Action Gap": the models' internal states encode the mismatch between premise and perception, yet their outputs rarely reject the false claim. Models both under-reject (answering misleading questions as if the premise were true) and over-reject (rejecting standard questions), and audio grounding is weaker than visual grounding. The gap persists across seven prompt variations, which points to a bottleneck in translating representations into actions rather than a perception failure. A probe-guided logit adjustment (PGLA) intervention improved rejection behavior.
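The summary does not describe how PGLA is implemented, so the following is only one plausible reading of the name, sketched under stated assumptions: a linear probe reads a hidden-state vector and estimates the probability that the premise conflicts with perception, and the log-odds of that estimate are added to the logit of a rejection token. The function names, the `alpha` scaling factor, and the single `reject_id` token are all hypothetical and are not taken from the original article.

```python
import math

def probe_mismatch_prob(hidden, weights, bias):
    """Hypothetical linear probe: logistic regression over a hidden-state vector.

    Returns the estimated probability that the textual premise conflicts
    with what the model perceived.
    """
    z = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def adjust_logits(logits, reject_id, p_mismatch, alpha=2.0, eps=1e-6):
    """Shift the rejection-token logit by alpha * log-odds of the probe estimate.

    A confident mismatch (p > 0.5) raises the rejection logit, a confident
    match (p < 0.5) lowers it, and p = 0.5 leaves the logits unchanged.
    alpha and reject_id are illustrative assumptions, not values from the source.
    """
    p = min(max(p_mismatch, eps), 1.0 - eps)  # clamp to keep the log finite
    adjusted = list(logits)
    adjusted[reject_id] += alpha * math.log(p / (1.0 - p))
    return adjusted
```

In a real model the hidden state would come from an intermediate layer and the adjustment would be applied at decoding time. Both choices are left open here because the summary does not specify them.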

Key takeaway

If you build omnimodal LLMs, account for the Representation-Action Gap: a model may internally detect a contradiction yet fail to act on it, which makes its outputs unreliable. Focus development effort on the step that translates internal representations into generated output, for example with techniques like probe-guided logit adjustment, to improve grounding and reduce responses that accept false premises.

Key insights

Omnimodal LLMs struggle to reject textual claims contradicting their sensory input, despite internal encoding of mismatches.

Principles

Method

The IMAVB benchmark uses a 2x2 design (modality x premise condition) with long-form movie clips to measure conflict detection in omnimodal LLMs, distinguishing it from general comprehension.
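The 2x2 design can be read as four evaluation cells, each scored by how often the model rejects the premise. The sketch below tallies the two error types named in the summary per modality. The record fields (`modality`, `premise`, `rejected`) and the function name are assumptions chosen for illustration, not the benchmark's actual data format.

```python
from collections import defaultdict

def score_2x2(records):
    """Tally rejection errors per (modality, premise) cell.

    records: iterable of dicts with hypothetical keys
      'modality' ('vision' or 'audio'),
      'premise'  ('standard' or 'misleading'),
      'rejected' (bool: did the model reject the question's premise?).
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [rejections, total]
    for r in records:
        cell = (r["modality"], r["premise"])
        counts[cell][0] += int(r["rejected"])
        counts[cell][1] += 1

    report = {}
    for (modality, premise), (rejected, total) in counts.items():
        rate = rejected / total
        if premise == "misleading":
            # Under-rejection: a false premise was answered as if true.
            report[(modality, "under_rejection")] = 1.0 - rate
        else:
            # Over-rejection: a valid premise was rejected.
            report[(modality, "over_rejection")] = rate
    return report
```

Keeping the standard cells alongside the misleading ones is what separates conflict detection from blanket refusal: a model that rejects everything would show zero under-rejection but a high over-rejection rate.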

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.