Seeing Is Not Sharing: Some Vision-Language Models Overestimate Common Ground in Asymmetric Dialogue
Summary
A study of vision-language models (VLMs) finds that they struggle to distinguish potential common ground from established common ground in asymmetric dialogue. Evaluating models on 13,077 annotated reference expressions from HCRC MapTask dialogues, the researchers found that providing authentic map images improved overall performance but introduced a bias toward over-predicting alignment. The same bias appeared with textual map descriptions, which indicates that it stems from task-relevant content rather than from the visual channel itself. Qwen3-VL-8B-Instruct and four other models from two architecture families showed degraded accuracy on non-aligned cases, relying on static referential cues rather than tracking dialogue history. In effect, the models conflate shared perception with mutual understanding.
Key takeaway
NLP engineers building vision-language models for collaborative dialogue should be aware that current VLMs, such as Qwen3-VL-8B-Instruct, tend to over-predict mutual understanding whenever visual or textual content is shared. Design training and evaluation to explicitly distinguish potential common ground from understanding established through interaction, rather than relying solely on static referential cues, to prevent miscommunication in real-world applications.
Key insights
Vision-language models often conflate potential common ground with established mutual understanding in dialogue.
Principles
- Shared perception does not guarantee shared interpretation.
- Mutual understanding requires interaction, not just shared data.
- VLMs can be misled by static referential cues.
Method
An interpretation-matching task on 13,077 annotated reference expressions from HCRC MapTask dialogues, with dialogue context and map-information access systematically manipulated.
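As a rough illustration of what manipulating both factors means, the sketch below enumerates a full grid of evaluation conditions. The level names are assumptions made for this example: the summary mentions map images and textual map descriptions, but it does not list the study's exact conditions, and `Condition` and `build_conditions` are hypothetical helpers.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical condition levels, chosen for illustration only.
# The summary does not enumerate the study's actual design.
CONTEXT_LEVELS = ("no_history", "full_history")
MAP_ACCESS_LEVELS = ("none", "text_description", "image")


@dataclass(frozen=True)
class Condition:
    """One cell of the (dialogue context x map access) grid."""
    dialogue_context: str
    map_access: str


def build_conditions():
    """Cross every dialogue-context level with every map-access level."""
    return [
        Condition(context, access)
        for context, access in product(CONTEXT_LEVELS, MAP_ACCESS_LEVELS)
    ]


conditions = build_conditions()
```

Crossing the two factors, rather than varying one at a time, is what makes it possible to attribute a bias to map content as opposed to dialogue history.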
In practice
- Test VLM common ground with asymmetric dialogue tasks.
- Design VLM training to distinguish static cues from dynamic grounding.
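One way to make an over-prediction bias visible in evaluation is to report accuracy separately for aligned and non-aligned cases, alongside how often the model predicts alignment. The sketch below is a minimal example under the assumption that each item has a binary gold label and a binary prediction (`True` meaning "aligned"); `grounding_report` is a hypothetical helper, not the study's evaluation code.

```python
def grounding_report(gold, pred):
    """Summarise binary alignment predictions.

    gold, pred: equal-length sequences of bools, True meaning "aligned".
    Assumed label format for illustration only.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    def rate(hits, total):
        # Return NaN rather than dividing by zero for an empty class.
        return hits / total if total else float("nan")

    pairs = list(zip(gold, pred))
    aligned_preds = [p for g, p in pairs if g]
    non_aligned_preds = [p for g, p in pairs if not g]

    return {
        "accuracy": rate(sum(g == p for g, p in pairs), len(pairs)),
        "aligned_accuracy": rate(sum(aligned_preds), len(aligned_preds)),
        "non_aligned_accuracy": rate(
            sum(not p for p in non_aligned_preds), len(non_aligned_preds)
        ),
        # A predicted rate well above the gold rate suggests over-prediction.
        "predicted_aligned_rate": rate(sum(pred), len(pred)),
        "gold_aligned_rate": rate(sum(gold), len(gold)),
    }


report = grounding_report(
    gold=[True, True, False, False],
    pred=[True, True, True, False],
)
```

In this toy input, overall accuracy looks acceptable while accuracy on non-aligned cases is only half, which is the kind of gap a single aggregate score would hide.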
Topics
- Vision-Language Models
- Dialogue Systems
- Common Ground
- Mutual Understanding
- HCRC MapTask
- Qwen3-VL-8B-Instruct
Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.