Seeing Is Not Sharing: Some Vision-Language Models Overestimate Common Ground in Asymmetric Dialogue

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing, Computer Vision · Depth: Expert, quick

Summary

A study of vision-language models (VLMs) finds that they struggle to distinguish potential common ground from established common ground in asymmetric dialogue. Evaluating 13,077 annotated reference expressions from HCRC MapTask dialogues, the researchers found that providing authentic map images improved overall VLM performance but introduced a bias towards over-predicting alignment. The same bias appeared with textual map descriptions, indicating that it stems from task-relevant content rather than from the visual channel itself. Qwen3-VL-8B-Instruct and four other models from two architecture families showed degraded accuracy on non-aligned cases, relying on static referential cues rather than tracking dialogue history. In effect, the models conflate shared perception with mutual understanding.

Key takeaway

NLP engineers building vision-language models for collaborative dialogue should be aware that current VLMs, such as Qwen3-VL-8B-Instruct, tend to over-predict mutual understanding when visual or textual content is shared. Design training and evaluation explicitly to separate potential common ground from understanding that has actually been established through interaction, rather than relying solely on static referential cues, to prevent miscommunication in real-world applications.

Key insights

Vision-language models often conflate potential common ground with established mutual understanding in dialogue.

Method

An interpretation-matching task on 13,077 annotated reference expressions from HCRC MapTask dialogues, with dialogue context and map-information access systematically manipulated.
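The bias described in the summary is easy to miss if only overall accuracy is reported. The sketch below is one possible way to score such a task, not the study's own evaluation code: it splits accuracy by gold label and compares how often the model predicts alignment with how often alignment actually holds. The names `Example` and `alignment_report`, the condition labels, and all of the toy predictions are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    gold_aligned: bool  # annotation: both speakers' interpretations match
    pred_aligned: bool  # model prediction for the same reference expression


def alignment_report(examples):
    """Accuracy split by gold class, plus the gap between predicted and gold alignment rates."""
    aligned = [e for e in examples if e.gold_aligned]
    non_aligned = [e for e in examples if not e.gold_aligned]

    def accuracy(subset):
        if not subset:
            return float("nan")
        return sum(e.pred_aligned == e.gold_aligned for e in subset) / len(subset)

    pred_rate = sum(e.pred_aligned for e in examples) / len(examples)
    gold_rate = len(aligned) / len(examples)
    return {
        "overall_acc": accuracy(examples),
        "aligned_acc": accuracy(aligned),
        "non_aligned_acc": accuracy(non_aligned),
        # Positive values mean the model says "aligned" more often than the annotations do.
        "over_prediction": pred_rate - gold_rate,
    }


# Invented toy data (NOT results from the study): 4 aligned and 4 non-aligned cases.
gold = [True] * 4 + [False] * 4
toy_predictions = {
    "dialogue_only": [True, True, False, False, False, False, False, True],
    "dialogue_plus_map": [True, True, True, True, True, True, False, False],
}

reports = {
    condition: alignment_report([Example(g, p) for g, p in zip(gold, preds)])
    for condition, preds in toy_predictions.items()
}

for condition, report in reports.items():
    print(condition, report)
```

In this invented example, adding map access raises overall accuracy from 0.625 to 0.75 while accuracy on non-aligned cases falls from 0.75 to 0.5, which is the kind of pattern a single aggregate score would hide.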

Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.