LVLMs and Humans Ground Differently in Referential Communication
Summary
A referential communication experiment investigated how Large Vision Language Models (LVLMs) such as OpenAI's GPT-5.2 compare to humans in establishing common ground during multi-turn interactions. The study used a factorial design with four types of director-matcher pairs (human-human, human-AI, AI-human, and AI-AI) and analyzed a corpus of 356 dialogues over four rounds. Human-human pairs consistently improved in accuracy (from 80% to over 90%) and in efficiency, using fewer words and turns in later rounds. In contrast, AI-AI pairs started with high accuracy (90%) but then declined, with no efficiency gains and no sign of common ground formation. Mixed pairs also struggled: human-AI pairs had the lowest initial accuracy, and AI-human pairs declined precipitously. The findings highlight LVLMs' inability to adapt their communication or track common ground.
Key takeaway
AI scientists developing collaborative agents should recognize that current LVLMs such as GPT-5.2 struggle to establish common ground and to adapt their communication over multiple turns. This deficit reduces accuracy and efficiency in human-AI interactions, particularly when the AI takes the initiative. Prioritize research into models that learn from dialogue history and form conceptual pacts, in order to prevent user frustration and task failures in real-world applications.
Key insights
LVLMs fail to build common ground and adapt communication in multi-turn referential tasks, unlike humans.
Principles
- Human communication relies on incremental grounding and conceptual pacts for efficiency.
- AI models do not adapt communication strategies based on dialogue history.
- Pragmatic prompting alone does not induce human-like grounding in LVLMs.
Method
A referential communication task in which director-matcher pairs (each role filled by a human or an AI) identified non-lexicalized objects over four rounds. The study measured accuracy, effort (words and turns), and lexical entrainment.
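To make the measures named above concrete, here is a minimal sketch of how per-round accuracy, effort, and a crude entrainment proxy could be computed from a dialogue log. It is not the study's analysis code: the `Round` record, its field names, and the vocabulary-reuse proxy for lexical entrainment are all assumptions made for illustration.

```python
from dataclasses import dataclass

PUNCT = ".,!?;:\"'()"


@dataclass
class Round:
    """One round of a hypothetical dialogue log (field names are assumptions)."""
    director_utterances: list[str]  # everything the director said this round
    turns: int                      # total turns taken by both speakers
    correct: int                    # objects the matcher identified correctly
    total: int                      # objects presented in the round


def tokenize(text: str) -> list[str]:
    # Lowercased whitespace tokens with surrounding punctuation stripped.
    stripped = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in stripped if w]


def round_metrics(rounds: list[Round]) -> list[dict]:
    """Per-round accuracy, effort (words, turns), and vocabulary reuse.

    `reuse` is the share of this round's director word types that the
    director already used in the previous round. It is only a rough
    stand-in for lexical entrainment, not the paper's measure.
    """
    results = []
    prev_vocab: set[str] = set()
    for i, r in enumerate(rounds):
        tokens = [t for u in r.director_utterances for t in tokenize(u)]
        vocab = set(tokens)
        reuse = len(vocab & prev_vocab) / len(vocab) if i > 0 and vocab else None
        results.append({
            "accuracy": r.correct / r.total,
            "words": len(tokens),
            "turns": r.turns,
            "reuse": reuse,
        })
        prev_vocab = vocab
    return results
```

Under this proxy, a pair that converges on a short shared label would show falling `words` and `turns` alongside high `reuse`, which is the pattern the summary attributes to human-human pairs.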
In practice
- Avoid deploying LVLMs in high-initiative, collaborative human-facing roles.
- Evaluate AI agents for multi-turn grounding capabilities beyond initial accuracy.
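One simple way to look beyond initial accuracy is to classify how accuracy changes from the first round to the last. The sketch below is an assumption-laden illustration: the function name `accuracy_trend` and the `tolerance` threshold are hypothetical, and the example values are made up rather than taken from the study.

```python
def accuracy_trend(per_round_accuracy: list[float], tolerance: float = 0.02) -> str:
    """Classify first-to-last change as 'improving', 'flat', or 'declining'.

    `tolerance` is an arbitrary illustrative threshold, not a value from the study.
    """
    if len(per_round_accuracy) < 2:
        raise ValueError("need at least two rounds to assess a trend")
    delta = per_round_accuracy[-1] - per_round_accuracy[0]
    if delta > tolerance:
        return "improving"
    if delta < -tolerance:
        return "declining"
    return "flat"


# Illustrative, made-up accuracy curves over four rounds.
print(accuracy_trend([0.80, 0.85, 0.90, 0.92]))
print(accuracy_trend([0.90, 0.88, 0.85, 0.83]))
```

An agent that starts high but is classified as declining would pass a single-round accuracy check while still failing at multi-turn grounding, which is the failure mode this recommendation targets.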
Topics
- Referential Communication
- Large Vision Language Models
- Common Ground
- Human-AI Interaction
- GPT-5.2
- Lexical Entrainment
Best for: AI Scientist, Research Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.