LVLMs and Humans Ground Differently in Referential Communication

2026-01-01 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Social Sciences & Behavioral Studies, Research Methodology & Innovation · Depth: Expert, extended

Summary

A referential communication experiment investigated how Large Vision Language Models (LVLMs) such as OpenAI's GPT-5.2 compare to humans in establishing common ground during multi-turn interactions. The study used a factorial design with four director-matcher pairings (human-human, human-AI, AI-human, and AI-AI) and analyzed a corpus of 356 dialogues over four rounds. Human-human pairs consistently improved in accuracy (from 80% to over 90%) and in efficiency, using fewer words and turns as the rounds progressed. In contrast, AI-AI pairs started with high accuracy (90%) but declined, showing neither efficiency gains nor signs of common ground formation. Mixed pairs also struggled: human-AI pairs (human director, AI matcher) had the lowest initial accuracy, and AI-human pairs declined precipitously. The findings suggest that the LVLMs tested neither adapt their communication nor track common ground.

Key takeaway

AI scientists developing collaborative agents should recognize that current LVLMs such as GPT-5.2 struggle to establish common ground and to adapt their communication over multiple turns. This deficit reduces accuracy and efficiency in human-AI interactions, particularly when the AI takes the initiative. Prioritize research into models that learn from dialogue history and form conceptual pacts, in order to prevent user frustration and task failures in real-world applications.

Key insights

LVLMs fail to build common ground and adapt communication in multi-turn referential tasks, unlike humans.

Principles

Human communication relies on incremental grounding and conceptual pacts for efficiency.
The LVLMs tested did not adapt their communication strategies based on dialogue history.
Pragmatic prompting alone does not induce human-like grounding in LVLMs.

Method

Director-matcher pairs, with a human or an AI in each role, completed a referential communication task over four rounds in which they identified non-lexicalized objects. The study measured accuracy, effort (words and turns), and lexical entrainment.
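
The three measures named above can be sketched in a few lines of Python. This is a minimal illustration, not the study's analysis code. The `Trial` record, the field names, and the toy descriptions are all hypothetical. Scoring lexical entrainment as the Jaccard overlap between a director's descriptions of the same object in consecutive rounds is an assumption; the summary does not say how the authors operationalized it.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Trial:
    # Hypothetical record layout; not the study's actual data format.
    pair_type: str        # director-matcher pairing, e.g. "human-human"
    round_no: int         # round index, 1..4
    target: str           # identifier of the object being described
    director_words: list  # tokens the director used for this target
    turns: int            # number of turns in the trial
    correct: bool         # whether the matcher chose the right object


def jaccard(a, b):
    """Token-set overlap; 0.0 when both collections are empty."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def round_metrics(trials):
    """Per-round accuracy and effort (mean words, mean turns)."""
    by_round = defaultdict(list)
    for t in trials:
        by_round[t.round_no].append(t)
    out = {}
    for r, ts in sorted(by_round.items()):
        n = len(ts)
        out[r] = {
            "accuracy": sum(t.correct for t in ts) / n,
            "mean_words": sum(len(t.director_words) for t in ts) / n,
            "mean_turns": sum(t.turns for t in ts) / n,
        }
    return out


def entrainment(trials):
    """Assumed measure: mean Jaccard overlap between descriptions of the
    same target in round r-1 and round r, keyed by r."""
    by_target = defaultdict(dict)
    for t in trials:
        by_target[t.target][t.round_no] = t.director_words
    scores = defaultdict(list)
    for rounds in by_target.values():
        for r in sorted(rounds):
            if r - 1 in rounds:
                scores[r].append(jaccard(rounds[r - 1], rounds[r]))
    return {r: sum(v) / len(v) for r, v in sorted(scores.items())}


# Invented toy data for one pair; these are not values from the study.
trials = [
    Trial("human-human", 1, "obj_a",
          "the one that looks like a dancing crab".split(), 4, False),
    Trial("human-human", 1, "obj_b",
          "a tall spiky tower thing".split(), 3, True),
    Trial("human-human", 2, "obj_a", "the dancing crab".split(), 2, True),
    Trial("human-human", 2, "obj_b", "spiky tower".split(), 1, True),
]

metrics = round_metrics(trials)
ent = entrainment(trials)
print(metrics)
print(ent)
```

In the toy data the descriptions shrink to a short reused label ("the dancing crab") in the second round, which is the pattern a conceptual pact would produce: fewer words and turns, with the surviving words carried over from the earlier description.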

In practice

Avoid deploying LVLMs in high-initiative, collaborative human-facing roles.
Evaluate AI agents for multi-turn grounding capabilities beyond initial accuracy.
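
One way to look beyond initial accuracy is to check how accuracy and effort trend across rounds. The sketch below fits a least-squares slope to each per-round series. The function names, the `shows_grounding` criterion (accuracy not declining while word count falls), and the input numbers are all assumptions made for illustration; none of them come from the study.

```python
def slope(values):
    """Least-squares slope of values against round index 1..n."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two rounds")
    xs = range(1, n + 1)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def grounding_report(accuracy_by_round, words_by_round):
    """Summarize per-round trends for one agent pairing."""
    acc_slope = slope(accuracy_by_round)
    word_slope = slope(words_by_round)
    return {
        "initial_accuracy": accuracy_by_round[0],
        "accuracy_slope": acc_slope,
        "words_slope": word_slope,
        # Assumed criterion: accuracy holds or rises while effort drops.
        "shows_grounding": acc_slope >= 0 and word_slope < 0,
    }


# Invented placeholder series, not measurements from the study.
improving = grounding_report([0.80, 0.85, 0.90, 0.95], [40, 30, 22, 18])
declining = grounding_report([0.90, 0.88, 0.85, 0.80], [35, 36, 35, 37])
print(improving)
print(declining)
```

The second placeholder series starts with higher accuracy than the first but fails the check, which is why a first-round score alone can be misleading when an agent is meant to collaborate over many turns.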

Topics

Referential Communication
Large Vision Language Models
Common Ground
Human-AI Interaction
GPT-5.2
Lexical Entrainment

Best for: AI Scientist, Research Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.