Using Machine Mental Imagery for Representing Common Ground in Situated Dialogue

Source: cs.CL updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Multimodal Dialogue Systems · Depth: Expert, extended

Summary

This research introduces a framework that lets conversational agents maintain common ground in situated dialogue through "machine mental imagery." Current agents struggle to retain context beyond the immediate context window, which leads to "representational blur": distinct entities merge into vague textual descriptions. Inspired by human cognition, the proposed system constructs and updates a persistent visual history, incrementally converting dialogue state into schematic visual artifacts. On the IndiRef benchmark, this "visual scaffolding" approach, implemented with Qwen-family models and Qwen-Image-Edit, outperforms both full-dialogue reasoning and textual externalization, particularly on Temporal, Attributive, and Inferred queries. Visual artifacts excel at preserving fine-grained perceptual distinctions and enforcing concrete scene commitments, whereas textual representations remain superior for non-depictable information and cross-frame topological relations. A hybrid "Agentic-Both" setting, which combines visual and textual artifacts, achieves the best overall performance, suggesting that the two modalities play complementary roles.

Key takeaway

Research scientists developing multimodal dialogue systems should consider externalizing common ground incrementally with a hybrid approach. Combining visual scaffolding for concrete perceptual distinctions with textual representations for abstract or non-depictable information can improve persistent context tracking and reduce "representational blur." The reported results suggest this makes conversational agents more reliable and better grounded, especially in complex, situated environments.
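One way to picture the hybrid idea is a memory that keeps depictable facts in a visual store and everything else in a textual store. The sketch below is an illustration only: the `HybridMemory` class, the `DEPICTABLE` set, and the routing rule are assumptions made for this example, not the paper's "Agentic-Both" implementation, and plain strings stand in for images.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical fact types that a schematic image could show directly.
# This list is an assumption for illustration, not taken from the paper.
DEPICTABLE = {"color", "shape", "position", "count"}


@dataclass
class HybridMemory:
    """Toy store with one list per modality (strings stand in for images)."""
    visual: list[str] = field(default_factory=list)
    textual: list[str] = field(default_factory=list)

    def add(self, fact_type: str, fact: str) -> str:
        """Route a fact by its type and return the name of the store used."""
        store = "visual" if fact_type in DEPICTABLE else "textual"
        getattr(self, store).append(fact)
        return store


if __name__ == "__main__":
    memory = HybridMemory()
    print(memory.add("color", "mug A is red"))
    print(memory.add("ownership", "mug A belongs to Sam"))
```

A real system would likely decide what goes where with a model rather than a fixed lookup; the fixed set here only makes the division of labor concrete.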

Key insights

Machine mental imagery, via visual scaffolding, enhances conversational agents' common ground by reducing representational blur.

Principles

Method

The framework uses an Observer to segment dialogue, a Constructor to generate schematic visual or textual artifacts, and a Linker to track cross-artifact relations. A Reasoner then queries this memory for grounded response generation.
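The four roles described above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the `[scene]` boundary marker, substring-based entity matching, and the use of text strings in place of generated images are all hypothetical. In the described system, the Constructor would call a model such as Qwen-Image-Edit and the Reasoner would be a language model.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Artifact:
    """One externalized snapshot of common ground (hypothetical structure)."""
    artifact_id: int
    kind: str                 # "visual" or "textual"
    content: str              # stand-in for an image or a text note
    entities: frozenset[str]  # known entities mentioned in the segment


@dataclass
class Memory:
    artifacts: list[Artifact] = field(default_factory=list)
    links: list[tuple[int, int, str]] = field(default_factory=list)


def observe(turns: list[str], boundary: str = "[scene]") -> list[list[str]]:
    """Observer: split the dialogue at an assumed boundary marker."""
    segments: list[list[str]] = [[]]
    for turn in turns:
        if turn.strip() == boundary:
            if segments[-1]:          # start a new segment only after content
                segments.append([])
        else:
            segments[-1].append(turn)
    return [s for s in segments if s]


def construct(segment: list[str], artifact_id: int,
              known_entities: set[str], kind: str = "textual") -> Artifact:
    """Constructor: build an artifact (naive substring entity matching)."""
    text = " ".join(segment).lower()
    mentioned = frozenset(e for e in known_entities if e in text)
    return Artifact(artifact_id, kind, " | ".join(segment), mentioned)


def link(memory: Memory, new: Artifact) -> None:
    """Linker: record which earlier artifacts share an entity with the new one."""
    for old in memory.artifacts:
        for entity in sorted(old.entities & new.entities):
            memory.links.append((old.artifact_id, new.artifact_id, entity))
    memory.artifacts.append(new)


def build_memory(turns: list[str], known_entities: set[str]) -> Memory:
    """Run Observer, Constructor, and Linker over the whole dialogue."""
    memory = Memory()
    for i, segment in enumerate(observe(turns)):
        link(memory, construct(segment, i, known_entities))
    return memory


def history_of(memory: Memory, entity: str) -> list[str]:
    """Reasoner stand-in: return, in order, every artifact mentioning an entity."""
    return [a.content for a in memory.artifacts if entity in a.entities]


if __name__ == "__main__":
    dialogue = [
        "A: the red mug is on the desk",
        "[scene]",
        "B: I moved the red mug to the shelf",
        "A: the blue mug is still on the desk",
    ]
    mem = build_memory(dialogue, {"red mug", "blue mug"})
    print(history_of(mem, "red mug"))
    print(mem.links)
```

Keeping the two mugs as separate entries across artifacts is the point of the exercise: the links let a later query follow one entity through time instead of relying on a single blended description.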

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.