GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations

2026-05-15 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

GroupMemBench is a benchmark for evaluating Large Language Model (LLM) agent memory in multi-party conversations, addressing a limitation of existing benchmarks, which focus on dyadic interactions. It targets three properties of group memory: group dynamics that go beyond one-on-one chats, speaker-grounded belief tracking for per-user memory, and audience-adapted language that reflects Theory-of-Mind shifts. A graph-grounded synthesis pipeline generates multi-party conversations with controlled reply structures, per-user personas, and target audiences. An adversarial query pipeline then produces challenging, realistic questions across six categories, including multi-hop reasoning and knowledge update. In initial benchmarking of leading memory systems, the strongest reached only 46.0% average accuracy, with knowledge update at 27.1% and term ambiguity at 37.7%. A simple BM25 baseline often matched or surpassed most agent memory systems, which suggests that current memory ingestion methods fail to preserve structural and lexical features that matter in multi-user contexts.

Key takeaway

AI engineers developing LLM agents for collaborative environments should prioritize memory systems that can handle complex multi-party interactions. Current systems struggle with group dynamics, speaker-grounded belief tracking, and audience-adapted language, and are often matched or beaten by a simple BM25 baseline. Focus on memory ingestion that preserves the structural and lexical features of multi-user conversations, and test with benchmarks like GroupMemBench to pinpoint weaknesses in areas such as knowledge update and term ambiguity.

Key insights

LLM agent memory systems perform poorly in multi-party conversations and are often matched or outperformed by a simple BM25 baseline.

Principles

Group dynamics require distinct memory modeling.
Speaker-grounded belief tracking is essential.
Audience-adapted language impacts memory recall.
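
To make the idea of speaker-grounded belief tracking concrete, here is a minimal sketch of a per-speaker memory store in Python. It is an illustration only, not the benchmark's or any evaluated system's implementation: the `SpeakerBeliefStore` class, the `Belief` record, and the topic keys are all hypothetical names chosen for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Belief:
    value: str
    turn: int  # index of the turn where the speaker last stated this


class SpeakerBeliefStore:
    """Hypothetical per-speaker memory: (speaker, topic) -> latest stated belief."""

    def __init__(self):
        self._beliefs = defaultdict(dict)

    def observe(self, turn, speaker, topic, value):
        # A later statement by the same speaker replaces their earlier one
        # (a knowledge update); other speakers' beliefs are left untouched.
        current = self._beliefs[speaker].get(topic)
        if current is None or turn >= current.turn:
            self._beliefs[speaker][topic] = Belief(value, turn)

    def belief_of(self, speaker, topic):
        # Returns None when the speaker never stated anything on the topic.
        belief = self._beliefs.get(speaker, {}).get(topic)
        return belief.value if belief else None


# Illustrative data (invented for this sketch).
store = SpeakerBeliefStore()
store.observe(1, "alice", "launch_date", "March 3")
store.observe(2, "bob", "launch_date", "March 10")
store.observe(3, "alice", "launch_date", "March 17")
```

Keying memory by speaker keeps Alice's update from overwriting Bob's differing belief, which a single shared fact store would conflate.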

Method

GroupMemBench uses a graph-grounded synthesis pipeline for multi-party conversations and an adversarial query pipeline across six categories to test LLM agent memory.
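
As a rough illustration of what a controlled reply structure could look like as data, the sketch below models a conversation as a reply graph in Python. The `Message` fields (including `audience`) and the `ReplyGraph` helper are assumptions made for this example; the summary does not describe the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    msg_id: str
    speaker: str
    text: str
    reply_to: Optional[str] = None  # parent message id; None for a thread root
    audience: tuple = ()            # intended recipients (hypothetical field)


class ReplyGraph:
    """Hypothetical container that indexes messages by id and walks reply edges."""

    def __init__(self, messages):
        self.by_id = {m.msg_id: m for m in messages}

    def thread_path(self, msg_id):
        """Return message ids from the thread root down to msg_id."""
        path, visited = [], set()
        current = self.by_id.get(msg_id)
        while current is not None and current.msg_id not in visited:
            visited.add(current.msg_id)  # guard against malformed cycles
            path.append(current.msg_id)
            current = self.by_id.get(current.reply_to) if current.reply_to else None
        return list(reversed(path))

    def participants(self, msg_id):
        """Speakers along the reply chain, in order of first appearance."""
        seen = []
        for mid in self.thread_path(msg_id):
            speaker = self.by_id[mid].speaker
            if speaker not in seen:
                seen.append(speaker)
        return seen


# Illustrative data (invented for this sketch).
graph = ReplyGraph([
    Message("m1", "alice", "Should we ship on Friday?", audience=("bob", "carol")),
    Message("m2", "bob", "Friday is too early.", reply_to="m1", audience=("alice",)),
    Message("m3", "carol", "Friday works for me.", reply_to="m1", audience=("alice",)),
    Message("m4", "alice", "Bob, what blocks you?", reply_to="m2", audience=("bob",)),
])
```

Here `graph.thread_path("m4")` follows the reply edges rather than chronological order, so Carol's sibling reply `m3` is excluded from the chain that leads to `m4`.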

In practice

Evaluate LLM agents with multi-party benchmarks.
Focus on knowledge update in group contexts.
Consider BM25 as a strong baseline.
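
For readers who want to try a BM25 baseline, the following is a minimal, dependency-free sketch of Okapi BM25 retrieval over conversation turns in Python. The parameter values (`k1=1.5`, `b=0.75`), the non-negative IDF variant, the tokenizer, and the choice to prefix each turn with its speaker name are assumptions for illustration, not the configuration used in the paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercased alphanumeric tokens; a deliberately simple tokenizer.
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.tokens = [tokenize(d) for d in docs]
        self.tf = [Counter(t) for t in self.tokens]
        self.n_docs = len(docs)
        self.avg_len = sum(len(t) for t in self.tokens) / max(self.n_docs, 1)
        # Document frequency: number of documents containing each term.
        self.df = Counter(term for t in self.tokens for term in set(t))

    def idf(self, term):
        n = self.df.get(term, 0)
        return math.log(1 + (self.n_docs - n + 0.5) / (n + 0.5))

    def score(self, query, i):
        total = 0.0
        doc_len = len(self.tokens[i])
        for term in tokenize(query):
            f = self.tf[i].get(term, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * doc_len / self.avg_len)
            total += self.idf(term) * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k=3):
        ranked = sorted(range(self.n_docs),
                        key=lambda i: self.score(query, i), reverse=True)
        return ranked[:k]


# Illustrative turns (invented); speaker names are kept as plain tokens.
turns = [
    "alice: the launch date is March 3",
    "bob: I booked the venue for the offsite",
    "alice: update, the launch date moved to March 17",
]
index = BM25(turns)
top = index.top_k("when is the launch date", k=2)
```

In this toy example both of Alice's launch-date turns are retrieved, including the outdated one, so deciding which statement is current is still left to whatever reads the retrieved turns.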

Topics

LLM Agent Memory
Multi-Party Conversations
GroupMemBench
Speaker-Grounded Belief Tracking
Audience-Adapted Language

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.