GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations

2026-05-14 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

GroupMemBench is a benchmark for evaluating Large Language Model (LLM) agent memory in multi-party conversations, a setting that existing dyadic (two-party) benchmarks do not cover. It measures three aspects of group memory: group dynamics, speaker-grounded belief tracking, and audience-adapted language. A graph-grounded synthesis pipeline generates multi-party conversations with controlled reply structures, per-user personas, and target audiences. An adversarial query pipeline then produces challenging, realistic questions across six categories, including multi-hop reasoning and knowledge update. Initial benchmarking of leading memory systems showed a significant performance drop: the strongest system reached only 46.0% average accuracy, and a simple BM25 baseline often outperformed more complex agent memory systems.

Key takeaway

For research scientists developing LLM agents for collaborative environments, GroupMemBench highlights a critical gap in current memory systems. You should focus on designing memory architectures that explicitly account for group dynamics, speaker-grounded beliefs, and audience adaptation, as existing solutions struggle significantly. Re-evaluate your approach to memory ingestion to prevent the loss of crucial structural and lexical features in multi-user contexts.

Key insights

Existing LLM agent memory systems struggle in multi-party conversations: the strongest system tested reaches only 46.0% average accuracy, and a simple BM25 baseline often outperforms more complex systems.

Principles

Group memory requires speaker-grounded belief tracking.
Audience-adapted language is crucial for multi-user contexts.
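
To make speaker-grounded belief tracking concrete, here is a minimal sketch of a store that keeps each speaker's claims separate rather than merging them into a single fact. It is an illustration only, not the benchmark's or any evaluated system's implementation; the names `Claim`, `SpeakerBeliefStore`, and the `topic` key are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Claim:
    turn: int      # position of the message in the conversation
    speaker: str   # who asserted the value
    topic: str     # hypothetical normalized key, e.g. "launch_date"
    value: str     # what the speaker asserted


class SpeakerBeliefStore:
    """Keeps each speaker's claims separate instead of merging them (illustrative)."""

    def __init__(self) -> None:
        # (speaker, topic) -> list of claims, in insertion order
        self._history = defaultdict(list)

    def add(self, claim: Claim) -> None:
        self._history[(claim.speaker, claim.topic)].append(claim)

    def current(self, speaker: str, topic: str):
        """Return the speaker's most recent value for a topic, or None."""
        claims = self._history.get((speaker, topic), [])
        return max(claims, key=lambda c: c.turn).value if claims else None

    def disagreements(self, topic: str) -> dict:
        """Return each speaker's latest value if the speakers do not all agree."""
        latest = {}
        for (speaker, t), claims in self._history.items():
            if t == topic:
                latest[speaker] = max(claims, key=lambda c: c.turn).value
        return latest if len(set(latest.values())) > 1 else {}


store = SpeakerBeliefStore()
store.add(Claim(1, "alice", "launch_date", "March 3"))
store.add(Claim(2, "bob", "launch_date", "March 10"))
store.add(Claim(5, "alice", "launch_date", "March 17"))  # alice updates her view
print(store.current("alice", "launch_date"))
print(store.disagreements("launch_date"))
```

Keying on the (speaker, topic) pair means a knowledge update from one participant does not overwrite what another participant still believes.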

Method

GroupMemBench uses a graph-grounded synthesis pipeline for multi-party conversations and an adversarial query pipeline to generate challenging, user-specific questions across six categories.
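
The summary does not describe the benchmark's data schema, so the following is only a hypothetical sketch of what a conversation with an explicit reply structure and target audiences might look like. The `Message` fields and the `thread_to_root` helper are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Message:
    msg_id: str
    speaker: str
    text: str
    reply_to: Optional[str] = None  # parent message id; None for a thread root
    audience: tuple = ()            # hypothetical: users the message is aimed at


def thread_to_root(messages: dict, msg_id: str) -> list:
    """Follow reply_to edges from a message back to its root, returned root-first."""
    chain = []
    seen = set()  # guards against accidental cycles in malformed data
    current = msg_id
    while current is not None and current not in seen:
        seen.add(current)
        msg = messages[current]
        chain.append(msg)
        current = msg.reply_to
    return list(reversed(chain))


msgs = {
    m.msg_id: m
    for m in [
        Message("m1", "alice", "Can we ship on Friday?", audience=("bob", "carol")),
        Message("m2", "bob", "Only if QA signs off.", reply_to="m1", audience=("alice",)),
        Message("m3", "carol", "I am out on Friday.", reply_to="m1", audience=("alice",)),
        Message("m4", "alice", "Then let's ask QA today.", reply_to="m2", audience=("bob",)),
    ]
}
print([m.msg_id for m in thread_to_root(msgs, "m4")])
```

A flat, time-ordered transcript would place m3 between m2 and m4; following the reply edges recovers the thread that m4 actually answers.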

In practice

Prioritize memory systems that retain structural and lexical features.
Consider BM25 as a strong baseline for multi-user memory.
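
As a starting point for such a baseline, the sketch below implements the standard Okapi BM25 scoring formula in plain Python over speaker-prefixed messages. The `k1` and `b` values are conventional defaults, and prefixing each message with its speaker name is an assumption made here to keep speaker identity searchable; neither is taken from the benchmark's setup.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    """Okapi BM25 over a small in-memory list of documents (illustrative)."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.tfs = [Counter(d) for d in self.docs]
        self.avgdl = sum(len(d) for d in self.docs) / max(len(self.docs), 1)
        n = len(self.docs)
        df = Counter(term for d in self.docs for term in set(d))
        # Smoothed idf that stays positive even for very common terms.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query: str, i: int) -> float:
        tf, dl = self.tfs[i], len(self.docs[i])
        total = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            norm = tf[term] + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * tf[term] * (self.k1 + 1) / norm
        return total

    def top_k(self, query: str, k: int = 3) -> list:
        """Return indices of the k highest-scoring documents."""
        order = sorted(range(len(self.docs)), key=lambda i: self.score(query, i), reverse=True)
        return order[:k]


messages = [
    "alice: the launch moves to march 17",
    "bob: lunch is at noon",
    "carol: i reviewed the budget draft",
]
index = BM25(messages)
best = index.top_k("when does alice say the launch happens", k=1)
print(messages[best[0]])
```

Because the raw text is indexed as-is, exact lexical cues such as names and dates remain available at query time, which is the kind of feature the advice above suggests preserving.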

Topics

LLM Agent Memory
Multi-Party Conversations
GroupMemBench
Memory Benchmarking
Speaker-Grounded Belief Tracking

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.