GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations
Summary
GroupMemBench is a new benchmark for evaluating Large Language Model (LLM) agent memory in multi-party conversations, addressing a limitation of existing benchmarks, which focus on dyadic interactions. It measures three properties of group memory: group dynamics beyond one-on-one chats, speaker-grounded belief tracking for per-user memory, and audience-adapted language reflecting Theory-of-Mind shifts. The benchmark uses a graph-grounded synthesis pipeline to generate multi-party conversations with controlled reply structures, incorporating per-user personas and target audiences. An adversarial query pipeline then creates challenging, realistic questions across six categories, including multi-hop reasoning and knowledge update. Initial benchmarking of leading memory systems revealed poor performance: the strongest system achieved only 46.0% average accuracy, and specific categories scored lower still, with 27.1% on knowledge update and 37.7% on term ambiguity. A simple BM25 baseline often matched or surpassed most agent memory systems, indicating that current memory ingestion methods fail to preserve crucial structural and lexical features in multi-user contexts.
Key takeaway
AI engineers developing LLM agents for collaborative environments should prioritize memory systems that can handle complex multi-party interactions. Current systems struggle significantly with group dynamics, speaker-grounded belief tracking, and audience-adapted language, and often perform no better than a simple BM25 baseline. Focus on improving memory ingestion so that it preserves the structural and lexical features crucial in multi-user contexts, and test rigorously with benchmarks like GroupMemBench to identify specific weaknesses in areas such as knowledge update and term ambiguity.
Key insights
LLM agent memory systems perform poorly in multi-party conversations and are often matched or outperformed by a simple BM25 baseline.
Principles
- Group dynamics require distinct memory modeling.
- Speaker-grounded belief tracking is essential.
- Audience-adapted language impacts memory recall.
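To make speaker-grounded belief tracking concrete, here is a minimal sketch of a per-speaker store in which a later statement by the same speaker overrides that speaker's earlier one (a knowledge update) without touching what other participants said. The class name `SpeakerBeliefStore`, its methods, and the sample statements are hypothetical illustrations, not part of GroupMemBench or any benchmarked system.

```python
from collections import defaultdict


class SpeakerBeliefStore:
    """Hypothetical per-speaker memory: speaker -> topic -> (turn, value)."""

    def __init__(self):
        self._beliefs = defaultdict(dict)

    def observe(self, speaker, topic, value, turn):
        # Keep only the most recent statement each speaker made on a topic.
        prev = self._beliefs[speaker].get(topic)
        if prev is None or turn >= prev[0]:
            self._beliefs[speaker][topic] = (turn, value)

    def belief(self, speaker, topic):
        # Returns None when this speaker never addressed the topic.
        entry = self._beliefs.get(speaker, {}).get(topic)
        return None if entry is None else entry[1]


store = SpeakerBeliefStore()
store.observe("alice", "deadline", "Friday", turn=1)
store.observe("bob", "deadline", "Monday", turn=2)
store.observe("alice", "deadline", "Wednesday", turn=3)  # alice updates her view
```

Keying memory by speaker keeps Alice's update from overwriting Bob's belief, which a single shared "deadline" fact would lose.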
Method
GroupMemBench uses a graph-grounded synthesis pipeline for multi-party conversations and an adversarial query pipeline across six categories to test LLM agent memory.
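The summary does not describe the pipeline's internals, so the following is only a rough sketch of what a conversation with an explicit reply structure could look like as data: each message records its speaker, its intended audience, and the message it replies to. All names here (`Message`, `ReplyGraph`, `thread`) are assumptions made for illustration and are not taken from the benchmark.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Message:
    msg_id: str
    speaker: str
    text: str
    reply_to: Optional[str] = None   # parent message id, None for a root
    audience: Tuple[str, ...] = ()   # intended recipients (hypothetical field)


class ReplyGraph:
    """Hypothetical container: messages are nodes, replies are edges."""

    def __init__(self):
        self.messages = {}
        self.children = {}

    def add(self, msg):
        if msg.reply_to is not None and msg.reply_to not in self.messages:
            raise KeyError(f"unknown parent message: {msg.reply_to}")
        self.messages[msg.msg_id] = msg
        self.children.setdefault(msg.msg_id, [])
        if msg.reply_to is not None:
            self.children[msg.reply_to].append(msg.msg_id)

    def thread(self, msg_id):
        # Walk parent links up to the root, then return root-first order.
        chain = []
        cur = msg_id
        while cur is not None:
            msg = self.messages[cur]
            chain.append(msg)
            cur = msg.reply_to
        return list(reversed(chain))
```

Storing `reply_to` explicitly lets a memory system recover which branch of a group chat a message belongs to, rather than treating the log as one flat sequence.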
In practice
- Evaluate LLM agents with multi-party benchmarks.
- Focus on knowledge update in group contexts.
- Consider BM25 as a strong baseline.
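As a starting point for the BM25 baseline, the sketch below implements standard Okapi BM25 scoring over raw messages in plain Python. The whitespace tokenizer, the commonly used defaults `k1=1.5` and `b=0.75`, and the sample messages are assumptions chosen for illustration; the benchmark's actual baseline configuration is not given in this summary.

```python
import math
from collections import Counter


def tokenize(text):
    # Deliberately simple: lowercase and split on whitespace.
    return text.lower().split()


class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.tfs = [Counter(d) for d in self.docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        n = len(self.docs)
        df = Counter(t for d in self.docs for t in set(d))
        # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5))
                    for t, f in df.items()}

    def score(self, query, i):
        tf, dl = self.tfs[i], len(self.docs[i])
        total = 0.0
        for t in tokenize(query):
            f = tf.get(t, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[t] * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k=3):
        order = sorted(range(len(self.docs)),
                       key=lambda i: self.score(query, i), reverse=True)
        return order[:k]


# Made-up messages; the speaker prefix is kept verbatim in the indexed text.
messages = [
    "alice: the launch date is march 3",
    "bob: i think we should order pizza for the team",
    "carol: update, the launch date moved to march 10",
]
index = BM25(messages)
hits = index.top_k("launch date", k=2)
```

Because the index stores messages verbatim, speaker names and exact wording stay searchable, which is the kind of lexical detail the summary says current ingestion methods tend to lose.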
Topics
- LLM Agent Memory
- Multi-Party Conversations
- GroupMemBench
- Speaker-Grounded Belief Tracking
- Audience-Adapted Language
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.