GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

GroupMemBench is a benchmark for evaluating Large Language Model (LLM) agent memory in multi-party conversations, addressing a limitation of existing benchmarks, which focus on dyadic interactions. It measures three properties of group memory: group dynamics beyond one-on-one chats, speaker-grounded belief tracking for per-user memory, and audience-adapted language reflecting Theory-of-Mind shifts. A graph-grounded synthesis pipeline generates multi-party conversations with controlled reply structures, conditioned on per-user personas and target audiences. An adversarial query pipeline then creates challenging, realistic questions across six categories, including multi-hop reasoning and knowledge update. Benchmarking leading memory systems revealed poor performance: the strongest reached only 46.0% average accuracy, and categories such as knowledge update (27.1%) and term ambiguity (37.7%) scored lower still. A simple BM25 baseline often matched or surpassed most agent memory systems, indicating that current memory ingestion methods fail to preserve structural and lexical features that matter in multi-user contexts.
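To make the BM25 comparison concrete, here is a minimal sketch of a BM25 retriever over speaker-tagged messages. It is not the paper's baseline implementation: the whitespace tokenizer, the `k1` and `b` defaults, the class name, and the toy messages are all assumptions chosen for illustration. The point it shows is that raw messages are indexed verbatim, so speaker names and exact wording remain searchable.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; real systems use better tokenizers.
    return text.lower().split()


class BM25Index:
    """Hypothetical minimal Okapi BM25 index over raw chat messages."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1 = k1
        self.b = b
        self.docs = [tokenize(d) for d in docs]
        self.term_freqs = [Counter(d) for d in self.docs]
        self.avg_len = sum(len(d) for d in self.docs) / len(self.docs)
        n = len(self.docs)
        # Document frequency: number of messages containing each term.
        doc_freq = Counter(t for d in self.docs for t in set(d))
        self.idf = {
            t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in doc_freq.items()
        }

    def score(self, query, i):
        tf = self.term_freqs[i]
        doc_len = len(self.docs[i])
        norm = 1 - self.b + self.b * doc_len / self.avg_len
        total = 0.0
        for term in tokenize(query):
            if term in tf:
                f = tf[term]
                total += self.idf[term] * f * (self.k1 + 1) / (f + self.k1 * norm)
        return total

    def top_k(self, query, k=3):
        # Indices of the k highest-scoring messages, best first.
        order = sorted(
            range(len(self.docs)), key=lambda i: self.score(query, i), reverse=True
        )
        return order[:k]


# Toy messages (invented for illustration); the speaker tag stays in the indexed text.
messages = [
    "alice: the launch moved to friday",
    "bob: i still think thursday works",
    "carol: lunch is at noon",
]
index = BM25Index(messages)
best = index.top_k("when is the launch", k=1)
```

Because nothing is summarized or rewritten at ingestion time, a retriever like this keeps who said what and the exact terms used, which is one plausible reading of why such a baseline stays competitive.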

Key takeaway

AI engineers developing LLM agents for collaborative environments should prioritize memory systems that can handle complex multi-party interactions. Current systems struggle significantly with group dynamics, speaker-grounded belief tracking, and audience-adapted language, and are often matched or surpassed by a simple BM25 baseline. Focus on improving memory ingestion so that it preserves the structural and lexical features of multi-user conversations, and test rigorously with benchmarks like GroupMemBench to identify specific weaknesses in areas such as knowledge update and term ambiguity.
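Identifying category-specific weaknesses comes down to reporting accuracy per query category rather than a single number. The sketch below assumes graded results arrive as `(category, is_correct)` pairs and that the average is an unweighted mean over categories; the summary does not say how the benchmark averages, and the function name and toy results are hypothetical.

```python
from collections import defaultdict


def accuracy_by_category(results):
    """Return (per-category accuracy, unweighted mean over categories).

    `results` is an iterable of (category, is_correct) pairs.
    Assumption: the average is a macro average; the benchmark may differ.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in results:
        totals[category] += 1
        correct[category] += int(is_correct)
    per_category = {c: correct[c] / totals[c] for c in totals}
    macro_average = sum(per_category.values()) / len(per_category)
    return per_category, macro_average


# Toy graded answers, invented for illustration (not results from the paper).
graded = [
    ("knowledge_update", True),
    ("knowledge_update", False),
    ("multi_hop", True),
]
per_category, macro_average = accuracy_by_category(graded)
```

Reading the per-category dictionary alongside the average makes it harder for a weak category to hide behind a stronger one.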

Key insights

LLM agent memory systems perform poorly in multi-party conversations; a simple BM25 baseline often matches or surpasses most of them.

Method

GroupMemBench uses a graph-grounded synthesis pipeline for multi-party conversations and an adversarial query pipeline across six categories to test LLM agent memory.
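As a rough illustration of what a reply-structured, speaker-grounded conversation record might look like, the sketch below stores each message with its speaker, the message it replies to, and its intended audience. The schema, field names, and helper functions are assumptions made for this example and are not taken from the GroupMemBench pipeline.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Message:
    # Hypothetical schema: one node in a conversation reply graph.
    msg_id: str
    speaker: str
    text: str
    reply_to: Optional[str] = None   # parent message id, None for a thread root
    audience: Tuple[str, ...] = ()   # users the speaker is addressing


def reply_chain(messages, msg_id):
    """Walk reply_to links from msg_id back to the thread root, oldest first."""
    by_id = {m.msg_id: m for m in messages}
    chain = []
    seen = set()
    current = by_id.get(msg_id)
    while current is not None and current.msg_id not in seen:
        seen.add(current.msg_id)  # guard against malformed cyclic links
        chain.append(current)
        current = by_id.get(current.reply_to) if current.reply_to else None
    return list(reversed(chain))


def statements_by_speaker(messages):
    """Group message texts per speaker, preserving conversation order."""
    grouped = defaultdict(list)
    for m in messages:
        grouped[m.speaker].append(m.text)
    return dict(grouped)


# Toy conversation, invented for illustration.
log = [
    Message("m1", "alice", "Launch is Thursday."),
    Message("m2", "bob", "Thursday clashes with the audit.", reply_to="m1", audience=("alice",)),
    Message("m3", "alice", "Then launch moves to Friday.", reply_to="m2"),
]
chain_ids = [m.msg_id for m in reply_chain(log, "m3")]
alice_statements = statements_by_speaker(log)["alice"]
```

Keeping `reply_to` and `speaker` explicit means a knowledge-update question ("when is the launch now?") can be answered by following the thread and taking the speaker's latest statement, rather than relying on a flattened transcript.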

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.