EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval
Summary
EngramaBench is a new benchmark designed to evaluate long-term conversational memory in large language model assistants, focusing on multi-session interactions. It features five personas, 100 multi-session conversations, and 150 queries across five categories: factual recall (single_space), cross-space integration (cross_space), temporal reasoning (temporal_cross_space), adversarial abstention, and emergent synthesis. The benchmark evaluates Engrama, a graph-structured memory system, against GPT-4o full-context prompting and Mem0, an open-source vector-retrieval system. All systems use GPT-4o as the answering model to isolate memory architecture effects. GPT-4o full-context achieved the highest composite score of 0.6186, while Engrama scored 0.5367 globally but outperformed GPT-4o on cross_space reasoning (0.6532 vs. 0.6291). Mem0 was the cheapest at $0.36 but weakest overall (0.4809). Ablation studies on Engrama revealed a trade-off where components enhancing cross-space performance reduced the global composite score.
Key takeaway
For AI engineers designing conversational agents, this research shows that full-context prompting with GPT-4o remains the strongest option for general long-term memory (composite 0.6186 vs. 0.5367 for Engrama), while graph-structured memory such as Engrama holds a small but measurable advantage on cross-space reasoning (0.6532 vs. 0.6291). Consider structured memory for applications that must integrate information across distinct domains of a user's life, accepting a lower overall composite score in exchange. The ablation results suggest that structured memory components still need tuning so that gains on cross-space queries do not come at the expense of aggregate performance.
Key insights
Structured memory excels at cross-space reasoning, but full-context prompting currently leads in overall performance.
Principles
- Memory architecture significantly impacts LLM conversational performance.
- Cross-space integration is a key differentiator for structured memory.
- Cost-quality trade-offs exist between memory systems.
Method
Engrama processes conversations into a graph-structured memory organized by entities, semantic spaces, temporal traces, and associative links, then activates relevant neighborhoods for query-time summarization.
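The idea of activating a neighborhood in a memory graph can be illustrated with a minimal sketch. This is not Engrama's implementation: the class names (`MemoryNode`, `GraphMemory`), the breadth-first activation rule, the `max_hops` parameter, and the sample conversation data are all assumptions made for illustration. The sketch stores each entity with its semantic space and time-stamped traces, connects entities with associative links, and gathers the traces of every entity within a few hops of the query entities as context for the answering model.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class MemoryNode:
    """Hypothetical node: one entity, its semantic space, and its temporal traces."""
    entity: str
    space: str
    traces: list = field(default_factory=list)  # (session_index, text) pairs


class GraphMemory:
    """Illustrative graph memory; not the system evaluated in the paper."""

    def __init__(self):
        self.nodes = {}
        self.links = defaultdict(set)  # undirected associative links

    def add_trace(self, entity, space, session, text):
        node = self.nodes.setdefault(entity, MemoryNode(entity, space))
        node.traces.append((session, text))

    def link(self, a, b):
        self.links[a].add(b)
        self.links[b].add(a)

    def activate(self, seeds, max_hops=1):
        """Breadth-first expansion from the seed entities, up to max_hops links away."""
        seen = {s for s in seeds if s in self.nodes}
        frontier = deque((s, 0) for s in seen)
        while frontier:
            entity, depth = frontier.popleft()
            if depth == max_hops:
                continue
            for neighbor in self.links[entity]:
                if neighbor in self.nodes and neighbor not in seen:
                    seen.add(neighbor)
                    frontier.append((neighbor, depth + 1))
        return seen

    def context(self, seeds, max_hops=1):
        """Traces of all activated entities, ordered by session index."""
        active = self.activate(seeds, max_hops)
        rows = [
            (session, self.nodes[e].space, e, text)
            for e in active
            for session, text in self.nodes[e].traces
        ]
        return sorted(rows)


# Made-up example data spanning two semantic spaces.
memory = GraphMemory()
memory.add_trace("Ana", "work", 1, "Ana leads the Q3 launch")
memory.add_trace("marathon", "health", 2, "Training for a marathon in October")
memory.add_trace("Ana", "work", 3, "Ana moved the launch to October")
memory.link("Ana", "marathon")

ctx = memory.context(["Ana"])
for session, space, entity, text in ctx:
    print(session, space, entity, text)
```

In this toy example a query seeded on "Ana" pulls in the linked "marathon" entity from a different space, which is the kind of cross-space context that a flat per-topic lookup could miss. The retrieved rows would then be summarized and passed to the answering model.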
In practice
- Consider graph-structured memory for complex cross-domain queries.
- Evaluate memory systems beyond aggregate scores for specific reasoning tasks.
- Be aware of cost implications for different memory architectures.
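To make the second point concrete, the short sketch below ranks the three systems by composite score and by cross_space score, using only the figures quoted in the summary above. The helper names `rank` and `gap` are hypothetical, and Mem0's cross_space score is left out because this summary does not report it.

```python
# Scores as quoted in the summary; nothing here is newly measured.
scores = {
    "GPT-4o full-context": {"composite": 0.6186, "cross_space": 0.6291},
    "Engrama": {"composite": 0.5367, "cross_space": 0.6532},
    "Mem0": {"composite": 0.4809},  # cross_space not reported in this summary
}


def rank(metric):
    """Systems that report the metric, ordered from best to worst."""
    reported = {name: vals[metric] for name, vals in scores.items() if metric in vals}
    return sorted(reported, key=reported.get, reverse=True)


def gap(metric, a, b):
    """Score of system a minus score of system b on one metric."""
    return round(scores[a][metric] - scores[b][metric], 4)


print("composite ranking:  ", rank("composite"))
print("cross_space ranking:", rank("cross_space"))
print("composite gap (GPT-4o - Engrama):  ", gap("composite", "GPT-4o full-context", "Engrama"))
print("cross_space gap (Engrama - GPT-4o):", gap("cross_space", "Engrama", "GPT-4o full-context"))
```

The two rankings disagree on the leader, which is why a single aggregate number can hide the category where a memory architecture is strongest.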
Topics
- EngramaBench
- Long-Term Conversational Memory
- Graph-Structured Memory
- Cross-Space Reasoning
- GPT-4o Full-Context Prompting
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.