EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval

Source: cs.CL updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Expert, extended

Summary

EngramaBench is a new benchmark designed to evaluate long-term conversational memory in large language model assistants, focusing on multi-session interactions. It features five personas, 100 multi-session conversations, and 150 queries across five categories: factual recall (single_space), cross-space integration (cross_space), temporal reasoning (temporal_cross_space), adversarial abstention, and emergent synthesis. The benchmark evaluates Engrama, a graph-structured memory system, against GPT-4o full-context prompting and Mem0, an open-source vector-retrieval system. All systems use GPT-4o as the answering model to isolate memory architecture effects. GPT-4o full-context achieved the highest composite score of 0.6186, while Engrama scored 0.5367 globally but outperformed GPT-4o on cross_space reasoning (0.6532 vs. 0.6291). Mem0 was the cheapest at $0.36 but weakest overall (0.4809). Ablation studies on Engrama revealed a trade-off where components enhancing cross-space performance reduced the global composite score.
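To make the size of the trade-off concrete, the gaps between the reported scores can be computed directly. The snippet below is only a small arithmetic sketch over the numbers quoted in this summary; the dictionary layout and the `gap` helper are illustrative assumptions, not part of the benchmark's tooling.

```python
# Scores as quoted in the summary; "composite" is the global composite score.
composite = {"GPT-4o full-context": 0.6186, "Engrama": 0.5367, "Mem0": 0.4809}
cross_space = {"GPT-4o full-context": 0.6291, "Engrama": 0.6532}


def gap(scores: dict, a: str, b: str) -> float:
    """Return score[a] - score[b], rounded to four decimals (hypothetical helper)."""
    return round(scores[a] - scores[b], 4)


# How far Engrama trails full-context prompting overall.
composite_gap = gap(composite, "GPT-4o full-context", "Engrama")
# How far Engrama leads on the cross_space category.
cross_space_gap = gap(cross_space, "Engrama", "GPT-4o full-context")

print(f"composite gap (GPT-4o over Engrama): {composite_gap}")
print(f"cross_space gap (Engrama over GPT-4o): {cross_space_gap}")
```

On these figures, Engrama's cross_space lead is roughly a third the size of its overall deficit.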

Key takeaway

For AI engineers designing conversational agents, this research indicates that full-context prompting with GPT-4o remains the strongest option for general long-term memory, while a graph-structured memory system such as Engrama offers a small but measurable advantage on cross-space reasoning (0.6532 vs. 0.6291). Consider structured memory for applications that must integrate information across distinct domains of a user's life, accepting a lower overall composite score (0.5367 vs. 0.6186). The ablation results suggest that structured memory components still need tuning to balance this specialized strength against aggregate performance.

Key insights

Structured memory excels at cross-space reasoning, but full-context prompting currently leads in overall performance.

Principles

Method

Engrama processes conversations into a graph-structured memory organized by entities, semantic spaces, temporal traces, and associative links, then activates relevant neighborhoods for query-time summarization.
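The general idea of a graph-structured memory with neighborhood activation can be sketched in a few lines. This is a toy illustration under stated assumptions, not Engrama's implementation: the class names, the rule that two facts are linked when they share an entity, the hop-based expansion, and the example facts are all hypothetical. The final summarization step, which would call an answering model, is omitted.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MemoryNode:
    """One remembered fact, tagged with the dimensions the summary names."""
    node_id: str
    text: str
    entities: set[str]
    space: str    # semantic space, e.g. "work" or "family" (hypothetical labels)
    session: int  # temporal trace: index of the session that produced the fact


@dataclass
class GraphMemory:
    nodes: dict[str, MemoryNode] = field(default_factory=dict)
    links: dict[str, set[str]] = field(default_factory=dict)

    def add(self, node: MemoryNode) -> None:
        # Assumed linking rule: connect nodes that share at least one entity.
        self.links.setdefault(node.node_id, set())
        for other in self.nodes.values():
            if node.entities & other.entities:
                self.links[node.node_id].add(other.node_id)
                self.links[other.node_id].add(node.node_id)
        self.nodes[node.node_id] = node

    def activate(self, query_entities: set[str], hops: int = 1) -> list[MemoryNode]:
        # Seed with nodes mentioning a query entity, then expand along links.
        active = {nid for nid, n in self.nodes.items() if n.entities & query_entities}
        frontier = set(active)
        for _ in range(hops):
            frontier = {m for nid in frontier for m in self.links.get(nid, set())} - active
            active |= frontier
        # Temporal order, so a downstream summarizer sees events in sequence.
        return sorted((self.nodes[nid] for nid in active), key=lambda n: (n.session, n.node_id))


memory = GraphMemory()
memory.add(MemoryNode("n1", "Ana started a new job at Acme.", {"Ana", "Acme"}, "work", 1))
memory.add(MemoryNode("n2", "Ana's sister Bea is moving to Lisbon.", {"Ana", "Bea"}, "family", 2))
memory.add(MemoryNode("n3", "Bea asked for help with the move.", {"Bea"}, "family", 3))
memory.add(MemoryNode("n4", "Carl took up chess.", {"Carl"}, "hobby", 4))

# A query about "Acme" starts in the work space and reaches the family space via "Ana".
activated = memory.activate({"Acme"}, hops=2)
print([(n.node_id, n.space) for n in activated])
```

In this toy setup, a query that mentions only a work entity still pulls in family facts through the shared person, which is the kind of cross-space integration the benchmark's cross_space category is meant to probe. The unrelated hobby fact is never activated.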

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.