EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval

2024-08-06 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

EngramaBench is a new benchmark for evaluating long-term conversational memory in large language model assistants across multi-session interactions. It comprises five personas, 100 multi-session conversations, and 150 queries in five categories: factual recall (single_space), cross-space integration (cross_space), temporal reasoning (temporal_cross_space), adversarial abstention, and emergent synthesis. The benchmark compares Engrama, a graph-structured memory system, against GPT-4o full-context prompting and Mem0, an open-source vector-retrieval system. All three systems use GPT-4o as the answering model, so that differences in score reflect the memory architecture rather than the answering model. GPT-4o full-context achieved the highest composite score (0.6186). Engrama scored 0.5367 on the composite but outperformed GPT-4o full-context on cross_space reasoning (0.6532 vs. 0.6291). Mem0 was the cheapest at $0.36 but had the weakest composite score (0.4809). Ablation studies on Engrama revealed a trade-off: the components that improve cross-space performance reduce the global composite score.

Key takeaway

For AI engineers designing conversational agents, this research shows that full-context prompting with GPT-4o remains the strongest option for long-term memory overall, while a graph-structured memory system such as Engrama offers a measurable advantage on cross-space reasoning (0.6532 vs. 0.6291). Consider structured memory for applications that must integrate information across distinct domains of a user's life, accepting that in this benchmark it came with a lower composite score (0.5367 vs. 0.6186). The ablation results suggest that structured memory components still need tuning to balance this specialized strength against aggregate performance.

Key insights

Structured memory excels at cross-space reasoning, but full-context prompting currently leads in overall performance.

Principles

Memory architecture significantly impacts LLM conversational performance.
Cross-space integration is a key differentiator for structured memory.
Cost-quality trade-offs exist between memory systems.

Method

Engrama processes conversations into a graph-structured memory organized by entities, semantic spaces, temporal traces, and associative links, then activates relevant neighborhoods for query-time summarization.
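
The description above can be illustrated with a minimal sketch. This is not Engrama's implementation: the class name `MemoryGraph`, its methods, the breadth-first activation rule, and the sample entities are all hypothetical, chosen only to show how entities, semantic spaces, temporal traces, and associative links might fit together and how a linked neighborhood could be gathered as context for a query.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class MemoryGraph:
    """Toy graph memory (hypothetical, not Engrama's actual data model)."""
    spaces: dict = field(default_factory=lambda: defaultdict(set))   # entity -> semantic spaces
    traces: dict = field(default_factory=lambda: defaultdict(list))  # entity -> [(session, note)]
    links: dict = field(default_factory=lambda: defaultdict(set))    # entity -> linked entities

    def add_trace(self, entity, space, session, note):
        """Record a dated note about an entity inside a semantic space."""
        self.spaces[entity].add(space)
        self.traces[entity].append((session, note))

    def link(self, a, b):
        """Add an undirected associative link between two entities."""
        self.links[a].add(b)
        self.links[b].add(a)

    def activate(self, seeds, max_hops=1):
        """Breadth-first expansion from seed entities, up to max_hops links away."""
        seen = set(seeds)
        frontier = deque((s, 0) for s in seeds)
        while frontier:
            node, depth = frontier.popleft()
            if depth == max_hops:
                continue
            for nxt in sorted(self.links[node]):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, depth + 1))
        return seen

    def context_for(self, seeds, max_hops=1):
        """Collect traces from the activated neighborhood, ordered by session."""
        rows = []
        for entity in self.activate(seeds, max_hops):
            for session, note in self.traces[entity]:
                rows.append((session, entity, note))
        return sorted(rows)


# Illustrative data only: three entities in three different spaces.
g = MemoryGraph()
g.add_trace("marathon", "health", 3, "training for a spring marathon")
g.add_trace("product_launch", "work", 5, "launch moved to April")
g.add_trace("sister", "family", 2, "sister lives in Lisbon")
g.link("marathon", "product_launch")

context = g.context_for(["marathon"])
```

In this sketch a query seeded at `marathon` pulls in the linked `product_launch` trace from a different space but not the unlinked `sister` trace. The collected rows stand in for the text that would be passed to a summarization step before answering; that step is omitted here.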

In practice

Consider graph-structured memory for complex cross-domain queries.
Evaluate memory systems per reasoning category, not only by aggregate score.
Be aware of cost implications for different memory architectures.
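
As a small worked example of looking past the aggregate, the sketch below uses only the scores quoted in the summary (composite for all three systems, cross_space for two of them). The helper names `best_by` and `gap` are hypothetical, and scores for the remaining categories are not given in this summary, so they are left out rather than guessed.

```python
# Scores as reported in the summary; unreported values are simply absent.
scores = {
    "GPT-4o full-context": {"composite": 0.6186, "cross_space": 0.6291},
    "Engrama": {"composite": 0.5367, "cross_space": 0.6532},
    "Mem0": {"composite": 0.4809},
}


def best_by(metric):
    """Return the top system for a metric, skipping systems with no reported value."""
    reported = {name: vals[metric] for name, vals in scores.items() if metric in vals}
    return max(reported, key=reported.get)


def gap(metric, a, b):
    """Score of system a minus score of system b, rounded to 4 decimals."""
    return round(scores[a][metric] - scores[b][metric], 4)


leaders = {m: best_by(m) for m in ("composite", "cross_space")}
cross_gap = gap("cross_space", "Engrama", "GPT-4o full-context")
composite_gap = gap("composite", "Engrama", "GPT-4o full-context")
```

Here the leader changes with the metric: full-context prompting leads on the composite, Engrama leads on cross_space. The two gaps differ in sign and size, which is the pattern an aggregate-only comparison would hide.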

Topics

EngramaBench
Long-Term Conversational Memory
Graph-Structured Memory
Cross-Space Reasoning
GPT-4o Full-Context Prompting

Code references

julianacunadc/engramabench

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.