MemoryDocDataSet: A Benchmark for Joint Conversational Memory and Long Document Reasoning

· Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

MemoryDocDataSet is a new synthetic benchmark that evaluates whether AI systems can navigate multi-session conversation history and perform deep reading comprehension within long documents at the same time. It comprises 50 micro-worlds and 1,000 QA pairs. Each micro-world includes 3-5 personas, a temporal event graph, 3-5 real long documents (20,000-50,000 tokens, from the Caselaw Access Project), and multi-session conversations. Its defining feature is the "Hybrid" question type, which makes up 75.1% of the dataset and requires a system to first identify the relevant documents from the conversation history and then extract the answer from them. Dataset quality was characterized using LLM-as-judge, yielding a median Cohen's κ = 0.634. Baseline evaluations across six configurations, including RAG and long-context LLMs, showed that the best system (RAG-Both) achieved 0.358 F1 overall and 0.342 F1 on Hybrid questions. Document-only retrieval (RAG-Doc) collapsed to 0.267 on Hybrid questions, revealing a significant "joint-retrieval gap".
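The two-step nature of a Hybrid question can be illustrated with a toy example. The sketch below is not the benchmark's RAG-Both or RAG-Doc implementation: the token-overlap scorer, the `doc_id` pointer attached to each conversation turn, and all sample text are assumptions made purely for illustration. It shows why the question alone cannot pick the right document, while the conversation history can.

```python
import re
from collections import Counter

def tokens(text):
    """Lowercase word tokens; a stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())

def overlap(query, passage):
    """Number of tokens shared by query and passage (multiset intersection)."""
    return sum((Counter(tokens(query)) & Counter(tokens(passage))).values())

def retrieve(query, pool, k=1):
    """Return the k entries of pool whose text overlaps most with the query."""
    return sorted(pool, key=lambda p: overlap(query, p["text"]), reverse=True)[:k]

def hybrid_lookup(question, conversation, documents):
    # Step 1: find the conversation turn the question alludes to.
    turn = retrieve(question, conversation, k=1)[0]
    # Step 2: follow that turn's (hypothetical) doc_id pointer to one document.
    chunks = [c for c in documents if c["doc_id"] == turn["doc_id"]]
    # Step 3: retrieve the answer-bearing chunk inside that document only.
    return retrieve(question, chunks, k=1)[0]

# Toy data, invented for this sketch.
conversation = [
    {"text": "Maya: I sent Omar the appeal ruling about the warehouse lease.",
     "doc_id": "case_A"},
    {"text": "Omar: The other ruling on the patent dispute is still unread.",
     "doc_id": "case_B"},
]
documents = [
    {"doc_id": "case_A",
     "text": "The court awarded damages of 12,000 dollars to the tenant."},
    {"doc_id": "case_B",
     "text": "The court awarded damages of 90,000 dollars to the inventor."},
]
question = "What damages did the court award in the ruling Maya sent Omar?"

# Document-only scoring cannot separate the two rulings: both chunks tie.
doc_scores = [overlap(question, d["text"]) for d in documents]
# Routing through the conversation first resolves the ambiguity.
answer_chunk = hybrid_lookup(question, conversation, documents)
```

In this toy setup the two document chunks receive identical scores from the question alone, so a document-only retriever has no basis for choosing between them. In the real benchmark no explicit `doc_id` pointer is assumed to exist; a system would have to infer the link between a conversation turn and a document.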

Key takeaway

If you are an AI scientist or machine learning engineer developing advanced systems for complex information retrieval, recognize that current architectures exhibit a significant "joint-retrieval gap" when combining conversational memory with long document reasoning. Include benchmarks like MemoryDocDataSet in your evaluations to expose these limitations, and focus development on architectures that explicitly unify conversation-history navigation and deep document comprehension.

Key insights

MemoryDocDataSet uniquely benchmarks AI systems on joint conversational memory and long document reasoning, revealing a critical performance gap.

Method

The MemoryDocDataSet generation pipeline creates synthetic micro-worlds with personas, temporal event graphs, long documents, multi-session conversations, and categorized QA pairs, including "Hybrid" questions.
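One way to picture a single generated micro-world is as a small record type. The sketch below is an assumed schema, not the dataset's published format: every class and field name is hypothetical, and "Other" is only a placeholder for the non-Hybrid question categories, which this summary does not name. The validation bounds (3-5 personas, 3-5 documents) come from the summary above.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Event:
    """Node in the temporal event graph (hypothetical fields)."""
    event_id: str
    description: str
    after: List[str] = field(default_factory=list)  # ids of preceding events

@dataclass
class QAPair:
    question: str
    answer: str
    category: str  # "Hybrid", or a placeholder such as "Other"

@dataclass
class MicroWorld:
    personas: List[str]
    events: List[Event]
    documents: List[str]        # long document texts
    sessions: List[List[str]]   # each session is a list of conversation turns
    qa_pairs: List[QAPair]

    def validate(self):
        """Check the size bounds stated in the summary and graph consistency."""
        if not 3 <= len(self.personas) <= 5:
            raise ValueError("expected 3-5 personas")
        if not 3 <= len(self.documents) <= 5:
            raise ValueError("expected 3-5 documents")
        known = {e.event_id for e in self.events}
        for e in self.events:
            if any(prev not in known for prev in e.after):
                raise ValueError(f"event {e.event_id} references an unknown event")

def hybrid_share(worlds):
    """Fraction of QA pairs across all worlds that are tagged "Hybrid"."""
    qa = [q for w in worlds for q in w.qa_pairs]
    return sum(q.category == "Hybrid" for q in qa) / len(qa)

# Tiny invented instance, for illustration only.
world = MicroWorld(
    personas=["Maya", "Omar", "Lin"],
    events=[Event("e1", "Maya shares a ruling"),
            Event("e2", "Omar replies", after=["e1"])],
    documents=["ruling text A", "ruling text B", "ruling text C"],
    sessions=[["Maya: here is the ruling."], ["Omar: thanks, reading it now."]],
    qa_pairs=[QAPair("q1", "a1", "Hybrid"), QAPair("q2", "a2", "Hybrid"),
              QAPair("q3", "a3", "Hybrid"), QAPair("q4", "a4", "Other")],
)
world.validate()
share = hybrid_share([world])
```

Applied to the full dataset, a function like `hybrid_share` would be expected to return roughly 0.751, the Hybrid proportion reported in the summary.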

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.