MemoryDocDataSet: A Benchmark for Joint Conversational Memory and Long Document Reasoning
Summary
MemoryDocDataSet is a synthetic benchmark that evaluates whether AI systems can navigate multi-session conversation history and perform deep reading comprehension over long documents at the same time. It comprises 50 micro-worlds and 1,000 QA pairs. Each micro-world includes 3-5 personas, a temporal event graph, 3-5 real long documents (20,000-50,000 tokens, drawn from the Caselaw Access Project), and multi-session conversations. Its defining feature is the "Hybrid" question type, which makes up 75.1% of the dataset and requires a system to first identify the relevant documents from the conversation history and then extract the answer from them. Dataset quality was characterized with an LLM-as-judge, yielding a median Cohen's κ = 0.634. In baseline evaluations across six configurations, including RAG and long-context LLMs, the best system (RAG-Both) achieved 0.358 F1 overall and 0.342 on Hybrid questions. Document-only retrieval (RAG-Doc) collapsed to 0.267 on Hybrid questions, revealing a significant "joint-retrieval gap".
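To make the reported numbers concrete, here is a minimal sketch of how an answer-level F1 and the size of the gap could be computed. It assumes a SQuAD-style token-overlap F1, which the summary does not confirm as the benchmark's exact metric, and it treats the gap as the simple difference between the two reported Hybrid scores. The names `token_f1`, `hybrid_f1`, and `joint_retrieval_gap` are illustrative, not taken from the paper.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer.

    Assumption: lowercase whitespace tokenization; the benchmark's
    actual normalization may differ.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match, one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hybrid-question F1 scores as reported in the summary.
hybrid_f1 = {"RAG-Both": 0.342, "RAG-Doc": 0.267}

# Assumed definition: gap = joint retrieval minus document-only retrieval.
joint_retrieval_gap = hybrid_f1["RAG-Both"] - hybrid_f1["RAG-Doc"]
```

Under this assumed definition the gap on Hybrid questions is about 0.075 F1, roughly a fifth of the best system's Hybrid score.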
Key takeaway
AI scientists and machine learning engineers building systems for complex information retrieval should recognize that current architectures exhibit a significant "joint-retrieval gap" when combining conversational memory with long document reasoning. Incorporate benchmarks like MemoryDocDataSet into your evaluations to expose this limitation, and focus development on architectures that explicitly unify conversational history navigation and deep document comprehension.
Key insights
MemoryDocDataSet uniquely benchmarks AI systems on joint conversational memory and long document reasoning, revealing a critical performance gap.
Principles
- Existing benchmarks fail to evaluate combined conversational memory and long document reasoning.
- "Hybrid" questions expose a "joint-retrieval gap" in current AI architectures.
- LLM-as-judge can effectively characterize dataset quality via prompt-sensitivity analysis.
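The prompt-sensitivity idea can be sketched as follows: run the judge under several prompt variants, compute Cohen's κ for each pair of variants, and report the median. This is an assumed reading of how a "median Cohen's κ" might arise. The summary does not describe the exact procedure, and the labels and prompt names below are toy data invented for illustration.

```python
from collections import Counter
from itertools import combinations
from statistics import median


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy judge verdicts under three hypothetical prompt variants.
verdicts = {
    "prompt_1": ["ok", "ok", "bad", "ok", "bad", "ok"],
    "prompt_2": ["ok", "bad", "bad", "ok", "bad", "ok"],
    "prompt_3": ["ok", "ok", "bad", "ok", "ok", "ok"],
}

pairwise_kappas = [
    cohens_kappa(verdicts[a], verdicts[b])
    for a, b in combinations(sorted(verdicts), 2)
]
median_kappa = median(pairwise_kappas)
```

A low median would suggest that the judge's verdicts depend heavily on prompt wording, while a high median suggests the quality labels are stable across prompts.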
Method
The MemoryDocDataSet generation pipeline creates synthetic micro-worlds with personas, temporal event graphs, long documents, multi-session conversations, and categorized QA pairs, including "Hybrid" questions.
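One way to picture a single benchmark instance is as a small record type. The sketch below is an assumption about shape only: the class names, field names, and the `validate` and `hybrid_share` helpers are hypothetical and are not the dataset's published schema. The 3-5 ranges come from the summary.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class QAPair:
    question: str
    answer: str
    category: str  # "Hybrid" is named in the summary; other labels are not


@dataclass
class MicroWorld:
    personas: list[str]
    events: list[tuple[str, str]]  # hypothetical (timestamp, description) nodes
    documents: list[str]           # long document texts
    sessions: list[list[str]]      # each session is a list of turns
    qa_pairs: list[QAPair] = field(default_factory=list)

    def validate(self) -> None:
        """Check the size ranges stated in the summary."""
        if not 3 <= len(self.personas) <= 5:
            raise ValueError("expected 3-5 personas")
        if not 3 <= len(self.documents) <= 5:
            raise ValueError("expected 3-5 documents")


def hybrid_share(worlds: list[MicroWorld]) -> float:
    """Fraction of all QA pairs across worlds whose category is 'Hybrid'."""
    pairs = [qa for world in worlds for qa in world.qa_pairs]
    if not pairs:
        return 0.0
    return sum(qa.category == "Hybrid" for qa in pairs) / len(pairs)
```

With 50 micro-worlds and 1,000 QA pairs, each world would carry 20 QA pairs on average, and `hybrid_share` over the full set would be expected to return roughly 0.751.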
In practice
- Use MemoryDocDataSet to evaluate AI systems' joint reasoning over conversation history and long documents.
- Prioritize developing architectures that unify conversational memory with long-document navigation.
- Employ LLM-as-judge for robust dataset quality assessment in complex NLP tasks.
Topics
- Conversational AI
- Long Document Reasoning
- AI Benchmarking
- Retrieval-Augmented Generation
- LLM Evaluation
- Dataset Generation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.