LongMINT: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems
Summary
LongMINT (Long-Horizon Memory under INTerference) is a new benchmark for evaluating memory-augmented agent systems in realistic, interference-heavy, long-horizon settings. It addresses a limitation of existing benchmarks by focusing on dynamic interactions between evolving memories rather than static, independent recall. LongMINT features long, interconnected contexts with frequently updated information across diverse domains, including state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits. The benchmark contains 15.6k question-answer pairs over contexts that average 138.8k tokens and extend up to 1.8M tokens per instance. It assesses robustness to interference through single-target recall and multi-target aggregation tasks. Evaluations of 7 representative systems, spanning long-context LLMs, RAG, and memory-augmented agent frameworks, show consistently low performance: accuracy averages 27.9%, with aggregated reasoning proving especially difficult. Analysis indicates that performance is limited by retrieval and memory construction, with systems struggling to recall and reason over earlier facts that are revised or interfered with by subsequent context.
Key takeaway
For AI engineers building long-horizon agent systems, this research highlights critical weaknesses in current memory and retrieval mechanisms. Shift your focus toward memory construction and retrieval strategies that stay robust under heavy interference and frequently updated information. Prioritize accurate multi-target aggregation: this is where current models degrade the most, which undermines the reliability of real-world agents.
Key insights
Current memory-augmented agents struggle with interference and multi-target reasoning in long, dynamic contexts.
Principles
- Memory interference degrades agent performance.
- Aggregated reasoning is a major challenge for agents.
Method
LongMINT evaluates memory-augmented agents using long, interconnected contexts with frequent updates, across diverse domains and question types (single-target recall, multi-target aggregation).
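As a rough illustration of how results on the two question types might be reported separately, the sketch below computes accuracy per question type. The record fields (`qtype`, `gold`, `pred`) and the exact-match scoring rule are assumptions made for this example; the summary does not specify LongMINT's actual data schema or scoring procedure.

```python
from collections import defaultdict


def accuracy_by_type(records):
    """Return accuracy per question type.

    Each record is a dict with hypothetical keys: 'qtype', 'gold', 'pred'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["qtype"]] += 1
        # Assumed scoring rule: case-insensitive exact match, ignoring outer whitespace.
        if r["pred"].strip().lower() == r["gold"].strip().lower():
            correct[r["qtype"]] += 1
    return {q: correct[q] / total[q] for q in total}


# Made-up records, for illustration only.
records = [
    {"qtype": "single_target", "gold": "Paris", "pred": "paris"},
    {"qtype": "single_target", "gold": "3", "pred": "4"},
    {"qtype": "multi_target", "gold": "12", "pred": "10"},
]
scores = accuracy_by_type(records)
```

Reporting the two types separately, rather than one pooled number, makes it visible when aggregation lags behind single-target recall.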
In practice
- Focus on improving retrieval in memory systems, especially for earlier facts that later context revises.
- Enhance memory construction for dynamic contexts where information is frequently updated.
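One way to picture memory construction that tolerates revisions is a store that appends every update to a key instead of overwriting it, so both the current value and earlier values stay retrievable. The sketch below is an illustrative assumption, not a method from the paper; `VersionedMemory` and its methods are hypothetical names.

```python
from collections import defaultdict


class VersionedMemory:
    """Toy memory that appends revisions instead of overwriting (hypothetical design)."""

    def __init__(self):
        self._history = defaultdict(list)  # key -> list of (step, value)

    def write(self, key, value, step):
        self._history[key].append((step, value))

    def latest(self, key):
        """Single-target recall: the most recent value stored for one key."""
        return max(self._history[key], key=lambda sv: sv[0])[1]

    def value_at(self, key, step):
        """Return the value that was current at the given step, or None."""
        past = [(s, v) for s, v in self._history[key] if s <= step]
        return max(past, key=lambda sv: sv[0])[1] if past else None

    def aggregate(self, keys, fn):
        """Multi-target aggregation: combine the latest values of several keys."""
        return fn([self.latest(k) for k in keys])


mem = VersionedMemory()
mem.write("alice_balance", 100, step=1)
mem.write("bob_balance", 50, step=2)
mem.write("alice_balance", 70, step=3)  # later revision of an earlier fact

total = mem.aggregate(["alice_balance", "bob_balance"], sum)
earlier = mem.value_at("alice_balance", step=2)
```

Keeping the full history means a question about the state before a revision can still be answered, at the cost of storage that grows with every update.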
Topics
- LongMINT Benchmark
- Long-Horizon Agents
- Memory Interference
- Memory-Augmented Systems
- Retrieval-Augmented Generation
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.