LongMINT: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

LongMINT (Long-Horizon Memory under INTerference) is a benchmark for evaluating memory-augmented agent systems in realistic, interference-heavy, long-horizon settings. Where existing benchmarks mostly test static, independent recall, LongMINT targets the dynamic interactions between evolving memories. Its contexts are long, interconnected, and frequently updated, drawn from diverse domains such as state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits. The benchmark contains 15.6k question-answer pairs over contexts averaging 138.8k tokens, with individual instances reaching 1.8M tokens. Robustness to interference is assessed with two task types: single-target recall and multi-target aggregation. Evaluations of 7 representative systems, including long-context LLMs, RAG, and memory-augmented agent frameworks, showed consistently low performance, averaging 27.9% accuracy, with aggregated reasoning particularly weak. The analysis indicates that performance is limited by retrieval and memory construction: systems struggle to recall and reason over earlier facts that are revised or interfered with by later context.

Key takeaway

For AI engineers building long-horizon agent systems, this research exposes critical weaknesses in current memory and retrieval mechanisms. Focus on memory construction and retrieval strategies that stay robust under heavy interference and frequently updated information. Prioritize accurate multi-target aggregation: this is where current systems degrade most, and that degradation directly affects the reliability of real-world agents.

Key insights

Current memory-augmented agents struggle with interference and multi-target reasoning in long, dynamic contexts.

Method

LongMINT evaluates memory-augmented agents using long, interconnected contexts with frequent updates, across diverse domains and question types (single-target recall, multi-target aggregation).
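
The difference between the two question types can be illustrated with a toy example. The sketch below is not LongMINT's data format or evaluation code: the `Update` record, the `value_at` and `first_mention` helpers, and the fruit inventory are all hypothetical, chosen only to show how a later revision can defeat a memory that holds on to stale facts, and how that error carries into an aggregate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    """One fact asserted at some position in a long context (hypothetical schema)."""
    step: int     # position of the update in the context stream
    entity: str   # what the fact is about
    value: int    # value asserted at this step


def value_at(stream, entity, as_of):
    """Reference answer: the most recent value of `entity` at or before `as_of`."""
    hits = [u for u in stream if u.entity == entity and u.step <= as_of]
    if not hits:
        raise KeyError(f"no value for {entity!r} at or before step {as_of}")
    return max(hits, key=lambda u: u.step).value


def first_mention(stream, entity):
    """Naive memory that stores the first mention and ignores later revisions."""
    hits = [u for u in stream if u.entity == entity]
    if not hits:
        raise KeyError(f"no value for {entity!r}")
    return min(hits, key=lambda u: u.step).value


# Toy context: two of the three entities are revised later in the stream.
stream = [
    Update(1, "apples", 3),
    Update(2, "pears", 5),
    Update(3, "apples", 7),  # revises the fact from step 1
    Update(4, "plums", 2),
    Update(5, "pears", 4),   # revises the fact from step 2
]
entities = ["apples", "pears", "plums"]
final_step = max(u.step for u in stream)

# Single-target recall: one entity, one answer.
gold_single = value_at(stream, "apples", as_of=final_step)
naive_single = first_mention(stream, "apples")

# Multi-target aggregation: combine the current values of several entities.
gold_total = sum(value_at(stream, e, as_of=final_step) for e in entities)
naive_total = sum(first_mention(stream, e) for e in entities)

print("single-target:", gold_single, "vs naive", naive_single)
print("multi-target :", gold_total, "vs naive", naive_total)
```

In this toy setup a single stale entry is enough to make the aggregate wrong, which is one plausible reason multi-target aggregation is harder than single-target recall. The `as_of` parameter also allows questions about earlier states, for example `value_at(stream, "apples", as_of=2)`.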

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.