MARDoc: A Memory-Aware Refinement Agent Framework for Multimodal Long Document QA

Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Expert, medium

Summary

MARDoc is a Memory-Aware Refinement Agent framework for multimodal long-document question answering (QA). It targets a weakness of existing iterative retrieval-reasoning agents: a single, growing context mixes retrieval traces, observations, and intermediate reasoning, which scatters the evidence and makes multi-hop reasoning noisy. MARDoc decouples the QA process into three specialized agents: an Explorer for multi-granularity multimodal retrieval, a Refiner that distills interaction traces into structured evidence and reasoning memories, and a Reflector that checks whether the evidence is sufficient and provides targeted feedback. The agents work from a dynamically updated structured memory instead of the full accumulated interaction history, which reduces context noise while preserving critical facts and their logical dependencies. In experiments on MMLongBench-Doc and DocBench, MARDoc outperforms baselines built on the same backbone, supporting the effectiveness of structured memory for agentic document QA.

Key takeaway

For machine learning engineers building multimodal long-document QA systems, MARDoc points to two design choices worth considering. First, decouple the QA pipeline into specialized agents for retrieval, refinement, and reflection. Second, replace the monolithic, ever-growing context with a dynamically updated structured memory. According to the reported results, this reduces context noise while preserving critical evidence and its logical dependencies, which supports more accurate multi-hop reasoning over long, complex documents.
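To make the idea of a structured memory more concrete, here is a minimal Python sketch. It is not MARDoc's implementation: the class names (`Evidence`, `ReasoningStep`, `StructuredMemory`), the fields, and the example facts are all assumptions made for illustration. It only shows one way to keep distilled facts separate from the reasoning steps that depend on them, so the prompt carries a compact record rather than the full interaction history.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Evidence:
    fact: str      # distilled statement, not the raw retrieval output
    page: int      # provenance (hypothetical field)
    modality: str  # e.g. "text", "table", "figure"


@dataclass
class ReasoningStep:
    claim: str
    depends_on: tuple[int, ...]  # indices into StructuredMemory.evidence


@dataclass
class StructuredMemory:
    evidence: list[Evidence] = field(default_factory=list)
    reasoning: list[ReasoningStep] = field(default_factory=list)

    def add_evidence(self, item: Evidence) -> int:
        """Store a fact once and return its index."""
        if item in self.evidence:
            return self.evidence.index(item)
        self.evidence.append(item)
        return len(self.evidence) - 1

    def add_reasoning(self, claim: str, depends_on: tuple[int, ...]) -> None:
        """Record an intermediate conclusion and the evidence it rests on."""
        for i in depends_on:
            if not 0 <= i < len(self.evidence):
                raise IndexError(f"unknown evidence index {i}")
        self.reasoning.append(ReasoningStep(claim, depends_on))

    def render(self) -> str:
        """Compact prompt text: facts first, then claims with their supports."""
        lines = [
            f"[E{i}] (p.{e.page}, {e.modality}) {e.fact}"
            for i, e in enumerate(self.evidence)
        ]
        lines += [
            f"[R{j}] {r.claim} <- " + ", ".join(f"E{i}" for i in r.depends_on)
            for j, r in enumerate(self.reasoning)
        ]
        return "\n".join(lines)


# Invented example facts, used only to exercise the data structure.
memory = StructuredMemory()
a = memory.add_evidence(Evidence("Revenue in 2022 was 10M", 4, "table"))
b = memory.add_evidence(Evidence("Revenue in 2023 was 15M", 9, "figure"))
memory.add_evidence(Evidence("Revenue in 2022 was 10M", 4, "table"))  # duplicate, not stored again
memory.add_reasoning("Revenue grew by 5M from 2022 to 2023", (a, b))
print(memory.render())
```

Storing explicit `depends_on` links is one simple way to keep the logical dependencies between facts visible after the raw trace has been dropped. A real system would likely need richer provenance and a policy for revising or retiring stale entries.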

Key insights

MARDoc improves long-document QA by using specialized agents and structured memory to reduce context noise and preserve critical evidence.

Method

MARDoc employs an Explorer for multimodal retrieval, a Refiner for distilling traces into structured memories, and a Reflector for feedback, all relying on a dynamically updated structured memory.
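The control flow described above can be sketched as a small loop. This is a toy under stated assumptions, not the paper's code: the three functions are keyword-matching stand-ins for what would be model-driven agents, and the page contents, the `required` list of information needs, and all function signatures are hypothetical. The point is only the shape of the loop: explore, distill into memory, check sufficiency, and feed targeted feedback into the next round.

```python
from dataclasses import dataclass, field

# Toy "document": page number -> text. Contents are invented for illustration.
PAGES = {
    1: "Introduction and overview of the annual report.",
    4: "Table 2: revenue 2022 = 10M.",
    9: "Figure 5: revenue 2023 = 15M.",
}


@dataclass
class Memory:
    evidence: dict[str, str] = field(default_factory=dict)  # need -> distilled fact
    notes: list[str] = field(default_factory=list)          # reasoning memory


def explorer(query: str) -> list[tuple[int, str]]:
    """Stand-in retriever: return pages that mention every query keyword."""
    keywords = query.lower().split()
    return [(p, t) for p, t in PAGES.items()
            if all(k in t.lower() for k in keywords)]


def refiner(query: str, trace: list[tuple[int, str]], memory: Memory) -> None:
    """Distill the raw trace into memory; the trace itself is not kept."""
    for page, text in trace:
        memory.evidence[query] = f"p.{page}: {text}"
        memory.notes.append(f"'{query}' is supported by page {page}")


def reflector(required: list[str], memory: Memory) -> tuple[bool, str]:
    """Check sufficiency; if something is missing, name it as feedback."""
    missing = [need for need in required if need not in memory.evidence]
    return (not missing, missing[0] if missing else "")


def answer(required: list[str], max_rounds: int = 5) -> tuple[Memory, int]:
    memory = Memory()
    feedback = required[0]  # the first round starts from the first need
    for round_no in range(1, max_rounds + 1):
        trace = explorer(feedback)
        refiner(feedback, trace, memory)
        sufficient, feedback = reflector(required, memory)
        if sufficient:
            return memory, round_no
    return memory, max_rounds


memory, rounds = answer(["revenue 2022", "revenue 2023"])
print(rounds)
for need, fact in memory.evidence.items():
    print(need, "->", fact)
```

Each round passes only the Reflector's feedback and the structured memory forward, so the amount of state carried between rounds is bounded by what the Refiner chooses to keep rather than by the length of the interaction.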

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.