MARDoc: A Memory-Aware Refinement Agent Framework for Multimodal Long Document QA

Source: cs.AI updates on arXiv.org · Field: Technology & Digital: Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, long

Summary

MARDoc, a Memory-Aware Refinement Agent framework, addresses limitations in existing iterative retrieval-reasoning agents for multimodal long-document question answering. Current systems often suffer from context noise and dilution due to a single growing context that mixes retrieval traces and reasoning. MARDoc decouples the QA process into three specialized agents: an Explorer for multi-granularity multimodal retrieval, a Refiner for distilling interaction traces into structured evidence and reasoning memories, and a Reflector for checking evidence sufficiency and providing targeted feedback. This framework relies on a dynamically updated structured memory, rather than a full accumulated interaction history, to reduce context noise while preserving answer-critical facts and their logical dependencies. Experiments on MMLongBench-Doc and DocBench demonstrate that MARDoc achieves strong results, outperforming same-backbone baselines and validating the effectiveness of structured memory for agentic document QA.

Key takeaway

For machine learning engineers building multimodal long-document QA systems, MARDoc offers a framework for reducing context dilution during multi-hop reasoning. Consider a decoupled agent architecture with structured memory, so that evidence distillation and iterative refinement operate on compact memories rather than on the full interaction history. In the reported experiments on MMLongBench-Doc and DocBench, this approach outperformed baselines built on the same backbone.

Key insights

MARDoc uses a structured, dynamically updated memory to overcome context dilution in multimodal long-document QA agents.
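The contrast between a structured memory and a single growing context can be illustrated with a minimal sketch. It is not MARDoc's implementation: the `StructuredMemory` class, its fields, and the toy traces are all hypothetical, and the revenue figures are made-up illustration data rather than anything from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class StructuredMemory:
    """Hypothetical memory holding distilled evidence and reasoning notes."""
    evidence: dict[str, str] = field(default_factory=dict)  # location -> distilled fact
    reasoning: list[str] = field(default_factory=list)      # ordered reasoning notes

    def update(self, new_evidence: dict[str, str], new_reasoning: list[str]) -> None:
        # Keyed overwrite: revisiting a location refines its entry instead of duplicating it.
        self.evidence.update(new_evidence)
        for step in new_reasoning:
            if step not in self.reasoning:
                self.reasoning.append(step)

    def render(self) -> str:
        # Compact text that would be handed to the next step in place of the raw history.
        facts = "\n".join(f"[{loc}] {fact}" for loc, fact in self.evidence.items())
        steps = "\n".join(f"- {s}" for s in self.reasoning)
        return f"Evidence:\n{facts}\nReasoning:\n{steps}"


# Baseline for comparison: every raw trace is appended to one growing context.
raw_history: list[str] = []
memory = StructuredMemory()

# Round 1 (made-up trace): a noisy retrieval dump and the single fact distilled from it.
raw_history.append("search('revenue 2021') -> page 12: long table dump ... " * 20)
memory.update({"page 12": "2021 revenue was 4.2M"},
              ["Need 2022 revenue to compute growth"])

# Round 2 (made-up trace): page 12 is revisited and page 47 is new.
raw_history.append("search('revenue 2022') -> pages 12, 47: more raw text ... " * 20)
memory.update({"page 12": "2021 revenue was 4.2M (Table 3)",
               "page 47": "2022 revenue was 5.0M"},
              ["Need 2022 revenue to compute growth"])

history_chars = sum(len(t) for t in raw_history)
memory_chars = len(memory.render())
print(f"raw history: {history_chars} chars, structured memory: {memory_chars} chars")
```

In this toy setup the raw history grows with every retrieval, while the memory grows only when a new fact or reasoning note is added. How MARDoc actually structures and updates its memories is not specified in this summary.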

Method

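MARDoc employs an Explore-Refine-Reflect loop. The Explorer performs multi-granularity multimodal retrieval, the Refiner distills the resulting interaction traces into structured evidence and reasoning memories, and the Reflector checks whether the evidence is sufficient and returns targeted feedback when it is not.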
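The control flow of such a loop can be sketched as follows. This is a toy under stated assumptions, not the paper's code: the three functions stand in for what are presumably model-backed agents, the keyword search, the `needed` sufficiency check, and the `max_rounds` cap are hypothetical, and `DOCUMENT` holds made-up text.

```python
from dataclasses import dataclass, field
from typing import Optional

# Made-up three-page "document" keyed by page number.
DOCUMENT = {
    1: "Overview of the annual report.",
    12: "Revenue in 2021 was 4.2M.",
    47: "Revenue in 2022 was 5.0M.",
}


@dataclass
class Memory:
    evidence: dict[int, str] = field(default_factory=dict)  # page -> retained fact
    reasoning: list[str] = field(default_factory=list)      # notes on what each step found


def explore(query: str) -> list[tuple[int, str]]:
    """Explorer stand-in: return pages whose text contains every query term."""
    terms = query.lower().split()
    return [(page, text) for page, text in DOCUMENT.items()
            if all(term in text.lower() for term in terms)]


def refine(memory: Memory, query: str, trace: list[tuple[int, str]]) -> None:
    """Refiner stand-in: fold the raw trace into evidence and reasoning memories."""
    for page, text in trace:
        memory.evidence[page] = text
    memory.reasoning.append(f"'{query}' matched pages {[page for page, _ in trace]}")


def reflect(memory: Memory, needed: list[str]) -> Optional[str]:
    """Reflector stand-in: None if evidence suffices, else a follow-up query."""
    joined = " ".join(memory.evidence.values())
    for item in needed:
        if item not in joined:
            return f"revenue {item}"  # targeted feedback for the missing item
    return None


def run(needed: list[str], max_rounds: int = 4) -> tuple[Memory, int]:
    memory = Memory()
    query = f"revenue {needed[0]}"
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        trace = explore(query)                # Explore
        refine(memory, query, trace)          # Refine
        feedback = reflect(memory, needed)    # Reflect
        if feedback is None:
            break
        query = feedback                      # feedback drives the next retrieval
    return memory, rounds


memory, rounds = run(["2021", "2022"])
print(rounds, sorted(memory.evidence))
```

Only the distilled `Memory` is carried between rounds here; the raw `trace` is discarded after each refine step, which mirrors the summary's point about not relying on the full accumulated interaction history.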

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.