MARDoc: A Memory-Aware Refinement Agent Framework for Multimodal Long Document QA
Summary
MARDoc, a Memory-Aware Refinement Agent framework, addresses limitations in existing iterative retrieval-reasoning agents for multimodal long-document question answering. Current systems often suffer from context noise and dilution due to a single growing context that mixes retrieval traces and reasoning. MARDoc decouples the QA process into three specialized agents: an Explorer for multi-granularity multimodal retrieval, a Refiner for distilling interaction traces into structured evidence and reasoning memories, and a Reflector for checking evidence sufficiency and providing targeted feedback. This framework relies on a dynamically updated structured memory, rather than a full accumulated interaction history, to reduce context noise while preserving answer-critical facts and their logical dependencies. Experiments on MMLongBench-Doc and DocBench demonstrate that MARDoc achieves strong results, outperforming same-backbone baselines and validating the effectiveness of structured memory for agentic document QA.
Key takeaway
For machine learning engineers building multimodal long-document QA systems, MARDoc offers a framework for mitigating context dilution and improving multi-hop reasoning. Consider a decoupled agent architecture with structured memory to improve evidence distillation and iterative refinement. The reported gains over same-backbone baselines on MMLongBench-Doc and DocBench suggest this design can yield more accurate answers on complex documents.
Key insights
MARDoc uses a structured, dynamically updated memory to overcome context dilution in multimodal long-document QA agents.
Principles
- Decouple retrieval, refinement, and reflection.
- Structured memory reduces context noise.
- Iterative assessment improves reasoning quality.
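To make the structured-memory principle concrete, here is a minimal Python sketch of a memory object that holds distilled evidence and reasoning notes and is rendered into the next prompt in place of the full interaction history. The class names, fields, and the sample fact are illustrative assumptions, not the schema used by MARDoc.

```python
from dataclasses import dataclass, field


@dataclass
class EvidenceItem:
    """One distilled fact plus where it came from (hypothetical schema)."""
    fact: str
    source: str  # e.g. "page 12, table 3"


@dataclass
class StructuredMemory:
    """Compact state carried between steps instead of the full trace."""
    evidence: list[EvidenceItem] = field(default_factory=list)
    reasoning: list[str] = field(default_factory=list)

    def add_evidence(self, fact: str, source: str) -> None:
        # Skip facts already stored so repeated retrievals do not bloat memory.
        if all(item.fact != fact for item in self.evidence):
            self.evidence.append(EvidenceItem(fact, source))

    def add_reasoning(self, step: str) -> None:
        self.reasoning.append(step)

    def render(self) -> str:
        """Serialize the memory into the context for the next step."""
        lines = ["Evidence:"]
        lines += [f"- {e.fact} [{e.source}]" for e in self.evidence]
        lines.append("Reasoning:")
        lines += [f"- {r}" for r in self.reasoning]
        return "\n".join(lines)


# Made-up example data, for illustration only.
memory = StructuredMemory()
memory.add_evidence("Revenue in 2022 was $4.1M", "page 12, table 3")
memory.add_evidence("Revenue in 2022 was $4.1M", "page 12, table 3")  # duplicate, ignored
memory.add_reasoning("Need 2021 revenue to compute growth")
print(memory.render())
```

Keeping the source alongside each fact is one way to preserve the link between an answer-critical fact and its origin; how MARDoc encodes logical dependencies is not specified in this summary.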
Method
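MARDoc employs an Explore-Refine-Reflect loop. The Explorer performs multi-granularity multimodal retrieval, the Refiner distills the resulting interaction traces into structured evidence and reasoning memories, and the Reflector checks whether the evidence is sufficient and returns targeted feedback for the next round.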
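The control flow of such a loop can be sketched as follows. This is a minimal illustration under stated assumptions: the function signatures, the dictionary-based memory, the `max_rounds` cap, and the toy agents are all hypothetical stand-ins, and a real system would call a model inside each agent.

```python
from typing import Callable

Memory = dict[str, list[str]]  # hypothetical layout: "evidence" and "reasoning" lists


def run_loop(
    question: str,
    explore: Callable[[str, Memory, str], list[str]],
    refine: Callable[[str, Memory, list[str]], Memory],
    reflect: Callable[[str, Memory], tuple[bool, str]],
    max_rounds: int = 3,
) -> Memory:
    """Run explore -> refine -> reflect until evidence is judged sufficient."""
    memory: Memory = {"evidence": [], "reasoning": []}
    feedback = ""
    for _ in range(max_rounds):
        trace = explore(question, memory, feedback)      # raw retrieval trace
        memory = refine(question, memory, trace)         # distill trace into memory
        sufficient, feedback = reflect(question, memory)  # check and give feedback
        if sufficient:
            break
    return memory


# Toy agents over a two-page "document" (made-up data).
PAGES = {"p1": "The model has 7B parameters.", "p2": "Training used 2T tokens."}


def toy_explore(question: str, memory: Memory, feedback: str) -> list[str]:
    page = feedback or "p1"  # follow the Reflector's hint when there is one
    return [f"{page}: {PAGES[page]}"]


def toy_refine(question: str, memory: Memory, trace: list[str]) -> Memory:
    new = [t for t in trace if t not in memory["evidence"]]
    return {
        "evidence": memory["evidence"] + new,
        "reasoning": memory["reasoning"] + [f"read {len(trace)} snippet(s)"],
    }


def toy_reflect(question: str, memory: Memory) -> tuple[bool, str]:
    found = any("tokens" in e for e in memory["evidence"])
    return found, ("" if found else "p2")  # point the Explorer at page p2


result = run_loop("How many tokens were used?", toy_explore, toy_refine, toy_reflect)
print(result["evidence"])
```

Only the memory and the latest feedback cross round boundaries here; the raw trace is discarded after refinement, which is the property the summary attributes to the framework.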
In practice
- Use MinerU2.5 for precise document parsing.
- Generate visual descriptions with Qwen3-VL-235B-A22B-Instruct.
- Implement multi-granularity toolsets for retrieval.
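As a rough illustration of the last point, the sketch below exposes two retrieval tools at different granularities, a coarse page-level search and a fine paragraph-level search, over already-parsed text. The `Toolset` class, its method names, and the word-overlap scoring are assumptions made for this example; they are not the paper's tools, and no parser or vision-language model is called.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Paragraph:
    page: int
    text: str


def _score(query: str, text: str) -> int:
    """Count query words that appear in the text (toy lexical match)."""
    words = set(text.lower().split())
    return sum(1 for w in query.lower().split() if w in words)


class Toolset:
    """Hypothetical retrieval tools at two granularities."""

    def __init__(self, paragraphs: list[Paragraph]) -> None:
        self.paragraphs = paragraphs

    def search_pages(self, query: str, top_k: int = 2) -> list[int]:
        """Coarse tool: rank pages by the summed score of their paragraphs."""
        totals: dict[int, int] = {}
        for p in self.paragraphs:
            totals[p.page] = totals.get(p.page, 0) + _score(query, p.text)
        ranked = sorted(totals, key=lambda page: (-totals[page], page))
        return [page for page in ranked if totals[page] > 0][:top_k]

    def search_paragraphs(
        self, query: str, page: Optional[int] = None, top_k: int = 2
    ) -> list[Paragraph]:
        """Fine tool: rank paragraphs, optionally restricted to one page."""
        pool = [p for p in self.paragraphs if page is None or p.page == page]
        ranked = sorted(pool, key=lambda p: -_score(query, p.text))
        return [p for p in ranked if _score(query, p.text) > 0][:top_k]


# Made-up parsed content, for illustration only.
tools = Toolset([
    Paragraph(1, "Figure 1 shows the system overview"),
    Paragraph(2, "Table 2 reports accuracy on the test set"),
    Paragraph(2, "Accuracy improves with larger memory"),
])
pages = tools.search_pages("accuracy test")
best = tools.search_paragraphs("accuracy test", page=pages[0], top_k=1)
print(pages, best[0].text)
```

An agent can call the coarse tool first to narrow the search and then the fine tool on the selected page, which keeps each retrieval trace short.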
Topics
- Multimodal QA
- Long Document Processing
- Agentic AI
- Structured Memory
- Retrieval-Augmented Generation
- MMLongBench-Doc
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.