DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding

1916-02-17 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Document AI · Depth: Expert, extended

Summary

DocSeeker is a Multimodal Large Language Model (MLLM) built on Qwen-2.5-VL-7B that targets two causes of performance degradation in long document understanding: a low Signal-to-Noise Ratio (SNR) and scarce supervision. It follows an "Analysis, Localization and Reasoning" (ALR) visual reasoning paradigm, inspired by human cognition, that explicitly grounds reasoning in specific document pages. Training has two stages: Supervised Fine-Tuning (SFT) on high-quality ALR Chain-of-Thought (CoT) data distilled from Gemini-2.5-Flash, followed by Evidence-aware Group Relative Policy Optimization (EviGRPO), which jointly optimizes evidence localization and answer accuracy. An Evidence-Guided Resolution Allocation (EGRA) strategy eases memory constraints during training by allocating different image resolutions to different pages. In experiments, DocSeeker achieves 30-60% performance gains on five document VQA benchmarks, generalizes robustly to ultra-long documents, and complements visual Retrieval-Augmented Generation (RAG) systems.

Key takeaway

For AI Engineers developing MLLMs for long document understanding, DocSeeker's ALR paradigm offers a robust approach to improve accuracy and generalization. You should consider adopting structured reasoning workflows with explicit evidence grounding and a two-stage training process that combines supervised fine-tuning with reinforcement learning. This method, coupled with resolution allocation strategies, can significantly enhance performance on ultra-long documents and improve synergy with RAG systems.

Key insights

DocSeeker enhances long document MLLM performance via structured reasoning, two-stage training, and resolution allocation.

Principles

Explicit evidence grounding improves interpretability and reduces noise.
Combining SFT with RL refines reasoning and localization capabilities.
Differentiated resolution allocation optimizes memory and signal-to-noise ratio.
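The resolution principle can be illustrated with a minimal sketch. It assumes the evidence pages of a training sample are already known and that each page gets a pixel budget before image encoding. The function name `allocate_resolutions`, the 0-based page indices, and the two pixel budgets are hypothetical choices for illustration; they are not taken from the DocSeeker paper or repository.

```python
from typing import Iterable, List


def allocate_resolutions(
    num_pages: int,
    evidence_pages: Iterable[int],
    high_pixels: int = 1_000_000,  # assumed budget for evidence pages
    low_pixels: int = 200_000,     # assumed budget for all other pages
) -> List[int]:
    """Return one pixel budget per page (0-based page indices)."""
    evidence = set(evidence_pages)
    for page in evidence:
        if not 0 <= page < num_pages:
            raise ValueError(f"evidence page {page} is outside 0..{num_pages - 1}")
    # Evidence pages keep fine detail; the rest are downscaled to save memory.
    return [high_pixels if i in evidence else low_pixels for i in range(num_pages)]


budgets = allocate_resolutions(num_pages=6, evidence_pages=[2, 4])
total_pixels = sum(budgets)
uniform_pixels = 6 * 1_000_000  # cost if every page used the high budget
```

With these assumed budgets, the six-page example spends 2.8 million pixels instead of 6 million, while the two evidence pages keep their full budget.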

Method

DocSeeker is trained in two stages: SFT on distilled ALR CoT data, then EviGRPO to jointly optimize evidence localization and answer accuracy. EGRA manages memory by varying page resolutions during training.
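For readers unfamiliar with Group Relative Policy Optimization, the sketch below shows the group-relative advantage step that GRPO-style methods generally use: several responses are sampled for the same prompt, and each reward is normalized against the group's mean and standard deviation. This is a generic illustration, not EviGRPO itself. The function name, the use of the population standard deviation, and the `eps` value are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize the rewards of one sampled group (must be non-empty)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    # eps avoids division by zero when every response earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one question: two rewarded, two not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that score above the group average receive positive advantages and the others negative ones. When all rewards in a group are equal, every advantage is zero and that group contributes no learning signal.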

In practice

Implement page-aware input representation for visual tokens.
Use a multi-faceted reward function for evidence localization and answer accuracy.
Apply differential resolution for evidence vs. non-evidence pages in long documents.
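One way to picture a reward that covers both evidence localization and answer accuracy is sketched below. It assumes the model outputs a set of predicted evidence pages and a short answer string. Scoring localization with page-level F1, scoring answers with normalized exact match, and weighting both at 0.5 are all illustrative assumptions; the source does not specify the reward terms used by EviGRPO.

```python
from typing import Iterable


def page_f1(predicted: Iterable[int], gold: Iterable[int]) -> float:
    """F1 overlap between predicted and gold evidence page sets."""
    pred, ref = set(predicted), set(gold)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return 1.0 if pred == ref else 0.0
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())


def combined_reward(
    pred_pages: Iterable[int],
    gold_pages: Iterable[int],
    pred_answer: str,
    gold_answer: str,
    w_evidence: float = 0.5,  # hypothetical weight
    w_answer: float = 0.5,    # hypothetical weight
) -> float:
    """Weighted sum of evidence-localization and answer-accuracy scores."""
    answer_score = 1.0 if normalize(pred_answer) == normalize(gold_answer) else 0.0
    return w_evidence * page_f1(pred_pages, gold_pages) + w_answer * answer_score


reward = combined_reward([3, 7], [3], "Net revenue", "net  revenue")
```

In this example the answer matches after normalization, but the extra predicted page lowers the localization score, so the reward falls below the maximum of 1.0. A reward of this shape penalizes a correct answer that cites the wrong pages.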

Topics

DocSeeker
Long Document Understanding
Multimodal Large Language Models
Analysis-Localization-Reasoning
Evidence Grounding

Code references

yh-hust/DocSeeker

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.