RealDocBench: A Benchmark for Field-Level QA and Layout Understanding on Real-World Regulated Documents
Summary
RealDocBench is a new two-track benchmark designed to evaluate document parsing systems on real-world regulated documents, addressing limitations of existing benchmarks. The QA track features 1,356 field-level questions across 581 documents from mortgage, finance, supply chain, and medical/healthcare domains, using typed "gold_dict" answers and scoring per-field and strict per-question accuracy. The layout track comprises 1,500 human-verified page images with COCO-style bounding box annotations under a nine-class taxonomy, scored with an adjacency-aware matcher. Eighteen systems, including commercial APIs, general-purpose VLMs, and open-source OCR models, were evaluated under a uniform protocol, reporting accuracy, per-page cost, and cache-busted latency. Results reveal a wide performance spread, with Extend Performance v2 leading at 96.0% per-field and 90.9% per-question accuracy, and highlight a challenging medical sub-domain and significant cost/latency trade-offs.
Key takeaway
Directors of AI/ML evaluating document parsing solutions for regulated workflows should look beyond traditional benchmarks. RealDocBench shows that field-level accuracy, cost, and latency vary significantly across systems and document types. Prioritize solutions proven on real, messy documents, especially in challenging domains such as medical records. The selection process should explicitly weigh accuracy against operational cost and processing speed rather than rely on a single aggregate score.
Key insights
RealDocBench is a field-level benchmark built on real-world documents that exposes how document parsing systems actually perform, along with their cost and latency trade-offs.
Principles
- Field-level value extraction is paramount for regulated document processing.
- Real-world, messy documents are essential for robust parser evaluation.
- Comprehensive benchmarks must report cost, latency, and granular accuracy.
Method
A two-stage QA protocol passes parser markdown through a fixed extraction LLM, then scores against typed "gold_dict" answers using type-aware comparison and fuzzy matching.
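The scoring stage of this protocol can be illustrated with a minimal Python sketch. It is not the benchmark's implementation: this summary does not give the actual comparison rules, so the function names (`field_matches`, `score`), the 0.9 fuzzy threshold, the numeric tolerance, and the use of `difflib` for fuzzy matching are all assumptions chosen for illustration. The sketch shows how typed gold answers could drive a type-aware comparison, and how per-field accuracy differs from strict per-question accuracy, where every field of a question must match.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace before comparing strings."""
    return " ".join(str(text).lower().split())


def field_matches(predicted, gold, fuzzy_threshold=0.9, rel_tol=1e-6):
    """Compare one extracted value with its typed gold value.

    The threshold and tolerance are illustrative assumptions,
    not values taken from RealDocBench.
    """
    if predicted is None:
        return False
    # Check bool before numbers: bool is a subclass of int in Python.
    if isinstance(gold, bool):
        accepted = {"true", "yes"} if gold else {"false", "no"}
        return normalize(predicted) in accepted
    if isinstance(gold, (int, float)):
        try:
            cleaned = str(predicted).replace(",", "").replace("$", "").strip()
            value = float(cleaned)
        except ValueError:
            return False
        return abs(value - gold) <= rel_tol * max(1.0, abs(gold))
    # Fall back to fuzzy string matching for text fields.
    ratio = SequenceMatcher(None, normalize(predicted), normalize(gold)).ratio()
    return ratio >= fuzzy_threshold


def score(questions):
    """Return (per-field accuracy, strict per-question accuracy).

    `questions` is a list of (predicted_dict, gold_dict) pairs.
    """
    field_hits = field_total = question_hits = 0
    for predicted, gold in questions:
        results = [field_matches(predicted.get(key), value)
                   for key, value in gold.items()]
        field_hits += sum(results)
        field_total += len(results)
        # Strict: a question counts only if every field is correct.
        question_hits += bool(results) and all(results)
    per_field = field_hits / field_total if field_total else 0.0
    per_question = question_hits / len(questions) if questions else 0.0
    return per_field, per_question


# Hypothetical example data, invented for illustration only.
example = [
    ({"amount": "$1,250.00", "borrower": "Jane  Doe"},
     {"amount": 1250.0, "borrower": "jane doe"}),
    ({"amount": "300", "insured": "no"},
     {"amount": 350, "insured": False}),
]
per_field, per_question = score(example)
```

In this made-up example three of four fields match, but only one of the two questions has all fields correct, which is why strict per-question accuracy is always at or below per-field accuracy.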
In practice
- Prioritize field-level value accuracy for high-stakes document parsing.
- Scrutinize system performance on challenging domains like medical records.
- Balance parsing accuracy with per-page cost and latency for deployment.
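One way to weigh accuracy against cost and latency without collapsing them into a single score is to keep only the systems that no other system beats on all three axes (a Pareto front). The sketch below is an illustration under stated assumptions: the `SystemResult` type, the function names, and the numbers are hypothetical and are not results from the benchmark.

```python
from typing import NamedTuple


class SystemResult(NamedTuple):
    name: str
    accuracy: float       # higher is better
    cost_per_page: float  # lower is better
    latency_s: float      # lower is better


def dominates(a, b):
    """True if `a` is no worse than `b` on every axis and better on one."""
    no_worse = (a.accuracy >= b.accuracy
                and a.cost_per_page <= b.cost_per_page
                and a.latency_s <= b.latency_s)
    better = (a.accuracy > b.accuracy
              or a.cost_per_page < b.cost_per_page
              or a.latency_s < b.latency_s)
    return no_worse and better


def pareto_front(results):
    """Keep the systems that no other system dominates."""
    return [r for r in results
            if not any(dominates(other, r) for other in results)]


# Hypothetical numbers, invented for illustration only.
candidates = [
    SystemResult("system_a", 0.95, 0.05, 8.0),
    SystemResult("system_b", 0.90, 0.01, 2.0),
    SystemResult("system_c", 0.88, 0.02, 3.0),  # dominated by system_b
    SystemResult("system_d", 0.95, 0.06, 9.0),  # dominated by system_a
]
front_names = [r.name for r in pareto_front(candidates)]
```

The remaining systems represent genuine trade-offs, so the final choice among them depends on the workflow's accuracy requirements and budget.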
Topics
- Document Parsing
- Field-Level QA
- Layout Understanding
- Regulated Documents
- Benchmark Evaluation
- Cost-Accuracy Trade-offs
Best for: Research Scientist, AI Architect, AI Engineer, AI Scientist, Machine Learning Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.