RealDocBench: A Benchmark for Field-Level QA and Layout Understanding on Real-World Regulated Documents

Source: cs.CV updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics, Robotics & Autonomous Systems) · Depth: Advanced, extended

Summary

RealDocBench is a new two-track benchmark designed to evaluate document parsing systems on real-world regulated documents, addressing limitations of existing benchmarks. The QA track features 1,356 field-level questions across 581 documents from mortgage, finance, supply chain, and medical/healthcare domains, using typed "gold_dict" answers and scoring per-field and strict per-question accuracy. The layout track comprises 1,500 human-verified page images with COCO-style bounding box annotations under a nine-class taxonomy, scored with an adjacency-aware matcher. Eighteen systems, including commercial APIs, general-purpose VLMs, and open-source OCR models, were evaluated under a uniform protocol, reporting accuracy, per-page cost, and cache-busted latency. Results reveal a wide performance spread, with Extend Performance v2 leading at 96.0% per-field and 90.9% per-question accuracy, and highlight a challenging medical sub-domain and significant cost/latency trade-offs.
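To make the layout-track annotation format concrete, here is a minimal sketch of how two COCO-style boxes can be compared. COCO stores a box as [x, y, width, height] with the origin at the top-left corner. The helper names `coco_to_corners` and `iou` are illustrative, and plain intersection-over-union is only an assumed building block: the summary does not describe how the benchmark's adjacency-aware matcher works, so this is not that matcher.

```python
def coco_to_corners(box):
    """Convert a COCO-style [x, y, width, height] box to (x1, y1, x2, y2)."""
    x, y, w, h = box
    return x, y, x + w, y + h


def iou(box_a, box_b):
    """Plain intersection-over-union of two COCO-style boxes.

    Assumption: this is ordinary IoU, shown only to illustrate the
    box format. It is NOT the benchmark's adjacency-aware matcher.
    """
    ax1, ay1, ax2, ay2 = coco_to_corners(box_a)
    bx1, by1, bx2, by2 = coco_to_corners(box_b)

    # Overlap extent along each axis, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - intersection
    return intersection / union if union > 0 else 0.0
```

A predicted box would typically count as a match for a ground-truth box of the same class when their IoU exceeds a chosen threshold; the threshold and any adjacency handling used by the benchmark are not stated in this summary.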

Key takeaway

Directors of AI/ML evaluating document parsing solutions for regulated workflows should look beyond traditional benchmarks. RealDocBench demonstrates that field-level accuracy, cost, and latency vary significantly across systems and document types. Prioritize solutions proven on real, messy documents, especially in challenging domains such as medical records. Your selection process should explicitly weigh accuracy against operational cost and processing speed rather than rely on a single aggregate score.

Key insights

RealDocBench offers a field-level, real-world benchmark exposing true document parsing system performance and operational trade-offs.

Method

A two-stage QA protocol first passes each parser's markdown output through a fixed extraction LLM, then scores the extracted fields against typed "gold_dict" answers using type-aware comparison and fuzzy matching.
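The scoring stage can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's implementation: the field types ("number", "string", "exact"), the 0.9 fuzzy threshold, the use of `difflib.SequenceMatcher`, and every function and field name are hypothetical. The extraction LLM stage is omitted; the sketch takes its output as an already-extracted dictionary.

```python
from difflib import SequenceMatcher


def normalize_text(value):
    """Lowercase and collapse whitespace so trivial formatting is ignored."""
    return " ".join(str(value).lower().split())


def field_matches(predicted, gold, field_type, fuzzy_threshold=0.9):
    """Type-aware comparison of one extracted field against its gold value.

    Assumption: three hypothetical types. "number" compares numerically,
    "string" uses fuzzy matching, anything else needs a normalized exact match.
    """
    if predicted is None:
        return False
    if field_type == "number":
        try:
            number = float(str(predicted).replace(",", "").replace("$", ""))
        except ValueError:
            return False
        return abs(number - float(gold)) < 1e-6
    if field_type == "string":
        ratio = SequenceMatcher(
            None, normalize_text(predicted), normalize_text(gold)
        ).ratio()
        return ratio >= fuzzy_threshold
    return normalize_text(predicted) == normalize_text(gold)


def score_benchmark(examples):
    """Compute per-field and strict per-question accuracy.

    `examples` is a list of (predicted, gold_dict) pairs, where gold_dict
    maps a field name to a (field_type, gold_value) tuple.
    """
    fields_correct = fields_total = questions_correct = 0
    for predicted, gold_dict in examples:
        correct = sum(
            field_matches(predicted.get(name), value, field_type)
            for name, (field_type, value) in gold_dict.items()
        )
        fields_correct += correct
        fields_total += len(gold_dict)
        # Strict: a question counts only if every one of its fields is right.
        questions_correct += int(correct == len(gold_dict))
    return {
        "per_field": fields_correct / fields_total,
        "per_question": questions_correct / len(examples),
    }


# Hypothetical examples, invented purely for illustration.
examples = [
    (
        {"borrower_name": "jane a doe", "loan_amount": "$250,000.00"},
        {"borrower_name": ("string", "Jane A. Doe"),
         "loan_amount": ("number", 250000)},
    ),
    (
        {"invoice_date": "2024-03-01", "total": "95.9"},
        {"invoice_date": ("exact", "2024-03-01"), "total": ("number", 99.5)},
    ),
]
print(score_benchmark(examples))  # {'per_field': 0.75, 'per_question': 0.5}
```

The gap between the two numbers shows why the benchmark reports both: a single wrong field leaves per-field accuracy fairly high but fails the whole question under the strict metric.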

Best for: Research Scientist, AI Architect, AI Engineer, AI Scientist, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.