How Far Is Document Parsing from Solved? PureDocBench: A Source-Traceable Benchmark across Clean, Degraded, and Real-World Settings

2026-05-08 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Document Understanding · Depth: Expert, medium

Summary

PureDocBench is a new, programmatically generated, source-traceable benchmark for document parsing models, introduced to address issues with the widely used OmniDocBench dataset. OmniDocBench, with 1,355 pages and 21,353 evaluator-scored blocks, was found to contain 2,580 errors (12.08%) and faces contamination risks due to its public availability. PureDocBench covers 10 domains, 66 subcategories, and 1,475 pages, each rendered in clean, digitally degraded, and real-degraded versions, totaling 4,425 images. Evaluations of 40 models, including pipeline specialists, end-to-end specialists, and general-purpose VLMs, reveal that the best model scores only ~74 out of 100, with a 44.6-point gap between top and bottom performers. Specialist parsers with <=4B parameters often rival or surpass VLMs 5-100x larger, though formula recognition remains a bottleneck, with no model exceeding 67%. General VLMs show greater robustness to degradation, losing only 0.99/8.52 points under digital/real degradation, compared to 4.90/14.21 for pipeline specialists, indicating that clean-only evaluations are misleading for deployment.
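The headline figures in the summary can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted above; the variable names are illustrative and do not come from the paper.

```python
# Figures quoted in the summary.
omni_blocks = 21_353   # evaluator-scored blocks in OmniDocBench
omni_errors = 2_580    # errors reported in those blocks
pages = 1_475          # PureDocBench pages
versions = 3           # clean, digitally degraded, real-degraded

# Error rate as a percentage, and total image count.
error_rate_pct = round(100 * omni_errors / omni_blocks, 2)
total_images = pages * versions

# Points lost relative to clean input under (digital, real) degradation.
drops = {
    "general_vlm": (0.99, 8.52),
    "pipeline_specialist": (4.90, 14.21),
}

# How many more points pipeline specialists lose than general VLMs.
extra_loss = tuple(
    round(p - g, 2)
    for g, p in zip(drops["general_vlm"], drops["pipeline_specialist"])
)

print(error_rate_pct, total_images, extra_loss)
```

The error rate and image total match the reported 12.08% and 4,425, and the gap in degradation loss between the two model families widens from digital to real degradation.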

Key takeaway

AI engineers and research scientists evaluating document parsing models for real-world deployment should prioritize benchmarks that include degraded data and verifiable annotations, such as PureDocBench. Clean-only evaluation can misrepresent performance and lead to poor model selection: general VLMs lose far fewer points under degradation than pipeline specialists. Favor models that perform well across varied conditions, not just ideal ones.

Key insights

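Document parsing is far from solved: the best of 40 models scores only about 74 out of 100. Specialist parsers with at most 4B parameters often rival or surpass general VLMs 5-100x larger, but general VLMs lose fewer points under degradation.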

Principles

Benchmark quality impacts model evaluation.
Degradation reveals true model robustness.
Specialized models can be more efficient.

Method

PureDocBench programmatically generates document images from HTML/CSS, producing verifiable annotations from the same source across clean, digitally degraded, and real-degraded settings for comprehensive evaluation.
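The idea of source traceability can be illustrated with a minimal sketch: if the page markup and the ground-truth annotations are both derived from one structured source, the labels cannot drift from what is rendered. The block schema, tag mapping, and function names below are hypothetical and are not taken from the PureDocBench pipeline, which the summary describes only as HTML/CSS-based. Rendering the HTML to an image (for example with a headless browser) is omitted.

```python
from html import escape

# Hypothetical structured source for one page.
blocks = [
    {"type": "title", "text": "Quarterly Report"},
    {"type": "paragraph", "text": "Revenue grew 4% & costs fell."},
    {"type": "formula", "text": "E = mc^2"},
]

# Hypothetical mapping from block type to HTML tag.
TAGS = {"title": "h1", "paragraph": "p", "formula": "code"}


def to_html(blocks):
    """Build the page markup; each element carries the id of its source block."""
    rows = []
    for i, block in enumerate(blocks):
        tag = TAGS[block["type"]]
        rows.append(f'<{tag} data-block-id="{i}">{escape(block["text"])}</{tag}>')
    body = "\n".join(rows)
    return f"<html><body>\n{body}\n</body></html>"


def to_annotations(blocks):
    """Derive ground truth from the same source, so every label is traceable."""
    return [
        {"id": i, "type": block["type"], "text": block["text"]}
        for i, block in enumerate(blocks)
    ]


page_html = to_html(blocks)
annotations = to_annotations(blocks)
```

Because both outputs share block ids, a disputed annotation can be traced back to the exact source entry that produced the rendered element.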

In practice

Prioritize benchmarks with source-traceable annotations.
Evaluate models across diverse degradation levels.
Consider smaller, specialized models for efficiency.
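One way to act on these points is to report per-condition drops alongside the clean score and to rank candidates by their worst-case condition. The sketch below assumes you already have a score per model and condition; the model names and scores are made up for illustration and are not results from the paper.

```python
def degradation_drops(scores):
    """Points lost relative to the clean score for each degraded condition."""
    clean = scores["clean"]
    return {
        cond: round(clean - score, 2)
        for cond, score in scores.items()
        if cond != "clean"
    }


def rank_by_worst_case(scores_by_model):
    """Order models by their lowest score across all conditions, best first."""
    return sorted(
        scores_by_model,
        key=lambda model: min(scores_by_model[model].values()),
        reverse=True,
    )


# Hypothetical scores (0-100) for two models under three conditions.
scores_by_model = {
    "model_a": {"clean": 80.0, "digital": 79.0, "real": 71.5},
    "model_b": {"clean": 82.0, "digital": 77.0, "real": 68.0},
}

drops_a = degradation_drops(scores_by_model["model_a"])
ranking = rank_by_worst_case(scores_by_model)
```

In this made-up example, model_b wins on clean input but model_a ranks first on worst-case score, which is the kind of reversal a clean-only evaluation would hide.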

Topics

Document Parsing
PureDocBench
OmniDocBench
Benchmark Evaluation
Vision-Language Models

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.