[R] IDP Leaderboard: Open benchmark for document AI across 16 VLMs, 9,000+ documents, 3 benchmark suites
Summary
The IDP Leaderboard introduces an open evaluation framework for document understanding, assessing 16 large vision-language models (VLMs) across three benchmark suites: OlmOCR, OmniDoc, and IDP Core. The IDP Core benchmark specifically covers key information extraction (KIE), table extraction, visual question answering (VQA), optical character recognition (OCR), classification, and long document processing. Key findings indicate that Gemini 3.1 Pro leads overall with a score of 83.2, though the top five models are closely clustered within 2.4 points. Cheaper model variants like Flash and Sonnet achieve nearly identical extraction quality to their flagship counterparts, with differentiation primarily appearing in reasoning-heavy tasks such as VQA. GPT-5.4 demonstrates a substantial improvement over GPT-4.1, jumping from 70 to 81 overall and from 42% to 91% on DocVQA. Sparse unstructured tables remain the most challenging task, with most models scoring below 55%, and handwriting OCR performance peaks at 76%. A Results Explorer is also provided, allowing users to view ground truth alongside raw model predictions for each document.
Key takeaway
Teams evaluating document AI solutions should prioritize practical considerations like data cleanup, schema alignment, and post-processing over marginal differences in top-tier VLM scores. Given the tight performance gap among leading models, the "boring parts" of implementation often dictate real-world success. Use the Results Explorer to compare model predictions against ground truth side by side; this is more informative than raw scores when selecting a model that fits your specific document types and extraction needs.
Key insights
An open benchmark evaluates 16 VLMs on document understanding, revealing tight performance margins among top models.
Principles
- Cheaper model variants can match flagship extraction quality.
- Reasoning tasks differentiate VLM performance.
- Sparse tables and handwriting OCR remain challenging.
Method
The IDP Leaderboard evaluates VLMs on three benchmark suites: OlmOCR, OmniDoc, and IDP Core. IDP Core covers KIE, table extraction, VQA, OCR, classification, and long document processing. A Results Explorer shows ground truth next to raw model predictions for detailed analysis.
In practice
- Review model predictions against ground truth.
- Prioritize data cleanup over marginal model gains.
- Focus on post-processing for real-world table extraction.
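The first two practices above can be tried on your own documents with a few lines of code. The following is a minimal sketch, not the leaderboard's scoring method: it assumes key information extraction output is available as flat dictionaries of field name to string value, and the helper names (`normalize`, `field_accuracy`, `mismatches`) and the sample invoice fields are hypothetical.

```python
import re
from typing import Dict, Optional, Tuple


def normalize(value: Optional[str]) -> str:
    """Lowercase, trim, and collapse whitespace so formatting noise is not counted as an error."""
    return re.sub(r"\s+", " ", (value or "").strip().lower())


def field_accuracy(truth: Dict[str, str], pred: Dict[str, Optional[str]]) -> float:
    """Fraction of ground-truth fields whose normalized prediction matches exactly."""
    if not truth:
        return 0.0
    hits = sum(
        1 for key, expected in truth.items()
        if normalize(pred.get(key)) == normalize(expected)
    )
    return hits / len(truth)


def mismatches(
    truth: Dict[str, str], pred: Dict[str, Optional[str]]
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each field that still differs after normalization to (expected, predicted)."""
    return {
        key: (expected, pred.get(key))
        for key, expected in truth.items()
        if normalize(pred.get(key)) != normalize(expected)
    }


if __name__ == "__main__":
    # Hypothetical ground truth and model output for one invoice.
    truth = {"invoice_number": "INV-001", "total": "1,250.00", "vendor": "Acme Corp"}
    pred = {"invoice_number": "inv-001 ", "total": "1250.00", "vendor": "Acme  Corp"}

    print(f"field accuracy: {field_accuracy(truth, pred):.2f}")
    for field, (expected, predicted) in mismatches(truth, pred).items():
        print(f"{field}: expected {expected!r}, got {predicted!r}")
```

In this sample, the only remaining mismatch is the `total` field, where the model dropped the thousands separator. Listing mismatches per field, rather than reading a single score, shows whether an error calls for a different model or just one more cleanup rule.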
Topics
- Document AI
- Large Vision Models
- Document Understanding Benchmarks
- Information Extraction
- Optical Character Recognition
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.