How Robust is OCR-Reasoning? Evaluating OCR-Reasoning Robustness of Vision-Language Models under Visual Perturbations
Summary
OCR-Robust is a new benchmark for systematically evaluating how robust Vision-Language Models (VLMs) are on OCR reasoning tasks under visual perturbations. The benchmark, detailed in 2606.26041, comprises 812 samples across two subsets: OCR1.0, covering documents, scene text, receipts, handwriting, and mathematical content, and OCR2.0, focusing on charts, geometry diagrams, and tables. After a pilot study of 18 perturbations, the researchers selected 5 representative types, each applied at 3 severity levels. Robustness is measured with four metrics: clean accuracy, Relative Corruption Retention (RCR), Worst-Case Retention (WCR), and a composite Corruption Robustness Index (CRI). Benchmarking 18 models, including proprietary systems and open-source VLMs, showed that high clean accuracy does not guarantee strong robustness, and that models degrade significantly on structure-sensitive OCR tasks, particularly charts and tables.
Key takeaway
If you are an AI Scientist or Machine Learning Engineer developing or deploying Vision-Language Models for OCR tasks, evaluate robustness in addition to standard clean accuracy. Your models may degrade sharply on structure-sensitive content such as charts and tables under visual perturbations, even if they perform well on clean document-like inputs. Integrate benchmarks like OCR-Robust into your development pipeline to identify and mitigate these vulnerabilities before deployment.
Key insights
The robustness of VLMs' OCR reasoning under visual degradation is critical, yet clean accuracy is often a poor predictor of it.
Principles
- Higher clean accuracy does not necessarily imply stronger OCR reasoning robustness.
- Structure-sensitive OCR tasks, such as charts and tables, are substantially more fragile under perturbations.
Method
The OCR-Robust benchmark evaluates VLM robustness using 5 visual perturbation types at 3 severity levels, measuring clean accuracy, RCR, WCR, and CRI across 812 samples.
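The summary names the metrics but does not give their formulas, so the following is only a sketch under assumed definitions: RCR is taken as the mean ratio of corrupted accuracy to clean accuracy over all perturbation and severity pairs, and WCR as the minimum of those ratios. The function name `retention_metrics` and the perturbation labels are hypothetical, and CRI is omitted because its composition is not described here.

```python
from statistics import mean


def retention_metrics(clean_acc, corrupted_acc):
    """Compute retention-style robustness scores.

    clean_acc: accuracy on unperturbed inputs, in (0, 1].
    corrupted_acc: dict mapping (perturbation, severity) -> accuracy.

    ASSUMPTION: RCR = mean(corrupted / clean), WCR = min(corrupted / clean).
    The paper's exact definitions may differ.
    """
    if clean_acc <= 0:
        raise ValueError("clean accuracy must be positive")
    if not corrupted_acc:
        raise ValueError("need at least one corrupted condition")

    # Fraction of clean accuracy retained under each condition.
    ratios = {key: acc / clean_acc for key, acc in corrupted_acc.items()}
    return {
        "clean": clean_acc,
        "rcr": mean(ratios.values()),
        "wcr": min(ratios.values()),
    }


# Illustrative numbers only, not results from the benchmark.
example = retention_metrics(
    0.8,
    {("blur", 1): 0.72, ("blur", 3): 0.40, ("noise", 2): 0.60},
)
```

Under these assumed definitions, two models with the same clean accuracy can differ widely in WCR, which is the kind of gap a clean-accuracy leaderboard would hide.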
In practice
- Use OCR-Robust to assess VLM performance beyond clean accuracy metrics.
- Prioritize robustness testing for VLMs processing structured data like charts or tables.
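One way to act on these points is to report accuracy per content category for every perturbation and severity, so that fragile categories such as charts or tables stand out. The sketch below is a minimal harness, not the benchmark's own tooling: `evaluate_grid`, the sample dictionary keys, and the toy `truncate` and `echo_model` stand-ins are all hypothetical, and exact string match is an assumed scoring rule.

```python
from collections import defaultdict


def evaluate_grid(model, samples, perturbations, severities=(1, 2, 3)):
    """Return accuracy keyed by (category, perturbation name, severity).

    model: callable (image, question) -> predicted answer string.
    samples: iterable of dicts with "image", "question", "answer", "category".
    perturbations: dict mapping a name to a callable (image, severity) -> image.
    ASSUMPTION: answers are scored by exact match after stripping whitespace.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for sample in samples:
        for name, perturb in perturbations.items():
            for severity in severities:
                image = perturb(sample["image"], severity)
                prediction = model(image, sample["question"])
                key = (sample["category"], name, severity)
                totals[key] += 1
                hits[key] += int(prediction.strip() == sample["answer"].strip())
    return {key: hits[key] / totals[key] for key in totals}


# Toy stand-ins so the harness can run without a real VLM or image library.
def truncate(image, severity):
    """Hypothetical 'perturbation': drop the last `severity` characters."""
    return image[:-severity]


def echo_model(image, question):
    """Hypothetical 'model': reads the image back verbatim."""
    return image
```

In a real pipeline, `model` would wrap your VLM's inference call and each perturbation would operate on image arrays. Keeping the category in the key makes it easy to compare document-like inputs against charts and tables.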
Topics
- Vision-Language Models
- OCR Robustness
- Visual Perturbations
- OCR-Robust Benchmark
- Document Understanding
- Chart Analysis
Best for: AI Engineer, Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.