How Robust is OCR-Reasoning? Evaluating OCR-Reasoning Robustness of Vision-Language Models under Visual Perturbations

2026-06-24 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, medium

Summary

OCR-Robust is a new benchmark for systematically evaluating how robust Vision-Language Models (VLMs) are on OCR reasoning tasks under visual perturbations. Described in 2606.26041, it comprises 812 samples in two subsets: OCR1.0 (documents, scene text, receipts, handwriting, and mathematical content) and OCR2.0 (charts, geometry diagrams, and tables). After a pilot study of 18 perturbations, the researchers selected 5 representative types, each applied at 3 severity levels. Robustness is reported with four measures: clean accuracy, Relative Corruption Retention (RCR), Worst-Case Retention (WCR), and a composite Corruption Robustness Index (CRI). Benchmarking 18 models, including proprietary systems and open-source VLMs, showed that high clean accuracy does not guarantee strong robustness and that models degrade significantly on structure-sensitive OCR tasks, particularly charts and tables.

Key takeaway

If you are an AI Scientist or Machine Learning Engineer developing or deploying Vision-Language Models for OCR tasks, evaluate robustness comprehensively rather than relying on clean accuracy alone. Your models may degrade sharply on structure-sensitive content such as charts and tables under visual perturbations, even if they perform well on clean document-like inputs. Integrate benchmarks like OCR-Robust into your development pipeline to identify and mitigate these vulnerabilities before deployment.

Key insights

The robustness of VLMs' OCR reasoning under visual degradation is critical, yet it often does not track clean accuracy.

Principles

Higher clean accuracy does not necessarily imply stronger OCR reasoning robustness.
Structure-sensitive OCR tasks, such as charts and tables, are substantially more fragile under perturbations.

Method

The OCR-Robust benchmark evaluates VLM robustness using 5 visual perturbation types at 3 severity levels, measuring clean accuracy, RCR, WCR, and CRI across 812 samples.
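
The summary names the robustness measures but does not give their formulas, so the following is only a minimal sketch under stated assumptions: RCR is taken as the mean ratio of corrupted to clean accuracy over all perturbation conditions, WCR as the minimum such ratio, and CRI as the plain average of the two. These definitions, the function name retention_metrics, and the perturbation labels are hypothetical and may differ from the paper's.

```python
from statistics import mean


def retention_metrics(clean_acc, corrupted_acc):
    """Compute illustrative retention scores for one model.

    clean_acc: accuracy on unperturbed inputs, in (0, 1].
    corrupted_acc: dict mapping (perturbation, severity) -> accuracy.

    ASSUMPTION: the formulas below are placeholders chosen for
    illustration; the source summary does not define RCR, WCR, or CRI.
    """
    if clean_acc <= 0:
        raise ValueError("clean accuracy must be positive")
    if not corrupted_acc:
        raise ValueError("need at least one corrupted condition")

    # Fraction of clean accuracy retained under each condition.
    retention = {cond: acc / clean_acc for cond, acc in corrupted_acc.items()}

    rcr = mean(retention.values())   # assumed: average retention
    wcr = min(retention.values())    # assumed: worst single condition
    cri = (rcr + wcr) / 2            # assumed: simple composite
    return {"RCR": rcr, "WCR": wcr, "CRI": cri}


# Hypothetical numbers, not results from the benchmark.
example = retention_metrics(
    0.8,
    {("blur", 1): 0.8, ("blur", 2): 0.6, ("noise", 1): 0.4},
)
```

Under these assumed definitions, two models with the same clean accuracy can receive very different WCR values, which is one way the gap between clean accuracy and robustness would show up.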

In practice

Use OCR-Robust to assess VLM performance beyond clean accuracy metrics.
Prioritize robustness testing for VLMs processing structured data like charts or tables.
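
One way to act on these two points is to break evaluation results down by content category, so that fragile categories such as charts or tables are not hidden by an aggregate score. The sketch below assumes you already have per-sample results as (category, condition, correct) tuples; the record layout, the function name accuracy_by_category, and the condition labels are hypothetical and are not part of OCR-Robust itself.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Summarize clean vs. perturbed accuracy per content category.

    records: iterable of (category, condition, correct) tuples, where
    condition is "clean" or any perturbation label (assumed format).
    Returns {category: {"clean", "corrupted", "retention"}}.
    """
    # (category, is_clean) -> [number correct, number seen]
    counts = defaultdict(lambda: [0, 0])
    for category, condition, correct in records:
        key = (category, condition == "clean")
        counts[key][0] += int(bool(correct))
        counts[key][1] += 1

    report = {}
    for category in sorted({cat for cat, _ in counts}):
        clean = counts.get((category, True))
        corrupted = counts.get((category, False))
        # Skip categories lacking either split or with zero clean accuracy.
        if not clean or not corrupted or clean[0] == 0:
            continue
        clean_acc = clean[0] / clean[1]
        corrupted_acc = corrupted[0] / corrupted[1]
        report[category] = {
            "clean": clean_acc,
            "corrupted": corrupted_acc,
            "retention": corrupted_acc / clean_acc,
        }
    return report


# Hypothetical per-sample results, not benchmark data.
sample_records = [
    ("table", "clean", True),
    ("table", "clean", True),
    ("table", "blur_s3", True),
    ("table", "blur_s3", False),
    ("document", "clean", True),
    ("document", "blur_s3", True),
]
report = accuracy_by_category(sample_records)
```

Sorting the resulting report by its "retention" field gives a quick list of the categories to test first.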

Topics

Vision-Language Models
OCR Robustness
Visual Perturbations
OCR-Robust Benchmark
Document Understanding
Chart Analysis

Best for: AI Engineer, Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.