How Robust is OCR-Reasoning? Evaluating OCR-Reasoning Robustness of Vision-Language Models under Visual Perturbations

Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, medium

Summary

OCR-Robust is a new benchmark for systematically evaluating how robust Vision-Language Models (VLMs) are on OCR reasoning tasks under visual perturbations. Detailed in 2606.26041, it comprises 812 samples across two subsets: OCR1.0, which covers documents, scene text, receipts, handwriting, and mathematical content, and OCR2.0, which focuses on charts, geometry diagrams, and tables. From a pilot study of 18 perturbations, the researchers selected 5 representative types, each applied at 3 severity levels. Robustness is measured with clean accuracy, Relative Corruption Retention (RCR), Worst-Case Retention (WCR), and a composite Corruption Robustness Index (CRI). Benchmarking 18 models, spanning proprietary systems and open-source VLMs, showed that high clean accuracy does not guarantee strong robustness and that models degrade significantly on structure-sensitive OCR tasks, particularly charts and tables.

Key takeaway

If you are an AI Scientist or Machine Learning Engineer developing or deploying Vision-Language Models for OCR tasks, prioritize robustness evaluation beyond standard clean accuracy metrics. Your models may degrade markedly on structure-sensitive content such as charts and tables under visual perturbations, even if they perform well on clean, document-like inputs. Integrate benchmarks such as OCR-Robust into your development pipeline to identify and mitigate these vulnerabilities before deployment.
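One way to act on this in a development pipeline is a simple pre-deployment gate that compares perturbed accuracy against clean accuracy per content category. The sketch below is illustrative only: the function name `failing_categories`, the 0.8 retention threshold, the category names, and all accuracy values are assumptions made for this example, not figures or tooling from the OCR-Robust paper.

```python
def failing_categories(results, min_retention=0.8):
    """Return the categories whose perturbed/clean accuracy ratio is too low.

    `results` maps a category name to a dict with "clean" and "perturbed"
    accuracies in [0, 1]. The threshold is a hypothetical project choice.
    """
    failures = []
    for category, acc in results.items():
        if acc["clean"] <= 0:
            # No clean signal to retain, so treat the category as failing.
            failures.append(category)
            continue
        retention = acc["perturbed"] / acc["clean"]
        if retention < min_retention:
            failures.append(category)
    return sorted(failures)


# Hypothetical accuracies, chosen only to illustrate the gate.
example_results = {
    "documents": {"clean": 0.90, "perturbed": 0.81},
    "charts": {"clean": 0.85, "perturbed": 0.51},
    "tables": {"clean": 0.80, "perturbed": 0.56},
}

flagged = failing_categories(example_results, min_retention=0.8)
print(flagged)
```

With these made-up numbers, the document category passes while charts and tables are flagged, mirroring the pattern the summary describes.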

Key insights

The robustness of VLMs' OCR reasoning under visual degradation is critical, yet high clean accuracy does not guarantee it.

Method

The OCR-Robust benchmark evaluates VLM robustness using 5 visual perturbation types at 3 severity levels, measuring clean accuracy, RCR, WCR, and CRI across 812 samples.
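The summary names the metrics but does not give their formulas, so the following is a minimal sketch under stated assumptions rather than the paper's definitions. It assumes RCR is the mean accuracy over all perturbation and severity conditions divided by clean accuracy, and WCR is the lowest such accuracy divided by clean accuracy. CRI is left out because how it combines the other quantities is not stated here. The perturbation names and accuracy values are placeholders.

```python
from statistics import mean


def retention_metrics(clean_acc, corrupted_acc):
    """Compute assumed retention metrics for one model.

    `corrupted_acc` maps (perturbation, severity) to accuracy in [0, 1].
    The RCR and WCR definitions below are assumptions, not the paper's.
    """
    if clean_acc <= 0:
        raise ValueError("clean accuracy must be positive")
    if not corrupted_acc:
        raise ValueError("at least one corrupted condition is required")
    values = list(corrupted_acc.values())
    return {
        "clean": clean_acc,
        "rcr": mean(values) / clean_acc,  # assumed: average retention
        "wcr": min(values) / clean_acc,   # assumed: worst-condition retention
    }


# Placeholder grid: 5 perturbation types at 3 severity levels (15 conditions).
perturbations = ["p1", "p2", "p3", "p4", "p5"]  # hypothetical names
grid = {(p, s): 0.60 for p in perturbations for s in (1, 2, 3)}
grid[("p5", 3)] = 0.40  # one hypothetical worst-case condition

metrics = retention_metrics(clean_acc=0.80, corrupted_acc=grid)
print(metrics)
```

Reporting a worst-case figure alongside the average is useful because a single severe condition can be hidden by an otherwise acceptable mean.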

Best for: AI Engineer, Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.