KazakhOCR: A Synthetic Benchmark for Evaluating Multimodal Models in Low-Resource Kazakh Script OCR

Source: cs.CV updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Advanced, long

Summary

KazakhOCR is a new synthetic OCR dataset built to address the scarcity of benchmarks for Kazakh's low-resource scripts, specifically Arabic and Latin. It comprises 7,219 images across the Arabic, Cyrillic, and Latin scripts, with variations in font, color, and noise to simulate real-world OCR challenges. The researchers evaluated three multimodal large language models (MLLMs), Gemma-3-12B-it, Qwen2.5-VL-7B-Instruct, and Llama-3.2-11B-Vision-Instruct, on a 600-image subset for OCR and language identification. The MLLMs performed poorly on Latin and Arabic script OCR, with character error rates (CERs) of 26.4% to 31.0% for Latin and 35.5% to 72.5% for Arabic. They also largely failed to identify Arabic-script text as Kazakh, often misclassifying it as Arabic, Farsi, or Kurdish. In contrast, a traditional OCR baseline using Tesseract achieved significantly lower CERs across all scripts, highlighting a substantial gap in MLLM capabilities for low-resource abjad-based scripts.
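For readers unfamiliar with the metric, CER is conventionally computed as the character-level edit distance between the model output and the reference text, divided by the reference length. The sketch below shows that standard definition only. The function names are illustrative, and this summary does not state whether the paper applies any text normalization (for example Unicode normalization or whitespace stripping) before scoring, so none is applied here.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions
    needed to turn `hypothesis` into `reference`."""
    # prev[j] holds the distance between reference[:i-1] and hypothesis[:j].
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + substitution_cost,   # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Two Kazakh-specific letters misread as their plain Cyrillic lookalikes:
# 2 substitutions over 10 reference characters.
print(cer("қазақ тілі", "казак тілі"))
```

Because the denominator is the reference length, CER can exceed 100% when a model hallucinates far more text than the image contains.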

Key takeaway

AI scientists and research scientists developing or deploying multimodal models for global applications should recognize the significant limitations of current MLLMs on low-resource scripts such as Kazakh Arabic. Prioritize dedicated research and inclusive training data for these scripts to prevent critical failures in OCR and language identification, especially for communities that rely on less common writing systems. Consider using traditional OCR methods as a fallback or primary solution for such languages.

Key insights

Current MLLMs struggle with OCR and language identification for low-resource Kazakh scripts, especially Arabic, and perform worse than a traditional Tesseract OCR baseline.

Method

The KazakhOCR benchmark was constructed by collecting authentic text corpora, applying random balanced subsampling, and rendering synthetic images with varied font sizes (24-56 pt), noise (4-18), blur (0-0.8), and color palettes with a minimum contrast ratio of 4.5.
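The parameter ranges above can be illustrated with a small sampling sketch. Several details are assumptions, not taken from the paper: that each parameter is drawn uniformly, that font sizes are integers, and that the contrast ratio follows the WCAG definition based on relative luminance. The units of the noise and blur values are not given in this summary, so they are passed through as plain numbers. The function names are hypothetical.

```python
import random


def relative_luminance(rgb):
    """Relative luminance of an sRGB color given as three 0-255 integers
    (WCAG definition, assumed here)."""
    def linearize(channel):
        s = channel / 255.0
        return s / 12.92 if s <= 0.04045 else ((s + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground, background):
    """WCAG contrast ratio, from 1.0 (identical) to 21.0 (black on white)."""
    l1 = relative_luminance(foreground)
    l2 = relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def sample_render_params(rng, min_contrast=4.5):
    """Draw one set of rendering parameters within the reported ranges.

    Color pairs are rejection-sampled until they meet `min_contrast`.
    """
    while True:
        text_color = tuple(rng.randrange(256) for _ in range(3))
        background_color = tuple(rng.randrange(256) for _ in range(3))
        if contrast_ratio(text_color, background_color) >= min_contrast:
            break
    return {
        "font_size_pt": rng.randint(24, 56),   # inclusive on both ends
        "noise": rng.uniform(4, 18),           # unit not stated in the summary
        "blur": rng.uniform(0.0, 0.8),         # unit not stated in the summary
        "text_color": text_color,
        "background_color": background_color,
    }


rng = random.Random(0)  # fixed seed so a generated benchmark is reproducible
print(sample_render_params(rng))
```

Rejection sampling keeps the color choice simple at the cost of a few wasted draws. A fixed palette of pre-validated color pairs would serve equally well.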

Best for: AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.