KazakhOCR: A Synthetic Benchmark for Evaluating Multimodal Models in Low-Resource Kazakh Script OCR

2026-03-17 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, long

Summary

A new synthetic OCR dataset, KazakhOCR, addresses the scarcity of benchmarks for low-resource Kazakh scripts, specifically Arabic and Latin. The dataset comprises 7,219 images across the Arabic, Cyrillic, and Latin scripts, with variations in font, color, and noise to simulate real-world OCR challenges. Researchers evaluated three multimodal large language models (MLLMs), Gemma-3-12B-it, Qwen2.5-VL-7B-Instruct, and Llama-3.2-11B-Vision-Instruct, on a 600-image subset for OCR and language identification. The MLLMs performed poorly on Latin and Arabic script OCR, with character error rates (CERs) of 26.4% to 31.0% for Latin and 35.5% to 72.5% for Arabic. They also largely failed to identify Arabic-script text as Kazakh, often misclassifying the language as Arabic, Farsi, or Kurdish. In contrast, a traditional OCR baseline using Tesseract achieved significantly lower CERs across all scripts, highlighting a substantial gap in MLLM capabilities for low-resource abjad-based scripts.
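The CER figures above are easier to interpret with the metric written out. The following is a minimal sketch of the standard definition (Levenshtein edit distance divided by reference length), not the paper's evaluation code. The NFC normalization step and the function names `levenshtein` and `cer` are assumptions made for illustration; the summary does not say how the authors normalized text or aggregated scores across images.

```python
import unicodedata


def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference characters so far
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance over reference length."""
    # Assumption: compare NFC-normalized strings so that composed and
    # decomposed forms of the same character are not counted as errors.
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


print(cer("қазақ", "қазак"))  # one substitution in five characters: 0.2
```

Because insertions count as errors, CER can exceed 100% when a model emits far more text than the reference contains.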

Key takeaway

AI scientists and research scientists developing or deploying multimodal models for global applications should recognize the significant limitations of current MLLMs on low-resource scripts such as Kazakh written in Arabic script. Prioritize dedicated research and inclusive training data for these scripts to prevent critical failures in OCR and language identification, especially for communities that rely on less common writing systems. Consider integrating traditional OCR methods as a fallback or primary solution for such languages.

Key insights

Current MLLMs struggle with OCR and language identification for low-resource Kazakh scripts, especially Arabic, and perform worse than a traditional Tesseract baseline.

Principles

Synthetic datasets can bridge data gaps for low-resource languages.
MLLMs exhibit significant performance disparities across different scripts.
Traditional OCR often outperforms MLLMs on low-resource scripts.

Method

The KazakhOCR benchmark was constructed by collecting authentic text corpora, applying random balanced subsampling, and generating synthetic images with varied font sizes (24-56 pt), noise levels (4-18), blur (0-0.8), and color palettes with a minimum contrast ratio of 4.5:1.
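A rough sketch of how such rendering parameters could be sampled is shown below. It is not the authors' pipeline: the WCAG 2.x contrast-ratio formula is assumed because 4.5:1 matches the WCAG AA threshold, colors are drawn by rejection sampling rather than from the paper's palettes, and uniform sampling, the function names, and the dictionary keys are all hypothetical. The units of the noise and blur ranges are not given in the summary, so they are passed through as plain numbers.

```python
import random


def relative_luminance(rgb):
    """WCAG 2.x relative luminance of an sRGB color with 0-255 channels."""
    def linearize(channel):
        s = channel / 255
        return s / 12.92 if s <= 0.04045 else ((s + 0.055) / 1.055) ** 2.4
    r, g, b = (linearize(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg, bg):
    """WCAG contrast ratio, from 1.0 (identical) to 21.0 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


def sample_render_params(rng, min_contrast=4.5):
    """Draw one hypothetical set of rendering parameters for a text image."""
    # Assumption: redraw random color pairs until the contrast floor is met.
    while True:
        fg = tuple(rng.randrange(256) for _ in range(3))
        bg = tuple(rng.randrange(256) for _ in range(3))
        if contrast_ratio(fg, bg) >= min_contrast:
            break
    return {
        "font_size_pt": rng.randint(24, 56),  # inclusive range from the summary
        "noise": rng.uniform(4, 18),          # units unspecified in the summary
        "blur": rng.uniform(0, 0.8),          # units unspecified in the summary
        "fg": fg,
        "bg": bg,
    }


rng = random.Random(0)  # fixed seed so the sampled set is reproducible
params = sample_render_params(rng)
print(params)
```

Seeding the generator keeps a sampled benchmark reproducible, which matters when the same image set is used to compare several models.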

In practice

Prefer Tesseract over current MLLMs for Kazakh OCR in Arabic or Latin script.
If you use an MLLM for Kazakh OCR, prioritize Cyrillic-script input.
Explore prompt engineering for MLLM OCR on low-resource scripts.

Topics

Optical Character Recognition
Multimodal Large Language Models
Low-Resource Languages
Synthetic Data Generation
Kazakh Script OCR

Best for: AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.