KazakhOCR: A Synthetic Benchmark for Evaluating Multimodal Models in Low-Resource Kazakh Script OCR
Summary
KazakhOCR is a new synthetic OCR dataset that addresses the scarcity of benchmarks for low-resource Kazakh scripts, specifically Arabic and Latin. It comprises 7,219 images across the Arabic, Cyrillic, and Latin scripts, with variations in font, color, and noise that simulate real-world OCR challenges. The researchers evaluated three multimodal large language models (MLLMs), Gemma-3-12B-it, Qwen2.5-VL-7B-Instruct, and Llama-3.2-11B-Vision-Instruct, on a 600-image subset for both OCR and language identification. The MLLMs performed poorly on Latin and Arabic script OCR, with character error rates (CERs) of 26.4% to 31.0% for Latin and 35.5% to 72.5% for Arabic. They also largely failed to identify Arabic-script text as Kazakh, often misclassifying it as Arabic, Farsi, or Kurdish. In contrast, a traditional OCR baseline using Tesseract achieved significantly lower CERs across all scripts, highlighting a substantial gap in MLLM capabilities for low-resource abjad-based scripts.
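The character error rate quoted above is conventionally defined as the character-level edit distance between the model output and the reference, divided by the reference length. The following is a minimal sketch of that standard definition, not the paper's evaluation code; the function names are illustrative, and the summary does not say whether the authors normalize text (Unicode form, whitespace, case) before scoring.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and
    substitutions needed to turn `hyp` into `ref`."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance over reference length.
    Assumption: strings are compared code point by code point,
    with no normalization applied."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


# One wrong character out of five gives a CER of 20%.
print(cer("қазақ", "қазак"))
```

Because the hypothesis can be longer than the reference, a CER computed this way can exceed 100%.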
Key takeaway
AI Scientists and Research Scientists who develop or deploy multimodal models for global applications should recognize how limited current MLLMs are at processing low-resource scripts such as Kazakh Arabic. Prioritize dedicated research and inclusive training data for these scripts to prevent critical failures in OCR and language identification, especially for communities that rely on less common writing systems. Consider integrating traditional OCR methods as a fallback or primary solution for such languages.
Key insights
Current MLLMs struggle with OCR and language identification for low-resource Kazakh scripts, especially Arabic, performing worse than traditional OCR.
Principles
- Synthetic datasets can bridge data gaps for low-resource languages.
- MLLMs exhibit significant performance disparities across different scripts.
- Traditional OCR often outperforms MLLMs on low-resource scripts.
Method
The KazakhOCR benchmark was constructed by collecting authentic text corpora, applying random balanced subsampling, and rendering synthetic images with varied fonts (sizes of 24-56 pt), noise levels (4-18), blur (0-0.8), and color palettes constrained to a minimum contrast ratio of 4.5.
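As a rough illustration of the parameter ranges above, the sketch below samples one set of rendering parameters and rejects color pairs that fall below the contrast threshold. Several things here are assumptions rather than details from the paper: the contrast ratio is computed with the WCAG 2.x formula, font size is treated as an integer, noise and blur are drawn uniformly with unspecified units, and all function and key names are hypothetical.

```python
import random


def _linear(channel: int) -> float:
    """Linearize one 8-bit sRGB channel (WCAG 2.x definition)."""
    s = channel / 255.0
    return s / 12.92 if s <= 0.03928 else ((s + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb) -> float:
    r, g, b = (_linear(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg, bg) -> float:
    """WCAG contrast ratio, from 1.0 (identical) to 21.0 (black on white)."""
    l1, l2 = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def sample_render_params(rng: random.Random, min_contrast: float = 4.5,
                         max_tries: int = 1000) -> dict:
    """Draw one hypothetical set of rendering parameters."""
    fg, bg = (0, 0, 0), (255, 255, 255)  # fallback if sampling never succeeds
    for _ in range(max_tries):
        cand_fg = tuple(rng.randrange(256) for _ in range(3))
        cand_bg = tuple(rng.randrange(256) for _ in range(3))
        if contrast_ratio(cand_fg, cand_bg) >= min_contrast:
            fg, bg = cand_fg, cand_bg
            break
    return {
        "font_pt": rng.randint(24, 56),  # assumed integer point sizes
        "noise": rng.uniform(4, 18),     # units not given in the summary
        "blur": rng.uniform(0.0, 0.8),   # units not given in the summary
        "fg": fg,
        "bg": bg,
    }


params = sample_render_params(random.Random(0))
print(params)
```

Rejection sampling keeps the palette random while guaranteeing that every rendered image stays legible, which is presumably the purpose of the contrast floor.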
In practice
- Prefer Tesseract over current MLLMs for Kazakh OCR in the Arabic and Latin scripts.
- Prioritize Cyrillic script for MLLM-based Kazakh OCR tasks.
- Explore prompt engineering for MLLM OCR on low-resource scripts.
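One way to act on the recommendations above is to route each image to an OCR engine based on its known script. The sketch below is only an illustration of that idea: the engine callables are injected placeholders rather than real Tesseract or MLLM bindings, the script label is assumed to come from dataset metadata, and all names are hypothetical.

```python
from typing import Callable

OcrFn = Callable[[bytes], str]

# Scripts for which the list above recommends classical OCR over MLLMs.
CLASSICAL_FIRST = {"arabic", "latin"}


def pick_engine(script: str) -> str:
    """Return "classical" or "mllm" for a given script label."""
    label = script.strip().lower()
    if label in CLASSICAL_FIRST:
        return "classical"
    if label == "cyrillic":
        return "mllm"
    raise ValueError(f"unsupported script: {script!r}")


def run_ocr(image: bytes, script: str,
            classical_ocr: OcrFn, mllm_ocr: OcrFn) -> str:
    """Dispatch the image to whichever engine the routing rule selects."""
    if pick_engine(script) == "classical":
        return classical_ocr(image)
    return mllm_ocr(image)


# Usage with stand-in engines that just report which path was taken.
print(run_ocr(b"", "Arabic", lambda img: "classical path", lambda img: "mllm path"))
```

Keeping the engines as injected callables makes it straightforward to swap in a real Tesseract wrapper or model client without changing the routing rule.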
Topics
- Optical Character Recognition
- Multimodal Large Language Models
- Low-Resource Languages
- Synthetic Data Generation
- Kazakh Script OCR
Best for: AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.