Urdu Katib Handwritten Dataset: A Historical Document Dataset for Offline Urdu Handwritten Text Recognition with CRNN-Based Baseline Evaluation

Source: cs.CL updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics, Natural Language Processing) · Depth: Advanced, extended

Summary

The Urdu Katib Handwritten Dataset (UKHD) is introduced as the first offline dataset of Urdu handwritten text lines curated specifically from historical documents written by Katibs in the Nastalique calligraphic style. It addresses the scarcity of benchmark resources for Urdu Handwritten Text Recognition (UHTR), a task made difficult by Urdu's cursive, diagonal, overlapping, and context-sensitive script. The study also evaluates CRNN-based hybrid models on UKHD and identifies the CNN-BGRU-CTC model as the most robust performer. On the test set, this model achieved an average Character Error Rate (CER) of 5.2% and an average Word Error Rate (WER) of 16.9%, indicating strong recognition of historical Urdu script and its diacritics.
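To make the two reported metrics concrete, here is a minimal sketch of how CER and WER are commonly computed: the Levenshtein edit distance between the reference and the predicted transcription, divided by the reference length in characters or words. The function names are illustrative, and the summary does not say how the authors normalize text or count diacritics, so this sketch simply treats each Unicode code point as one character (an assumption).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from the reference
                            curr[j - 1] + 1,      # insert h into the reference
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: character edits divided by reference length.

    Assumption: every Unicode code point (including combining diacritics)
    counts as one character.
    """
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


def wer(reference, hypothesis):
    """Word Error Rate: word edits divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)


if __name__ == "__main__":
    print(cer("kitten", "sitting"))
    print(wer("the cat sat", "the cat sit"))
```

A single wrong character makes the whole word wrong, so WER is typically well above CER, which matches the gap between the 5.2% and 16.9% figures reported above.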

Key takeaway

For AI Scientists and Machine Learning Engineers working on historical document preservation, this research highlights the need for specialized datasets such as UKHD. Among the CRNN-based models evaluated, the CNN-BGRU-CTC architecture performed best (5.2% CER, 16.9% WER), which makes it a reasonable starting baseline for Urdu Handwritten Text Recognition. Consider adding advanced image enhancement and transformer-based post-processing to further improve recognition of challenging cursive scripts.

Key insights

The Urdu Katib Handwritten Dataset (UKHD) and the CNN-BGRU-CTC model establish a baseline for historical Urdu handwritten text recognition.
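The CTC component of the baseline model is what turns per-frame network outputs into a transcription without character-level alignment. A minimal sketch of greedy (best-path) CTC decoding follows: take the highest-scoring class in each frame, collapse consecutive repeats, then drop the blank symbol. The function name, the toy alphabet, and the choice of index 0 for the blank are assumptions for illustration; the summary does not state which decoding strategy the authors used.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Greedy CTC decoding over a list of per-frame score vectors.

    frame_scores: one list of class scores per time step.
    alphabet:     maps a class index to its character; the entry at `blank`
                  is a placeholder and is never emitted.
    """
    # 1. Best class per frame (argmax).
    best_path = [max(range(len(scores)), key=scores.__getitem__)
                 for scores in frame_scores]

    # 2. Collapse consecutive repeats, then 3. remove blanks.
    decoded, previous = [], None
    for index in best_path:
        if index != previous and index != blank:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)


if __name__ == "__main__":
    # Toy example with a three-class alphabet: blank, "a", "b".
    alphabet = ["", "a", "b"]
    blank_f = [0.90, 0.05, 0.05]
    a_f = [0.10, 0.80, 0.10]
    b_f = [0.10, 0.20, 0.70]
    frames = [a_f, a_f, blank_f, a_f, b_f, b_f, blank_f]
    print(ctc_greedy_decode(frames, alphabet))  # aab
```

The blank frame between the second and third frames is what lets the decoder emit two separate "a" characters instead of merging them into one.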

Method

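UKHD is built with a semi-automatic pipeline: image acquisition, preprocessing (skew correction using the horizontal projection profile, HPP), line segmentation (HPP-based, with manual adjustment), and annotation (Cloud Vision API output, manually corrected). The CRNN models combine a CNN for feature extraction, an RNN for sequence modeling (a bidirectional GRU in the best-performing variant), and Connectionist Temporal Classification (CTC) for alignment-free transcription.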
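The HPP-based line segmentation step can be sketched as follows: count the ink pixels in every row of a binarized page, then treat each run of consecutive rows with ink as one text line. This is a simplified illustration, not the authors' implementation; the function names and the `min_ink` and `min_height` thresholds are hypothetical, and the page is assumed to be already binarized (1 for ink, 0 for background) and deskewed.

```python
def horizontal_projection(binary_image):
    """Horizontal projection profile: number of ink pixels (1s) in each row."""
    return [sum(row) for row in binary_image]


def segment_lines(binary_image, min_ink=1, min_height=1):
    """Return (top, bottom) row ranges, bottom exclusive, one per text line.

    min_ink and min_height are hypothetical tuning knobs: rows with fewer
    than min_ink ink pixels count as gaps, and runs shorter than
    min_height rows are discarded as noise.
    """
    profile = horizontal_projection(binary_image)
    lines, start = [], None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                      # a text line begins
        elif ink < min_ink and start is not None:
            if y - start >= min_height:
                lines.append((start, y))   # a text line ends at a gap row
            start = None
    # Close a line that runs to the bottom edge of the page.
    if start is not None and len(profile) - start >= min_height:
        lines.append((start, len(profile)))
    return lines


if __name__ == "__main__":
    page = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 1, 1],
        [0, 0, 0, 0],
    ]
    print(horizontal_projection(page))  # [0, 2, 2, 0, 2, 0]
    print(segment_lines(page))          # [(1, 3), (4, 5)]
```

In diagonal, overlapping scripts such as Nastalique, the gaps between lines often do not fall to zero ink, so a pure threshold like this can merge or split lines; that is plausibly why the pipeline described above includes a manual adjustment step.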

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.