Urdu Katib Handwritten Dataset: A Historical Document Dataset for Offline Urdu Handwritten Text Recognition with CRNN-Based Baseline Evaluation

2026-06-18 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Natural Language Processing · Depth: Advanced, extended

Summary

The Urdu Katib Handwritten Dataset (UKHD) is introduced as the first offline dataset of Urdu handwritten text lines curated specifically from historical documents written by Katibs in the Nastalique calligraphic style. It addresses the scarcity of benchmark resources for Urdu Handwritten Text Recognition (UHTR), a task made difficult by Urdu's cursive, diagonal, overlapping, and context-sensitive script. The study also evaluates CRNN-based hybrid models on UKHD and identifies the CNN-BGRU-CTC model as the most robust performer. This model achieved an average Character Error Rate (CER) of 5.2% and an average Word Error Rate (WER) of 16.9% on the test set, showing that it handles historical Urdu script and its diacritics well.
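
For readers who want to reproduce the reported metrics on their own predictions, here is a minimal sketch of how CER and WER are commonly computed: the edit distance between the reference and the predicted text, divided by the reference length in characters or in words. The function names are illustrative, and the paper's exact text normalization is not stated in this summary, so treat this as an assumption rather than the authors' evaluation script.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Python strings are sequences of Unicode code points, so an Urdu diacritic stored as a combining mark counts as its own character here. Whether that matches the paper's counting is unknown.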

Key takeaway

For AI Scientists and Machine Learning Engineers working on historical document preservation, this research highlights the need for specialized datasets like UKHD. For Urdu Handwritten Text Recognition, start with CRNN-based models, specifically the CNN-BGRU-CTC architecture, which was the most robust of the models evaluated (5.2% CER, 16.9% WER). Consider adding advanced image enhancement and transformer-based post-processing to further improve recognition rates for challenging cursive scripts.

Key insights

The Urdu Katib Handwritten Dataset (UKHD) and the CNN-BGRU-CTC model establish a baseline for historical Urdu handwritten text recognition.

Principles

Cursive scripts like Urdu pose unique HTR challenges due to diagonal, overlapping, and context-sensitive characters.
Bidirectional RNN variants (BLSTM, BGRU) are well suited to image-based sequence recognition in cursive scripts because they use context from both directions along the text line.
Implicit segmentation-based recognition with hybrid deep learning models is effective for UHTR.
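
Implicit segmentation means the model never cuts a text line into individual characters. With a CTC output layer, as in the CRNN models evaluated here, the network emits one score vector per image column and the transcription is recovered by collapsing that sequence. The sketch below shows greedy (best-path) CTC decoding in plain Python. The function name, the toy alphabet, and the choice of index 0 for the blank symbol are assumptions for illustration, and the paper's actual decoder is not described in this summary.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Best-path CTC decoding.

    frame_scores: list of per-frame score lists, one score per class.
    alphabet: list mapping class index to symbol; alphabet[blank] is unused.
    """
    # Pick the highest-scoring class for every frame.
    best_path = [max(range(len(scores)), key=scores.__getitem__)
                 for scores in frame_scores]

    decoded = []
    previous = None
    for idx in best_path:
        # Collapse repeated labels, then drop blanks.
        if idx != previous and idx != blank:
            decoded.append(alphabet[idx])
        previous = idx
    return "".join(decoded)


# Toy example: 7 frames over the classes <blank>, "a", "b".
toy_alphabet = ["<blank>", "a", "b"]
toy_frames = [
    [0.1, 0.8, 0.1],    # a
    [0.2, 0.7, 0.1],    # a (repeat, collapsed)
    [0.9, 0.05, 0.05],  # blank separates the two a's
    [0.1, 0.6, 0.3],    # a
    [0.1, 0.2, 0.7],    # b
    [0.0, 0.1, 0.9],    # b (repeat, collapsed)
    [0.8, 0.1, 0.1],    # blank
]
decoded_text = ctc_greedy_decode(toy_frames, toy_alphabet)
```

The blank between the two "a" frames is what lets CTC emit a doubled character. Without it the repeats would collapse into one.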

Method

UKHD was built with a semi-automatic pipeline: image acquisition, preprocessing (skew correction via the horizontal projection profile, HPP), line segmentation (HPP-based, with manual adjustment), and annotation (Cloud Vision API output with manual correction). The CRNN models combine CNN feature extraction, RNN sequence modeling, and CTC for alignment-free transcription.
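
The HPP-based line segmentation step can be sketched as follows: count the ink pixels in each image row, then treat runs of rows with ink as text lines and empty rows as the gaps between them. This is a simplified illustration, not the authors' implementation. It assumes an already binarized and deskewed page given as a list of rows with 1 for ink and 0 for background, and the names `segment_lines`, `min_ink`, and `min_height` are hypothetical.

```python
def horizontal_projection(binary_image):
    """Number of ink pixels (value 1) in each row of the image."""
    return [sum(row) for row in binary_image]


def segment_lines(binary_image, min_ink=1, min_height=1):
    """Return (top, bottom) row ranges of text lines, bottom exclusive.

    A row belongs to a line when it has at least min_ink ink pixels.
    Runs shorter than min_height rows are discarded as noise.
    """
    profile = horizontal_projection(binary_image)
    lines = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                      # entering a text line
        elif ink < min_ink and start is not None:
            if y - start >= min_height:
                lines.append((start, y))   # leaving a text line
            start = None
    # Close a line that runs to the bottom edge of the page.
    if start is not None and len(profile) - start >= min_height:
        lines.append((start, len(profile)))
    return lines


# Toy 7x4 page with two text lines separated by blank rows.
toy_page = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
toy_lines = segment_lines(toy_page)
```

On real Nastalique pages, diagonal strokes and diacritics from neighboring lines can overlap vertically, so the valleys in the profile are not always empty. That is a plausible reason the pipeline pairs HPP with manual adjustment.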

In practice

Use the UKHD dataset to develop robust UHTR systems for historical Urdu literature.
Start from the CNN-BGRU-CTC architecture for Urdu Katib Handwriting Recognition (UKHR), since it was the most robust model in the study's evaluation.
Employ semi-automatic methods for efficient dataset creation, combining automated transcription with manual review.

Topics

Urdu Handwritten Text Recognition
Urdu Katib Handwritten Dataset
CRNN Models
Nastalique Calligraphy
Historical Document Preservation
Deep Learning for OCR

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.