Why Nepali OCR Is Brutally Hard — And How We Finally Solved It

Source: NLP on Medium · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics, Software Development & Engineering) · Depth: Advanced, long

Summary

An intelligent Devanagari OCR system automates the extraction of structured data from Nepali citizenship cards and uses it to pre-fill government forms. It targets three properties of the Devanagari script that break the assumptions of Latin-script OCR engines: the Shirorekha (the headline that connects the characters of a word), the three-zone vertical layout, and conjuncts (fused consonants). The architecture is a four-stage deep learning pipeline: (1) document alignment using YOLOv8-Small and homography-based perspective correction; (2) layout analysis with PP-StructureV3, plus MTCNN for portrait extraction; (3) a hybrid OCR/HTR engine that combines PaddleOCR-VL for printed text with a fine-tuned TrOCR model for handwritten input; and (4) a spaCy NER pipeline for structured data extraction. A human-in-the-loop safety net, triggered when a confidence score falls below 78%, logs corrections to a MySQL audit table, and those corrections feed back into model training. Synthetic data generation and RoundTripOCR are used to overcome data scarcity and to generate realistic error patterns for model improvement.
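The perspective-correction step in the first stage can be illustrated with a minimal, dependency-free sketch. It assumes the four card corners are already known (how the article derives them from the YOLOv8-Small detection is not stated here), and the function names `homography` and `warp_point` as well as the sample coordinates are hypothetical. In practice a library routine such as OpenCV's `cv2.getPerspectiveTransform` followed by `cv2.warpPerspective` would more likely be used; the code below only shows the underlying computation.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate corner configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def homography(src: Sequence[Point], dst: Sequence[Point]) -> List[float]:
    """Return the 3x3 homography (row-major, h[8] == 1) mapping 4 src points to 4 dst points."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence contributes two linear equations in h[0..7].
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    return solve_linear(a, b) + [1.0]


def warp_point(h: Sequence[float], p: Point) -> Point:
    """Apply the homography to one point, including the projective divide."""
    x, y = p
    w = h[6] * x + h[7] * y + h[8]
    return ((h[0] * x + h[1] * y + h[2]) / w, (h[3] * x + h[4] * y + h[5]) / w)


# Hypothetical corners of a skewed card in a photo, and the upright target rectangle.
photo_corners = [(120, 80), (910, 140), (880, 640), (90, 560)]
card_rect = [(0, 0), (856, 0), (856, 540), (0, 540)]
H = homography(photo_corners, card_rect)
```

Warping every pixel of the photo through `H` (or its inverse, when sampling from the output side) yields the flattened card that the later stages operate on.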

Key takeaway

AI engineers building solutions for complex, non-Latin scripts, or for documents that mix printed and handwritten text, should consider a hybrid OCR architecture. Segment the problem into distinct stages and use a specialized model for each task, such as document alignment, layout analysis, and text recognition. Add a human-in-the-loop step gated by confidence scores to safeguard accuracy and to turn reviewer corrections into training data, which matters most in data-scarce environments.
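A confidence-gated review loop of this kind might look roughly like the sketch below. The 78% threshold comes from the summary; everything else is an assumption made for illustration: the `FieldResult` type, the `ocr_corrections` table and its columns, and the use of an in-memory SQLite database as a stand-in for the MySQL audit table the article describes.

```python
import sqlite3
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.78  # from the summary: scores below 78% trigger human review


@dataclass
class FieldResult:
    name: str          # hypothetical field identifier, e.g. "full_name"
    text: str          # text predicted by the OCR/HTR engine
    confidence: float  # recognition confidence in [0, 1]


def needs_review(field: FieldResult, threshold: float = REVIEW_THRESHOLD) -> bool:
    """A field goes to a human reviewer when its confidence is below the threshold."""
    return field.confidence < threshold


def open_audit_log() -> sqlite3.Connection:
    """Create the audit table (SQLite here, standing in for MySQL; schema is assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ocr_corrections ("
        "field_name TEXT, predicted TEXT, corrected TEXT, confidence REAL)"
    )
    return conn


def log_correction(conn: sqlite3.Connection, field: FieldResult, corrected: str) -> None:
    """Record what the model predicted and what the reviewer entered."""
    conn.execute(
        "INSERT INTO ocr_corrections VALUES (?, ?, ?, ?)",
        (field.name, field.text, corrected, field.confidence),
    )
    conn.commit()


def training_pairs(conn: sqlite3.Connection) -> list:
    """Return (predicted, corrected) pairs where the reviewer changed the text."""
    return conn.execute(
        "SELECT predicted, corrected FROM ocr_corrections WHERE predicted <> corrected"
    ).fetchall()


# Usage: route fields, log the reviewer's fix, then harvest training pairs.
audit = open_audit_log()
fields = [
    FieldResult("full_name", "राम बहदुर", 0.61),
    FieldResult("district", "काठमाडौं", 0.95),
]
for f in fields:
    if needs_review(f):
        log_correction(audit, f, corrected="राम बहादुर")  # reviewer input, hard-coded here
pairs = training_pairs(audit)
```

Keeping the predicted text next to the corrected text is the design choice that matters: each logged row doubles as a labeled example for the next training round.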

Key insights

A hybrid deep learning pipeline overcomes complex script challenges to automate document data extraction and form pre-filling.

Method

The system uses YOLOv8 for document alignment, PP-StructureV3 for layout analysis, MTCNN for face detection, PaddleOCR-VL for printed text, TrOCR for handwritten text, and spaCy NER for data extraction, all supported by synthetic training data and RoundTripOCR.
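One way to picture how these components fit together is a pipeline object that holds one callable per stage and routes each region to the printed or handwritten recognizer. This is a structural sketch only: the `Pipeline` and `Region` types, the `handwritten` flag, and the stub lambdas are hypothetical, and none of the real model APIs are called. How the original system decides whether a region is printed or handwritten is not stated, so the flag is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Region:
    label: str            # hypothetical region name, e.g. "full_name"
    handwritten: bool     # assumed to be set by the layout stage
    pixels: object = None # cropped image data; unused by the stubs below


@dataclass
class Pipeline:
    align: Callable[[object], object]                    # e.g. detection + homography
    layout: Callable[[object], List[Region]]             # e.g. layout analysis
    read_printed: Callable[[Region], str]                # e.g. printed-text OCR
    read_handwritten: Callable[[Region], str]            # e.g. handwriting recognition
    extract: Callable[[Dict[str, str]], Dict[str, str]]  # e.g. NER-based field extraction

    def run(self, image: object) -> Dict[str, str]:
        aligned = self.align(image)
        texts: Dict[str, str] = {}
        for region in self.layout(aligned):
            # Hybrid step: pick the recognizer that matches the region type.
            reader = self.read_handwritten if region.handwritten else self.read_printed
            texts[region.label] = reader(region)
        return self.extract(texts)


# Stub stages so the sketch runs without any model installed.
demo = Pipeline(
    align=lambda img: img,
    layout=lambda img: [Region("card_number", False), Region("full_name", True)],
    read_printed=lambda r: f"printed:{r.label}",
    read_handwritten=lambda r: f"handwritten:{r.label}",
    extract=lambda texts: dict(texts),
)
result = demo.run(image=None)
```

Because each stage is just a callable, a single model can be swapped or retrained without touching the rest of the pipeline.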

Best for: Machine Learning Engineer, AI Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by NLP on Medium.