BullingerDB: A Dataset for Handwritten Text Recognition and Writer Retrieval
Summary
BullingerDB is a new large-scale benchmark dataset for historical document analysis, targeting handwritten text recognition (HTR) and writer retrieval. Derived from the correspondence of Heinrich Bullinger (1504-1575), it comprises 20,898 pages and 499,222 text lines from 796 writers, spanning six decades. The material shows significant stylistic variation, is multilingual (primarily Latin and Early New High German), and carries metadata such as writer identity and time of writing. On the text recognition task, TrOCR achieves a Character Error Rate (CER) of 9.1%. For writer retrieval, the authors introduce a temporal nDCG metric alongside standard mAP; mAP scores reach 78.3%, and the results highlight the difficulty posed by long-term stylistic change. The dataset is intended as a new benchmark for multilingual historical text recognition and temporally aware writer analysis.
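To make the CER figure concrete, here is a minimal sketch of how a character error rate is commonly computed: total Levenshtein edit distance divided by total reference length. The summary does not say how the paper aggregates CER (over the whole corpus or averaged per line), so the corpus-level aggregation and the function names below are assumptions for illustration.

```python
from typing import Sequence


def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Corpus-level CER (assumed aggregation): total edits / total reference chars."""
    total_edits = sum(levenshtein(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars
```

Under this definition, a CER of 9.1% corresponds to roughly nine character edits per hundred reference characters.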
Key takeaway
For machine learning engineers building historical document analysis systems, BullingerDB is a benchmark worth adopting. Consider using this large, multilingual dataset to train and evaluate handwritten text recognition and writer retrieval models, especially where handwriting style varies over time. Use the temporal nDCG metric alongside mAP when assessing writer retrieval, so that the evaluation accounts for chronological changes in handwriting.
Key insights
BullingerDB is a large, multilingual dataset for historical HTR and writer retrieval, introducing temporal metrics and highlighting challenges from long-term stylistic variation.
Principles
- Historical HTR benefits from multilingual, time-aware data.
- Long-term stylistic shifts complicate writer retrieval.
- Temporal metrics are vital for historical writer analysis.
Method
The study introduces BullingerDB, a dataset for historical document analysis, and proposes a temporal nDCG metric to assess time-aware writer retrieval performance, complementing standard mAP scores.
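For reference, the standard mAP score mentioned above can be sketched as follows. This is a generic textbook formulation, not the paper's evaluation code; it assumes each query's ranking covers the entire gallery, so every relevant (same-writer) item appears somewhere in the list.

```python
from typing import Sequence


def average_precision(ranked_relevant: Sequence[bool]) -> float:
    """AP for one query; ranked_relevant[k] is True if the item at rank k+1
    was written by the same writer as the query."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevant, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this hit
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings: Sequence[Sequence[bool]]) -> float:
    """Mean of per-query average precision."""
    return sum(average_precision(r) for r in rankings) / len(rankings)
```

Because mAP treats all same-writer items as equally relevant, it cannot distinguish retrieving a letter written the same year from one written decades later, which is the gap a temporal metric is meant to fill.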
In practice
- Train HTR models on multilingual historical texts.
- Evaluate writer retrieval with temporal nDCG.
- Develop models robust to long-term style changes.
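As an illustration of the second item, here is a rough sketch of an nDCG computation with a time-dependent gain. The summary does not give the paper's definition of temporal nDCG, so the `temporal_gain` function is hypothetical: it gives same-writer hits a gain that grows with the time gap to the query, on the assumption that temporally distant matches are harder and more informative. The 60-year cap is taken from the six-decade span of the dataset; the rest is an assumption.

```python
import math
from typing import Sequence


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain with the usual log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains: Sequence[float]) -> float:
    """DCG normalized by the ideal ordering of the same gains.
    Assumes the ranking covers the full gallery."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def temporal_gain(query_year: int, doc_year: int, same_writer: bool,
                  max_gap: int = 60) -> float:
    """Hypothetical gain: 0 for other writers, otherwise 1 plus the
    normalized time gap (capped at max_gap years)."""
    if not same_writer:
        return 0.0
    gap = min(abs(query_year - doc_year), max_gap)
    return 1.0 + gap / max_gap


# Usage: a query letter from 1540 and a ranked list of (year, same_writer) pairs.
query_year = 1540
ranked = [(1541, True), (1550, False), (1570, True)]
gains = [temporal_gain(query_year, year, same) for year, same in ranked]
score = ndcg(gains)
```

In this example the ranking is penalized for placing the 1570 letter, which has the highest gain, below a different writer's letter. Reversing the direction of the gain (rewarding temporally close matches) only requires changing `temporal_gain`.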
Topics
- BullingerDB
- Handwritten Text Recognition
- Writer Retrieval
- Historical Document Analysis
- Multilingual Text
- Temporal Metrics
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.