Does anyone have a copy of the ICDAR2013 Chinese Handwriting Competition Dataset? [R]
Summary
A user is looking for the ICDAR2013 Chinese Handwriting Recognition Competition Dataset because the official download page at "https://nlpr.ia.ac.cn/databases/handwriting/Download.html" has been down for weeks. Searches of Kaggle, HuggingFace, Google Drive, Baidu Netdisk, and GitHub, plus advanced Google dorking, turned up no copy of the original dataset. An update to the post reports a partial solution: the "shannanyinxiang/PageNet" GitHub repository contains the dataset preprocessed into LMDB format. That version lacks line-level and word-level annotations, but the character-level annotations in the "newbie2niubility/TGC-Diff" repository could potentially be used to reconstruct them. Any reconstruction would have to account for the page boundaries and margins reflected in the TGC-Diff annotations, which differ from the cropped PageNet LMDB images.
Key takeaway
The official source for the ICDAR2013 Chinese Handwriting Recognition Dataset is offline. If you need the dataset, check community-maintained GitHub repositories first: "shannanyinxiang/PageNet" hosts a preprocessed LMDB version. To reconstruct the full dataset you may need to combine sources, for example the character-level annotations from "newbie2niubility/TGC-Diff", and reconcile formatting discrepancies such as page boundaries and margins.
Key insights
The ICDAR2013 Chinese Handwriting Dataset is difficult to obtain, but a partial LMDB version exists.
Principles
- Official dataset links can become unavailable.
- Community repositories may offer preprocessed data.
- Data reconstruction requires careful annotation alignment.
Method
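To reconstruct full annotations for the ICDAR2013 dataset, start from the PageNet LMDB images and align the character-level annotations from TGC-Diff to them. The TGC-Diff annotations reflect page boundaries and margins that differ from the cropped PageNet images, so their coordinates have to be adjusted before they line up.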
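The coordinate adjustment can be sketched as a translation followed by clipping. This is a minimal illustration, not the method used by either repository: the `CharBox` structure, the `to_cropped_coords` function, and the assumption that the crop offset (`crop_left`, `crop_top`) is already known are all hypothetical, since the post does not describe the actual annotation format of TGC-Diff or how PageNet cropped its images.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CharBox:
    """Hypothetical character annotation: label plus top-left corner and size."""
    char: str
    x: int
    y: int
    w: int
    h: int


def to_cropped_coords(
    boxes: List[CharBox],
    crop_left: int,
    crop_top: int,
    crop_width: int,
    crop_height: int,
) -> List[CharBox]:
    """Move full-page boxes into the frame of a cropped image.

    Assumes the crop is an axis-aligned rectangle whose top-left corner sits at
    (crop_left, crop_top) in full-page coordinates. Boxes are clipped to the
    crop, and boxes that fall entirely outside it are dropped.
    """
    shifted = []
    for b in boxes:
        # Translate into the cropped frame.
        x0, y0 = b.x - crop_left, b.y - crop_top
        x1, y1 = x0 + b.w, y0 + b.h
        # Clip to the cropped image bounds.
        cx0, cy0 = max(0, x0), max(0, y0)
        cx1, cy1 = min(crop_width, x1), min(crop_height, y1)
        if cx1 <= cx0 or cy1 <= cy0:
            continue  # nothing of this box remains inside the crop
        shifted.append(CharBox(b.char, cx0, cy0, cx1 - cx0, cy1 - cy0))
    return shifted


page_boxes = [CharBox("你", 120, 80, 40, 50), CharBox("好", 10, 10, 20, 20)]
cropped_boxes = to_cropped_coords(page_boxes, crop_left=100, crop_top=60,
                                  crop_width=500, crop_height=700)
```

In this example the first box moves to (20, 20) and keeps its size, while the second lies wholly in the removed margin and is discarded. In practice the offset would have to be estimated per page, which this sketch does not attempt.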
In practice
- Check GitHub for preprocessed dataset versions.
- Cross-reference multiple repositories for annotations.
- Verify annotation alignment for cropped images.
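One cheap way to approach the last check is to measure how many adjusted boxes fit inside the cropped image. The sketch below assumes boxes are `(x, y, w, h)` tuples already expressed in cropped-image pixels; the function name `fraction_inside` is made up for illustration. A low score suggests the assumed offset is wrong, though a high score alone does not prove the alignment is correct.

```python
from typing import Iterable, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h) in cropped-image pixels


def fraction_inside(boxes: Iterable[Box], width: int, height: int) -> float:
    """Return the share of boxes lying fully within a width x height image."""
    boxes = list(boxes)
    if not boxes:
        return 0.0
    inside = sum(
        1
        for (x, y, w, h) in boxes
        if x >= 0 and y >= 0 and x + w <= width and y + h <= height
    )
    return inside / len(boxes)


# One box fits, one spills over the right edge of a 100 x 100 image.
score = fraction_inside([(0, 0, 10, 10), (95, 0, 10, 10)], width=100, height=100)
```

Spot-checking a few pages visually remains necessary, because boxes can all land inside the image while still being offset from the characters they describe.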
Topics
- ICDAR2013 Dataset
- Chinese Handwriting Recognition
- Dataset Availability
- LMDB Format
- Data Reconstruction
- Annotation Alignment
Best for: Research Scientist, Machine Learning Engineer, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.