Does anyone have a copy of the ICDAR2013 Chinese Handwriting Competition Dataset? [R]

2026-05-29 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, quick

Summary

A user is looking for the ICDAR2013 Chinese Handwriting Recognition Competition Dataset because its official download page at "https://nlpr.ia.ac.cn/databases/handwriting/Download.html" has been unreachable for weeks. Searches across Kaggle, HuggingFace, Google Drive, Baidu Netdisk, and GitHub, along with advanced Google dorking, turned up no direct copy of the original dataset. An update to the post reports a partial solution: the "shannanyinxiang/PageNet" GitHub repository contains the dataset preprocessed into LMDB format. This version lacks line-level and word-level annotations, but character-level annotations from the "newbie2niubility/TGC-Diff" repository could potentially be used to reconstruct them. Any such reconstruction would have to account for the fact that the TGC-Diff annotations include page boundaries and margins, whereas the PageNet LMDB images are cropped.

Key takeaway

AI scientists and NLP engineers who need the ICDAR2013 Chinese Handwriting Recognition Dataset should know that the official source is offline. Check community-maintained GitHub repositories such as "shannanyinxiang/PageNet" for preprocessed versions first. Rebuilding something closer to the full dataset may require combining sources, for example taking character-level annotations from "newbie2niubility/TGC-Diff", and carefully reconciling formatting discrepancies such as page boundaries and margins.

Key insights

The ICDAR2013 Chinese Handwriting Dataset is currently difficult to obtain, but a partial copy preprocessed into LMDB format exists in the "shannanyinxiang/PageNet" repository.

Principles

Official dataset links can become unavailable.
Community repositories may offer preprocessed data.
Data reconstruction requires careful annotation alignment.

Method

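To reconstruct fuller annotations for the ICDAR2013 pages stored in the PageNet LMDB, the character-level annotations from TGC-Diff could potentially be mapped onto the PageNet images. The mapping has to correct for the page boundaries and margins that the TGC-Diff annotations include but the cropped PageNet images do not.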
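
The coordinate correction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source does not describe the annotation format of either repository, so the `Box` layout, the function names, and the idea that the crop is a pure translation (no rescaling) with a known left/top offset are all hypothetical. In practice the offset would have to be estimated per page.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Box:
    """Axis-aligned character box in pixels (hypothetical layout, not a real schema)."""
    x_min: int
    y_min: int
    x_max: int
    y_max: int


def shift_to_crop(box: Box, crop_left: int, crop_top: int) -> Box:
    """Translate a full-page box into the coordinate frame of a cropped image.

    Assumes the cropped image is the full page with `crop_left` pixels removed
    on the left and `crop_top` pixels removed on top, and no rescaling.
    """
    return Box(
        box.x_min - crop_left,
        box.y_min - crop_top,
        box.x_max - crop_left,
        box.y_max - crop_top,
    )


def align_page(boxes: Iterable[Box], crop_left: int, crop_top: int) -> List[Box]:
    """Apply the same margin offset to every character box on one page."""
    return [shift_to_crop(b, crop_left, crop_top) for b in boxes]
```

For example, `align_page([Box(120, 80, 150, 110)], crop_left=100, crop_top=50)` moves the box to `Box(20, 30, 50, 60)`. If the two sources also differ in resolution, a scale factor would be needed in addition to the offset.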

In practice

Check GitHub for preprocessed dataset versions.
Cross-reference multiple repositories for annotations.
Verify annotation alignment for cropped images.
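
One simple way to approach the last check is to measure how many annotation boxes actually fall inside the cropped image after alignment. The sketch below assumes boxes are plain `(x_min, y_min, x_max, y_max)` tuples in pixels and that the image width and height are known; the function name and the tuple layout are illustrative choices, not part of either repository.

```python
from typing import Iterable, Tuple

BoxTuple = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), hypothetical layout


def fraction_inside(boxes: Iterable[BoxTuple], width: int, height: int) -> float:
    """Return the share of boxes that lie fully inside a width x height image.

    A box counts as inside only if it has positive area and none of its
    edges cross the image border. An empty input returns 0.0.
    """
    boxes = list(boxes)
    if not boxes:
        return 0.0
    inside = sum(
        1
        for (x0, y0, x1, y1) in boxes
        if 0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height
    )
    return inside / len(boxes)
```

A value well below 1.0 for a page suggests the margin offset is wrong for that page. A value of 1.0 does not prove correct alignment, so a visual spot check of a few pages is still worthwhile.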

Topics

ICDAR2013 Dataset
Chinese Handwriting Recognition
Dataset Availability
LMDB Format
Data Reconstruction
Annotation Alignment

Best for: Research Scientist, Machine Learning Engineer, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.