Released a free 9.8M doc Indic multilingual corpus — Hindi, Bengali, Tamil, Telugu + 7 more (CC0, HuggingFace) [P]
Summary
A new, free multilingual corpus, indic-hplt-v1, has been released on HuggingFace. It comprises approximately 9.8 million web documents and roughly 8.4 billion tokens across 11 languages: Hindi (hi), Bengali (bn), Tamil (ta), Telugu (te), Marathi (mr), Gujarati (gu), Kannada (kn), Malayalam (ml), Punjabi (pa), Urdu (ur), and English (en). The corpus is distributed under a CC0 license, so it can be used freely for research and development. The release targets a persistent problem: clean, public-domain data for Indic languages is hard to find, particularly for tasks such as multilingual translation and preprocessing.
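As a rough sanity check on the quoted figures, the average document length can be worked out directly. This is a back-of-the-envelope sketch only: both counts are approximate, and the summary does not say which tokenizer produced the token count, so the result is an order-of-magnitude guide rather than a precise statistic.

```python
# Approximate figures quoted in the summary above.
NUM_DOCUMENTS = 9_800_000        # ~9.8 million web documents
NUM_TOKENS = 8_400_000_000       # ~8.4 billion tokens (tokenizer unspecified)

# Mean tokens per document across the whole corpus.
avg_tokens_per_doc = NUM_TOKENS / NUM_DOCUMENTS

print(f"{avg_tokens_per_doc:.0f} tokens per document on average")
```

This gives an average of a bit under 900 tokens per document. Per-language averages may differ considerably, since the summary gives no per-language breakdown.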
Key takeaway
For AI engineers and researchers working with Indic languages, this 9.8 million document, 11-language corpus on HuggingFace is a useful resource. Consider adding this CC0-licensed dataset to multilingual model training and natural language processing pipelines, especially for translation or preprocessing tasks where data scarcity is a common obstacle.
Key insights
A new 9.8M document, 11-language Indic multilingual corpus is now freely available under CC0.
Principles
- Public domain data for Indic languages is scarce.
- Large, clean multilingual corpora aid research.
In practice
- Use it for multilingual translation tasks.
- Apply it to Indic language preprocessing.
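One preprocessing step that fits web-crawled Indic text is a script-consistency filter: keep a document only if most of its letters are in the script expected for its language tag. The sketch below is illustrative and is not part of the corpus tooling. The `text` and `lang` field names, the 0.7 threshold, and the helper names are all hypothetical, and it assumes Punjabi is written in Gurmukhi and Urdu in the basic Arabic block, which may not hold for every document.

```python
import unicodedata

# Unicode block ranges for the scripts of the corpus languages.
SCRIPT_RANGES = {
    "Arabic": (0x0600, 0x06FF),      # Urdu (basic Arabic block only)
    "Devanagari": (0x0900, 0x097F),  # Hindi, Marathi
    "Bengali": (0x0980, 0x09FF),
    "Gurmukhi": (0x0A00, 0x0A7F),    # Punjabi (assumed Gurmukhi)
    "Gujarati": (0x0A80, 0x0AFF),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
    "Kannada": (0x0C80, 0x0CFF),
    "Malayalam": (0x0D00, 0x0D7F),
}

# Script expected for each language code listed in the summary.
EXPECTED_SCRIPT = {
    "hi": "Devanagari", "mr": "Devanagari", "bn": "Bengali",
    "pa": "Gurmukhi", "gu": "Gujarati", "ta": "Tamil",
    "te": "Telugu", "kn": "Kannada", "ml": "Malayalam",
    "ur": "Arabic", "en": "Latin",
}


def script_of(ch):
    """Return the script name for a character, or None if it is not counted."""
    cp = ord(ch)
    for name, (low, high) in SCRIPT_RANGES.items():
        if low <= cp <= high:
            return name
    if ch.isascii() and ch.isalpha():
        return "Latin"
    return None  # spaces, digits, punctuation, other scripts


def script_share(text, script):
    """Fraction of counted characters in `text` that belong to `script`."""
    text = unicodedata.normalize("NFC", text)
    counted = [s for s in map(script_of, text) if s is not None]
    if not counted:
        return 0.0
    return counted.count(script) / len(counted)


def keep_document(doc, min_share=0.7):
    """Keep a document if enough of it is in the expected script."""
    expected = EXPECTED_SCRIPT.get(doc["lang"])
    if expected is None:
        return False
    return script_share(doc["text"], expected) >= min_share


# Hypothetical records; the real corpus schema may differ.
docs = [
    {"lang": "hi", "text": "नमस्ते दुनिया"},
    {"lang": "ta", "text": "hello world"},
]
kept = [d for d in docs if keep_document(d)]
```

In this toy run only the Hindi record survives, because the record tagged `ta` contains no Tamil characters. A real pipeline would likely need a lower threshold for pages that mix English with an Indic language.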
Topics
- Indic Languages
- Multilingual Corpus
- HuggingFace Datasets
- CC0 License
- Natural Language Processing
Best for: AI Engineer, Machine Learning Engineer, AI Scientist, NLP Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.