Released a free 9.8M doc Indic multilingual corpus — Hindi, Bengali, Tamil, Telugu + 7 more (CC0, HuggingFace) [P]

· Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, quick

Summary

A free multilingual corpus named indic-hplt-v1 has been released on HuggingFace. It comprises approximately 9.8 million web documents (roughly 8.4 billion tokens) across 11 languages: Hindi (hi), Bengali (bn), Tamil (ta), Telugu (te), Marathi (mr), Gujarati (gu), Kannada (kn), Malayalam (ml), Punjabi (pa), Urdu (ur), and English (en). The corpus is distributed under a CC0 license, so it can be used freely for research and development. The release targets the difficulty of finding clean, public-domain data for Indic languages, particularly for tasks such as multilingual translation and preprocessing.
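As a quick sanity check on the headline figures, the average document length can be estimated from the two numbers above. This is a rough sketch: both counts are approximate, and the tokenizer behind the 8.4 billion figure is not stated in the summary, so the result is only an order-of-magnitude guide.

```python
# Approximate figures quoted in the summary above.
TOTAL_DOCS = 9.8e6     # ~9.8 million web documents
TOTAL_TOKENS = 8.4e9   # ~8.4 billion tokens (tokenizer unspecified)

# Mean tokens per document across the whole corpus.
avg_tokens_per_doc = TOTAL_TOKENS / TOTAL_DOCS

print(f"~{avg_tokens_per_doc:.0f} tokens per document on average")
```

An average in the high hundreds of tokens suggests typical web-page-length documents, which may matter when choosing a sequence length or chunking strategy.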

Key takeaway

For AI engineers and researchers working with Indic languages, this 9.8 million document, 11-language corpus on HuggingFace is a useful resource. Because it is CC0-licensed, you can integrate it into multilingual model training and natural language processing pipelines without licensing restrictions, especially for translation or preprocessing tasks where clean public-domain Indic data has been hard to find.
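A minimal sketch of how such a corpus might be pulled into a pipeline, using the `datasets` library in streaming mode and keeping only selected languages. Several names here are assumptions, not details from the release: the repository namespace (`<namespace>` is a placeholder to replace), the `train` split, and the `lang` and `text` field names. Check the dataset card for the actual values.

```python
from typing import Dict, Iterable, Iterator, Set

# Placeholder repo id: the summary names the corpus "indic-hplt-v1"
# but does not give its HuggingFace namespace.
REPO_ID = "<namespace>/indic-hplt-v1"

# The 11 language codes listed in the summary.
LANGS: Set[str] = {"hi", "bn", "ta", "te", "mr", "gu", "kn", "ml", "pa", "ur", "en"}


def stream_corpus(repo_id: str = REPO_ID, split: str = "train"):
    """Open the corpus lazily so the full dataset is not downloaded up front.

    The split name is an assumption; the import is local so the rest of
    this sketch runs without the `datasets` package installed.
    """
    from datasets import load_dataset
    return load_dataset(repo_id, split=split, streaming=True)


def filter_by_language(
    records: Iterable[Dict[str, str]],
    wanted: Set[str],
    lang_field: str = "lang",  # assumed field name
) -> Iterator[Dict[str, str]]:
    """Yield only the records whose language code is in `wanted`."""
    unknown = wanted - LANGS
    if unknown:
        raise ValueError(f"Not among the corpus languages: {sorted(unknown)}")
    for record in records:
        if record.get(lang_field) in wanted:
            yield record


# Tiny in-memory stand-in for streamed records (hypothetical schema).
sample = [
    {"lang": "hi", "text": "doc 1"},
    {"lang": "en", "text": "doc 2"},
    {"lang": "ta", "text": "doc 3"},
]
hindi_tamil = list(filter_by_language(sample, {"hi", "ta"}))
```

With a real repository id in place, `filter_by_language(stream_corpus(), {"hi", "ta"})` would apply the same filter to the streamed corpus, provided the field names match.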

Key insights

A new 9.8M document, 11-language Indic multilingual corpus is now freely available under CC0.

Best for: AI Engineer, Machine Learning Engineer, AI Scientist, NLP Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.