OpenLID-v3: Improving the Precision of Closely Related Language Identification -- An Experience Report
Summary
OpenLID-v3 is a newly released, fully open-source language identification (LID) system designed to improve precision for closely related languages and to better separate natural language from noise in web data. The extended classifier covers 194 languages plus a "not-a-language" class. It addresses issues found in its predecessor, OpenLID-v2, such as the misclassification of Serbian written in Latin script and the "trash bin phenomenon", where noise was assigned to valid languages. Development involved adding more training data, merging clusters of language variants (for example, 8 Arabic dialects and the Persian varieties), and introducing a zxx_Zxxx label for non-language content. OpenLID-v3 was evaluated against GlotLID and OpenLID-v2 on benchmarks including FLORES+, UDHR, and FastSpell, alongside new datasets for Bosnian/Croatian/Serbian, Romance, and Scandinavian languages, and shows comparable or improved precision. Top-1 ensembling with GlotLID yielded the highest precision, though it sometimes reduced coverage.
Key takeaway
For machine learning engineers building multilingual datasets, OpenLID-v3 offers improved language identification, especially for closely related languages and noisy web content. Consider integrating OpenLID-v3, potentially ensembled with GlotLID, to raise precision in your data curation pipelines. Ensembling may reduce coverage for some low-resource languages, so validate the results carefully before relying on them. Prioritize creating fine-grained, language-specific benchmarks for accurate evaluation.
Key insights
Improving language identification for closely related languages requires specialized benchmarks and handling of non-language content.
Principles
- Standard LID benchmarks are insufficient for similar languages.
- Ensemble models can boost precision but may reduce coverage.
- Explicitly labeling noise (e.g., zxx_Zxxx) improves classification.
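The precision/coverage trade-off of ensembling can be illustrated with a minimal sketch. It assumes one plausible reading of "top-1 ensembling": a label is kept only when both models agree on their top prediction, and the ensemble abstains otherwise. The function names and the toy predictions are hypothetical and are not taken from the OpenLID-v3 or GlotLID code.

```python
from typing import Optional, Sequence, Tuple


def top1_agreement(label_a: str, label_b: str) -> Optional[str]:
    """Keep the label only if both models agree on top-1; otherwise abstain."""
    return label_a if label_a == label_b else None


def precision_and_coverage(
    predictions: Sequence[Optional[str]], gold: Sequence[str]
) -> Tuple[float, float]:
    """Precision over answered items, and the share of items answered at all."""
    answered = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    coverage = len(answered) / len(gold) if gold else 0.0
    precision = (
        sum(p == g for p, g in answered) / len(answered) if answered else 0.0
    )
    return precision, coverage


# Toy data (invented for illustration): two models on four sentences.
gold = ["bos_Latn", "hrv_Latn", "srp_Latn", "hrv_Latn"]
model_a = ["bos_Latn", "hrv_Latn", "hrv_Latn", "hrv_Latn"]
model_b = ["bos_Latn", "hrv_Latn", "srp_Latn", "bos_Latn"]

ensemble = [top1_agreement(a, b) for a, b in zip(model_a, model_b)]

print(precision_and_coverage(model_a, gold))   # single model
print(precision_and_coverage(ensemble, gold))  # agreement ensemble
```

In this toy case the ensemble answers fewer items than either model alone, but every answer it gives is correct, which mirrors the trade-off described in the second principle.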
Method
OpenLID-v3 development involved extending training data, merging problematic language variant clusters (e.g., Arabic dialects), and introducing a zxx_Zxxx class for non-language content. Ensemble approaches with GlotLID were also explored.
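As a rough sketch of what merging variant labels and adding a noise class might look like at the data-preparation stage, the snippet below rewrites training lines. It assumes fastText-style lines of the form `__label__<code> <text>`. The entries in MERGE_MAP are illustrative guesses at how dialect codes could be folded into a macro-label; they are not the mapping actually used for OpenLID-v3.

```python
PREFIX = "__label__"
NOISE_LABEL = "zxx_Zxxx"  # "not-a-language" label named in the summary

# Illustrative mapping only; the real merged clusters may differ.
MERGE_MAP = {
    "arz_Arab": "ara_Arab",
    "ary_Arab": "ara_Arab",
    "pes_Arab": "fas_Arab",
    "prs_Arab": "fas_Arab",
}


def merge_label(label: str) -> str:
    """Map a variant label to its merged label; leave other labels untouched."""
    return MERGE_MAP.get(label, label)


def relabel_line(line: str) -> str:
    """Rewrite the label of one training line, keeping the text as-is."""
    tag, _, text = line.partition(" ")
    if not tag.startswith(PREFIX):
        raise ValueError(f"line does not start with {PREFIX!r}: {line!r}")
    return f"{PREFIX}{merge_label(tag[len(PREFIX):])} {text}"


def noise_line(text: str) -> str:
    """Wrap a non-language snippet as a training example for the noise class."""
    return f"{PREFIX}{NOISE_LABEL} {text}"


print(relabel_line("__label__arz_Arab sample text"))
print(relabel_line("__label__hrv_Latn Dobar dan"))
print(noise_line("<<< 404 >>> 0x3f 0x2a"))
```

Keeping the mapping in a single dictionary makes it easy to audit which variants were merged and to apply the same mapping to gold labels at evaluation time.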
In practice
- Create specific benchmarks for closely related language groups.
- Incorporate a "not-a-language" class in LID models.
- Consider ensembling multiple LID models for precision gains.
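A group-specific benchmark is most useful when it is scored per language rather than as a single accuracy number. The following is a minimal sketch of per-language precision and recall over a small benchmark that also contains a noise item; the function name and the toy labels are assumptions made for illustration and do not reproduce the paper's evaluation code.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def per_language_scores(
    gold: Sequence[str], pred: Sequence[str]
) -> Dict[str, Tuple[float, float]]:
    """Return {label: (precision, recall)} for every label seen in gold or pred."""
    true_pos, pred_count, gold_count = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        gold_count[g] += 1
        pred_count[p] += 1
        if g == p:
            true_pos[g] += 1

    scores = {}
    for label in sorted(set(gold_count) | set(pred_count)):
        precision = true_pos[label] / pred_count[label] if pred_count[label] else 0.0
        recall = true_pos[label] / gold_count[label] if gold_count[label] else 0.0
        scores[label] = (precision, recall)
    return scores


# Toy benchmark (invented): three close languages plus one noise item.
gold = ["bos_Latn", "bos_Latn", "hrv_Latn", "srp_Latn", "zxx_Zxxx"]
pred = ["bos_Latn", "hrv_Latn", "hrv_Latn", "srp_Latn", "hrv_Latn"]

for label, (precision, recall) in per_language_scores(gold, pred).items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

In the toy data, the noise item and one Bosnian sentence are both labeled as Croatian, so Croatian keeps perfect recall while its precision drops. That is the kind of "trash bin" effect an aggregate score would hide.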
Topics
- Language Identification
- OpenLID-v3
- Closely Related Languages
- Multilingual Datasets
- Ensemble Models
- Web Data Curation
Code references
- hplt-project/openlid
- hplt-project/openlid-v3-evaluation
- hplt-project/release3_inspection
- ltgoslo/slide
- cisnlp/GlotLID
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.