OpenLID-v3: Improving the Precision of Closely Related Language Identification -- An Experience Report

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Software Development & Engineering · Depth: Expert, extended

Summary

OpenLID-v3 is a newly released, fully open-source language identification (LID) system designed to improve precision on closely related languages and to separate natural language from noise in web data. The classifier covers 194 languages plus a "not-a-language" class. It addresses problems found in its predecessor, OpenLID-v2, such as the misclassification of Serbian written in Latin script and the "trash bin phenomenon", in which noise is assigned to valid language labels. Development involved adding training data, merging hard-to-separate language variants (such as eight Arabic dialects and the Persian varieties), and introducing a zxx_Zxxx label for non-language content. OpenLID-v3 was evaluated against GlotLID and OpenLID-v2 on benchmarks including FLORES+, UDHR, and FastSpell, as well as on new datasets for Bosnian/Croatian/Serbian, Romance, and Scandinavian languages, and shows comparable or improved precision. Top-1 ensembling with GlotLID yielded the highest precision, though it sometimes reduced coverage.

Key takeaway

For machine learning engineers building multilingual datasets, OpenLID-v3 offers more precise language identification, especially for closely related languages and noisy web content. Consider integrating it into your data curation pipelines, potentially ensembled with GlotLID, to raise precision. Ensembling can reduce coverage for some low-resource languages, so validate the results per language before adopting it. Prioritize fine-grained, language-specific benchmarks for accurate evaluation.
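The trade-off between precision and coverage mentioned above can be checked on any labelled benchmark. The following is a minimal sketch, not the paper's evaluation code: the function name precision_and_coverage and the convention that None marks an abstained prediction are assumptions made for illustration.

```python
from collections import Counter

def precision_and_coverage(gold, predicted, abstain=None):
    """Return (per-language precision, overall coverage).

    gold and predicted are equal-length lists of language labels.
    A prediction equal to `abstain` means the classifier gave no answer
    (hypothetical convention, not taken from the paper).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    predicted_counts = Counter()  # how often each label was predicted
    correct_counts = Counter()    # how often that prediction was right
    answered = 0                  # number of non-abstained predictions

    for g, p in zip(gold, predicted):
        if p == abstain:
            continue
        answered += 1
        predicted_counts[p] += 1
        if p == g:
            correct_counts[p] += 1

    # Precision is only defined for labels that were predicted at least once.
    precision = {
        lang: correct_counts[lang] / n for lang, n in predicted_counts.items()
    }
    coverage = answered / len(gold) if gold else 0.0
    return precision, coverage
```

Running this once per benchmark (for example, a Bosnian/Croatian/Serbian test set) makes it visible when an ensemble gains precision on one language by abstaining on many documents of another.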

Key insights

Improving language identification for closely related languages requires specialized benchmarks and handling of non-language content.

Method

To build OpenLID-v3, the authors extended the training data, merged clusters of language variants that were hard to tell apart (e.g., Arabic dialects), and introduced a zxx_Zxxx class for non-language content. They also explored ensembling with GlotLID.
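The ingredients described above can be sketched as a small agreement-based ensemble. This is an illustration under stated assumptions, not the authors' implementation: it assumes that top-1 ensembling means keeping a label only when both classifiers agree on their top prediction, that each classifier is wrapped as a plain function from text to a label string, and that merged variants map to a single label. The merge table and the "ara_Arab" target label are hypothetical examples; only zxx_Zxxx comes from the source.

```python
from typing import Callable, Optional

# Label for non-language content, as named in the summary.
NOT_A_LANGUAGE = "zxx_Zxxx"

# Hypothetical merge table: variant label -> merged label.
# The real mapping used by OpenLID-v3 is not given in this summary.
MERGED_LABELS = {
    "arz_Arab": "ara_Arab",
    "ary_Arab": "ara_Arab",
}

Classifier = Callable[[str], str]  # text -> top-1 label (assumed wrapper)

def normalize(label: str) -> str:
    """Map a variant label to its merged label, if one is defined."""
    return MERGED_LABELS.get(label, label)

def ensemble_label(text: str, model_a: Classifier, model_b: Classifier) -> Optional[str]:
    """Return the shared top-1 label, or None when the models disagree."""
    a = normalize(model_a(text))
    b = normalize(model_b(text))
    return a if a == b else None

def is_usable(label: Optional[str]) -> bool:
    """Keep a document only if the models agreed on a real language."""
    return label is not None and label != NOT_A_LANGUAGE
```

Returning None on disagreement is what trades coverage for precision: documents on which the two models differ are dropped rather than guessed, which is one plausible reason an ensemble loses coverage on low-resource languages.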

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.