A Community Roadmap for Including Kapampángan in Modern AI Research

Source: NLP on Medium · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Intermediate, long

Summary

Kapampángan, a major Philippine language spoken by over 2 million people, is significantly underrepresented in modern AI research, which leads to poor performance in tools like ChatGPT and Facebook translations. Although it is included in Glot500 and has a Wikipedia with 10,000 articles, it lacks the dedicated, verified datasets that Natural Language Processing (NLP) research depends on. Meta's FLORES-200 benchmark and NLLB models both exclude Kapampángan while covering five other Philippine languages. This exclusion stems from a scarcity of Kapampángan content online, driven by cultural, economic, and systemic factors. The article proposes a community-led roadmap focused on building high-quality, structured data (parallel corpora, evaluation benchmarks, and lexical resources) to enable rigorous NLP research and improve real-world applications.

Key takeaway

For AI researchers and language preservationists focused on low-resource languages, this analysis highlights the critical need for community-led data initiatives. Prioritize creating high-quality, verified parallel corpora and evaluation benchmarks, and make them openly available on platforms like Hugging Face. This proactive approach, exemplified by Masakhane, is presented as the most impactful way to enable rigorous NLP research and to get your language included in future AI advancements, rather than waiting for external corporate support.

Key insights

Community-driven data creation is essential for including underrepresented languages in AI research and applications.

Method

The proposed method has three parts: build parallel corpora, evaluation benchmarks, and monolingual text; fine-tune existing multilingual models on that data; and partner with institutions to publish the datasets openly.
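
As a rough illustration of the data-building step, here is a minimal Python sketch of how a community project might clean verified sentence pairs and hold out a stable evaluation split before publishing. It is not taken from the original article. The `SentencePair` record, the `verified` flag, the 10% test share, and the helper names are all assumptions made for this example, and the sample phrases are placeholders that fluent speakers would still need to check. Only the language code `pam` (ISO 639-3 for Kapampángan) is standard.

```python
import hashlib
import json
import unicodedata
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SentencePair:
    """One aligned sentence pair (hypothetical schema for this sketch)."""
    pam: str        # Kapampángan sentence ("pam" is the ISO 639-3 code)
    eng: str        # English translation
    verified: bool  # True once a fluent speaker has checked the pair


def normalize(text: str) -> str:
    """Apply Unicode NFC and collapse whitespace so duplicates compare equal."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def clean(pairs):
    """Keep verified, non-empty pairs and drop exact duplicates."""
    seen = set()
    kept = []
    for p in pairs:
        pam, eng = normalize(p.pam), normalize(p.eng)
        if not (p.verified and pam and eng):
            continue
        if (pam, eng) in seen:
            continue
        seen.add((pam, eng))
        kept.append(SentencePair(pam, eng, True))
    return kept


def split_name(pair, test_percent=10):
    """Assign a pair to 'test' or 'train' from a hash of its source text.

    Hashing keeps the assignment stable across runs, so a held-out
    benchmark sentence does not drift into the fine-tuning data as
    the corpus grows.
    """
    digest = hashlib.sha256(pair.pam.encode("utf-8")).hexdigest()
    return "test" if int(digest, 16) % 100 < test_percent else "train"


def to_jsonl(pairs):
    """Serialize pairs as JSON Lines, keeping diacritics human-readable."""
    return "\n".join(json.dumps(asdict(p), ensure_ascii=False) for p in pairs)


if __name__ == "__main__":
    # Placeholder pairs for illustration only; real entries need speaker review.
    raw = [
        SentencePair("Mayap  a abak", "Good morning", True),
        SentencePair("Mayap a abak", "Good morning", True),   # duplicate
        SentencePair("Dakal a salamat", "Thank you very much", False),  # unverified
    ]
    for pair in clean(raw):
        print(split_name(pair), to_jsonl([pair]))
```

JSON Lines is used here because it is a common plain-text format for sharing datasets, but the choice of format and of split ratio is a design decision for the community to make, not something the article prescribes.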

Best for: AI Researcher, Machine Learning Engineer, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by NLP on Medium.