A Community Roadmap for Including Kapampángan in Modern AI Research
Summary
Kapampángan, a major Philippine language spoken by over 2 million people, is significantly underrepresented in modern AI research, which leads to poor performance in tools such as ChatGPT and Facebook's translations. Although it is included in Glot500 and has a Wikipedia edition with 10,000 articles, it lacks the dedicated, verified datasets that Natural Language Processing (NLP) research depends on. Meta's FLORES-200 benchmark and NLLB models cover five other Philippine languages but exclude Kapampángan. The article attributes this exclusion to a scarcity of Kapampángan content online, driven by cultural, economic, and systemic factors. It proposes a community-led roadmap for building high-quality, structured data (parallel corpora, evaluation benchmarks, and lexical resources) to enable rigorous NLP research and improve real-world applications.
Key takeaway
For AI researchers and language preservationists focused on low-resource languages, this analysis highlights the critical need for community-led data initiatives. Rather than waiting for external corporate support, prioritize creating high-quality, verified parallel corpora and evaluation benchmarks, and make them openly available on platforms like Hugging Face. This proactive approach, exemplified by Masakhane, is presented as the most impactful way to enable rigorous NLP research and ensure your language's inclusion in future AI advancements.
Key insights
Community-driven data creation is essential for including underrepresented languages in AI research and applications.
Principles
- Verified, structured data is foundational for NLP research.
- Small, high-quality datasets can enable meaningful NLP work.
- Openly sharing data is critical for global research accessibility.
Method
The proposed method has three parts: build parallel corpora, evaluation benchmarks, and monolingual text collections; fine-tune existing multilingual models on that data; and partner with institutions to publish the datasets openly.
In practice
- Contribute Kapampángan↔English sentence pairs.
- Create a 1,000-sentence evaluation benchmark.
- Publish datasets on Hugging Face or GitHub.
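
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the article's own tooling: the function names, the `pam`/`eng` field names (the ISO 639-3 codes for Kapampángan and English), and the JSON Lines output format are all assumptions made for the example. It normalizes and deduplicates contributed sentence pairs, holds out a fixed-size evaluation benchmark that shares no source sentence with the training split, and serializes the result in a form that can be uploaded to Hugging Face or GitHub.

```python
import hashlib
import json
import unicodedata


def normalize(text: str) -> str:
    """NFC-normalize and collapse whitespace so accented letters compare equal."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def clean_pairs(raw_pairs):
    """Drop empty and duplicate (Kapampangan, English) pairs, keeping input order."""
    seen = set()
    cleaned = []
    for pam, eng in raw_pairs:
        pam, eng = normalize(pam), normalize(eng)
        if not pam or not eng:
            continue  # skip pairs with a missing side
        key = (pam.casefold(), eng.casefold())
        if key in seen:
            continue  # skip case-insensitive duplicates
        seen.add(key)
        cleaned.append({"pam": pam, "eng": eng})
    return cleaned


def split_benchmark(pairs, benchmark_size=1000):
    """Hold out `benchmark_size` pairs for evaluation, chosen by hash rank.

    Ranking by a hash of the source sentence makes the split deterministic
    and independent of the order in which pairs were contributed.
    """
    ranked = sorted(
        pairs,
        key=lambda p: hashlib.sha256(p["pam"].encode("utf-8")).hexdigest(),
    )
    benchmark = ranked[:benchmark_size]
    held_out = {p["pam"] for p in benchmark}
    # Exclude every held-out source sentence from training to avoid leakage.
    train = [p for p in pairs if p["pam"] not in held_out]
    return train, benchmark


def to_jsonl(pairs) -> str:
    """Serialize pairs as JSON Lines, keeping non-ASCII characters readable."""
    return "\n".join(json.dumps(p, ensure_ascii=False) for p in pairs)


if __name__ == "__main__":
    # Illustrative pairs only; real data should be verified by native speakers.
    contributed = [
        ("Mayap a abak", "Good morning"),
        ("Mayap  a abak", "good morning"),  # duplicate after normalization
        ("Dakal a salamat", "Thank you very much"),
    ]
    train, benchmark = split_benchmark(clean_pairs(contributed), benchmark_size=1)
    print(to_jsonl(train))
    print(to_jsonl(benchmark))
```

Selecting the benchmark by hash rank rather than by random sampling means that anyone who reruns the script on the same cleaned data gets the same held-out set, which keeps evaluation results comparable across contributors.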
Topics
- Low-Resource Languages
- Natural Language Processing
- Machine Translation
- Data Curation
- Community-driven AI
Best for: AI Researcher, Machine Learning Engineer, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by NLP on Medium.