A Community Roadmap for Including Kapampángan in Modern AI Research

2026-03-21 · Source: NLP on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, long

Summary

Kapampángan, a major Philippine language with more than 2 million speakers, is significantly underrepresented in modern AI research, and tools such as ChatGPT and Facebook's translations handle it poorly. It is included in Glot500 and has a Wikipedia with 10,000 articles, but it lacks the dedicated, verified datasets that Natural Language Processing (NLP) research depends on. Meta's FLORES-200 benchmark and NLLB models cover five other Philippine languages but not Kapampángan. This exclusion stems from a scarcity of Kapampángan content online, which is driven by cultural, economic, and systemic factors. The article proposes a community-led roadmap for building high-quality, structured data (parallel corpora, evaluation benchmarks, and lexical resources) to enable rigorous NLP research and improve real-world applications.

Key takeaway

For AI researchers and language preservationists focused on low-resource languages, this analysis highlights the critical need for community-led data initiatives. Rather than waiting for external corporate support, you should prioritize creating high-quality, verified parallel corpora and evaluation benchmarks and making them openly available on platforms like Hugging Face. This proactive approach, exemplified by Masakhane, is the most impactful way to enable rigorous NLP research and ensure your language's inclusion in future AI advancements.

Key insights

Community-driven data creation is essential for including underrepresented languages in AI research and applications.

Principles

Verified, structured data is foundational for NLP research.
Small, high-quality datasets can enable meaningful NLP work.
Openly sharing data is critical for global research accessibility.

Method

The proposed method has three parts: build parallel corpora, evaluation benchmarks, and monolingual text collections; fine-tune existing multilingual models on that data; and partner with institutions to publish the datasets openly.
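One practical detail when the same community effort produces both fine-tuning data and an evaluation benchmark is keeping the two apart, so the benchmark is never seen during fine-tuning. The sketch below is one possible way to do that, not something described in the original article: it assigns each sentence pair to a split by hashing the normalized Kapampángan side, so the assignment is stable across contributors and runs. The function names, the 10% default, and the placeholder sentences are all assumptions made for illustration.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """NFC-normalize, lowercase, and collapse whitespace so that
    trivially different copies of a sentence hash to the same value."""
    return " ".join(unicodedata.normalize("NFC", text).lower().split())


def assign_split(pam_sentence: str, benchmark_percent: int = 10) -> str:
    """Deterministically map a sentence to 'benchmark' or 'train'.

    benchmark_percent is a hypothetical knob, not a value from the article.
    """
    digest = hashlib.sha256(normalize(pam_sentence).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # value in 0..99
    return "benchmark" if bucket < benchmark_percent else "train"


def split_pairs(pairs, benchmark_percent: int = 10):
    """Group (kapampangan, english) pairs into train and benchmark lists."""
    splits = {"train": [], "benchmark": []}
    for pam, eng in pairs:
        splits[assign_split(pam, benchmark_percent)].append((pam, eng))
    return splits


if __name__ == "__main__":
    # Placeholder strings stand in for real, verified sentence pairs.
    demo = [(f"pam sentence {i}", f"eng sentence {i}") for i in range(20)]
    result = split_pairs(demo)
    print(len(result["train"]), len(result["benchmark"]))
```

Because the split depends only on the sentence text, a pair submitted twice by different contributors always lands in the same split, which avoids accidental leakage from the benchmark into the fine-tuning set.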

In practice

Contribute Kapampángan↔English sentence pairs.
Create a 1,000-sentence evaluation benchmark.
Publish datasets on Hugging Face or GitHub.
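The steps above can be sketched as a small cleaning and export script. This is a minimal illustration under stated assumptions rather than the article's own tooling: the record keys `pam` (the ISO 639-3 code for Kapampángan) and `eng`, the function names, and the choice of JSON Lines as the file format are all hypothetical. JSON Lines is used here only because it is a plain-text format that is easy to review and to host in a repository.

```python
import json
import unicodedata


def tidy(text: str) -> str:
    """NFC-normalize and collapse whitespace, preserving diacritics."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def clean_pairs(raw_pairs):
    """Drop empty and duplicate pairs.

    Returns a list of {"pam": ..., "eng": ...} records in first-seen order.
    """
    seen = set()
    records = []
    for pam, eng in raw_pairs:
        pam, eng = tidy(pam), tidy(eng)
        if not pam or not eng:
            continue  # skip pairs with a missing side
        key = (pam.casefold(), eng.casefold())
        if key in seen:
            continue  # skip case-insensitive duplicates
        seen.add(key)
        records.append({"pam": pam, "eng": eng})
    return records


def to_jsonl(records) -> str:
    """Serialize records as JSON Lines, keeping non-ASCII characters readable."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


if __name__ == "__main__":
    # Placeholder strings stand in for real, human-verified sentence pairs.
    raw = [("pam sentence 1", "eng sentence 1"),
           ("pam  sentence 1", "Eng sentence 1"),  # duplicate after tidying
           ("", "eng sentence 2")]                 # missing Kapampángan side
    with open("pam_eng_pairs.jsonl", "w", encoding="utf-8") as f:
        f.write(to_jsonl(clean_pairs(raw)))
```

Automated cleaning of this kind only removes mechanical problems; it does not replace verification of each pair by fluent speakers, which is what makes the resulting dataset trustworthy.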

Topics

Low-Resource Languages
Natural Language Processing
Machine Translation
Data Curation
Community-driven AI

Best for: AI Researcher, Machine Learning Engineer, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by NLP on Medium.