AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages
Summary
AfriVoices-KE is a new large-scale multilingual speech dataset offering approximately 3,000 hours of audio across five Kenyan languages: Dholuo, Kikuyu, Kalenjin, Maasai, and Somali. It includes 750 hours of scripted speech and 2,250 hours of spontaneous speech, collected from 4,777 native speakers across various regions and demographics. The dataset addresses the significant underrepresentation of African languages in speech technology. Contributors recorded through a customized mobile application: scripted speech was read from compiled text corpora, and unscripted speech was elicited with textual and image prompts. Quality assurance combined automated signal-to-noise ratio validation with human content review. Despite challenges such as unreliable infrastructure and device compatibility problems, the project produced a foundational resource for inclusive speech technology and linguistic preservation.
Key takeaway
For NLP Engineers and AI Scientists developing speech technologies for African languages, AfriVoices-KE offers a critical resource for building more inclusive automatic speech recognition and text-to-speech systems. Training on it can improve model performance for these languages and help address linguistic bias. Consider integrating the dataset into your training pipelines to improve representation for Dholuo, Kikuyu, Kalenjin, Maasai, and Somali.
Key insights
AfriVoices-KE provides a large, diverse speech dataset for five underrepresented Kenyan languages.
Principles
- Diverse data collection improves linguistic representation.
- Community engagement is vital for low-resource data projects.
Method
Scripted speech was recorded from translated text corpora, while spontaneous speech was elicited with textual and image prompts. Recordings were collected through a custom mobile app and passed through multi-layer quality assurance.
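The summary mentions automated signal-to-noise ratio validation but does not say how it was computed. The following is a minimal sketch of one common energy-based approach, not the project's actual method: the function names, the 400-sample frame length, the 20% noise fraction, and the 15 dB threshold are all illustrative assumptions.

```python
import math


def frame_energies(samples, frame_len=400):
    """Return the mean power of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def estimate_snr_db(samples, frame_len=400, noise_fraction=0.2):
    """Rough SNR estimate: treat the quietest frames as the noise floor.

    Assumption: the lowest-energy `noise_fraction` of frames contain
    only background noise; the remaining frames contain speech.
    """
    energies = sorted(frame_energies(samples, frame_len))
    if len(energies) < 2:
        raise ValueError("recording too short to estimate SNR")
    # Keep at least one frame on each side of the split.
    n_noise = min(max(1, int(len(energies) * noise_fraction)), len(energies) - 1)
    noise_power = sum(energies[:n_noise]) / n_noise
    speech_frames = energies[n_noise:]
    speech_power = sum(speech_frames) / len(speech_frames)
    eps = 1e-12  # avoids division by zero on digital silence
    return 10.0 * math.log10((speech_power + eps) / (noise_power + eps))


def passes_snr_check(samples, min_snr_db=15.0):
    """Hypothetical gate: accept a recording only above the threshold."""
    return estimate_snr_db(samples) >= min_snr_db
```

In a pipeline like the one described, a check of this kind would run before human content review, so that reviewers only see recordings that clear the automated gate. The input here is assumed to be a list of float samples in the range -1.0 to 1.0.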
In practice
- Use mobile apps for distributed data collection.
- Implement multi-layered quality assurance.
- Partner with local mobilizers for community trust.
Topics
- AfriVoices-KE
- Multilingual Speech Dataset
- Kenyan Languages
- Automatic Speech Recognition
- Text-to-Speech
Best for: AI Scientist, NLP Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.