AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages

2026-04-09 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, quick

Summary

AfriVoices-KE is a new large-scale multilingual speech dataset offering approximately 3,000 hours of audio across five Kenyan languages: Dholuo, Kikuyu, Kalenjin, Maasai, and Somali. It includes 750 hours of scripted speech and 2,250 hours of spontaneous speech, collected from 4,777 native speakers across various regions and demographics. The dataset addresses the significant underrepresentation of African languages in speech technology. Scripted recordings were drawn from compiled text corpora, while unscripted speech was elicited with textual and image prompts; contributors recorded through a customized mobile application. Quality assurance combined automated signal-to-noise ratio validation with human content review. Despite challenges such as unreliable infrastructure and device compatibility issues, the project produced a foundational resource for inclusive speech technology and linguistic preservation.

Key takeaway

For NLP engineers and AI scientists developing speech technologies for African languages, AfriVoices-KE is a key resource for building more inclusive automatic speech recognition (ASR) and text-to-speech (TTS) systems. Training on it can improve model performance on these languages and help reduce linguistic bias. Consider adding the dataset to your training pipelines to improve coverage of Dholuo, Kikuyu, Kalenjin, Maasai, and Somali.

Key insights

AfriVoices-KE provides approximately 3,000 hours of scripted and spontaneous speech from 4,777 native speakers across five underrepresented Kenyan languages.

Principles

Diverse data collection improves linguistic representation.
Community engagement is vital for low-resource data projects.

Method

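Scripted speech was recorded from translated text corpora, while spontaneous speech was elicited with textual and image prompts. Contributors recorded through a custom mobile app, and multi-layer quality assurance combined automated signal-to-noise ratio validation with human content review.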
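
The summary does not say how the signal-to-noise ratio was computed or what threshold was applied, so the following is only a minimal sketch of one common approach, not the project's method. It assumes the quietest frames of a clip approximate background noise and compares their power with the remaining frames. The function names, the 400-sample frame length, the 10% noise fraction, and the 15 dB threshold are all illustrative assumptions.

```python
import math

def frame_powers(samples, frame_len=400):
    """Mean power of each non-overlapping frame of the clip."""
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        powers.append(sum(x * x for x in frame) / frame_len)
    return powers

def estimate_snr_db(samples, frame_len=400, noise_fraction=0.1):
    """Rough SNR estimate in dB.

    Assumption: the lowest-power frames contain only background noise.
    The remaining frames still include noise, so this slightly
    overestimates the true speech-to-noise ratio.
    """
    powers = sorted(frame_powers(samples, frame_len))
    if len(powers) < 2:
        raise ValueError("clip too short to estimate SNR")
    n_noise = max(1, int(len(powers) * noise_fraction))
    noise_power = sum(powers[:n_noise]) / n_noise
    speech_power = sum(powers[n_noise:]) / (len(powers) - n_noise)
    eps = 1e-12  # avoids division by zero on digitally silent frames
    return 10.0 * math.log10((speech_power + eps) / (noise_power + eps))

def passes_snr_check(samples, min_snr_db=15.0):
    """Hypothetical automated gate: accept clips at or above the threshold."""
    return estimate_snr_db(samples) >= min_snr_db
```

A clip with a quiet lead-in followed by loud speech scores high under this estimate, while a clip whose level never drops scores near 0 dB, because the estimator finds no quiet frames to treat as noise. In practice such clips would need a different noise estimate or a human check.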

In practice

Use mobile apps for distributed data collection.
Implement multi-layered quality assurance.
Partner with local mobilizers for community trust.
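
The layered quality assurance recommended above could be wired together as a simple two-stage gate: an automated check runs first, and only clips that pass it reach a human reviewer. This is a sketch under stated assumptions. The Clip fields, the status labels, and the 15 dB threshold are hypothetical and do not come from the dataset's documentation.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Status(Enum):
    REJECTED_AUTO = "rejected_auto"      # failed the automated SNR layer
    PENDING_REVIEW = "pending_review"    # passed SNR, awaiting a reviewer
    REJECTED_REVIEW = "rejected_review"  # reviewer rejected the content
    ACCEPTED = "accepted"                # passed both layers

@dataclass
class Clip:
    clip_id: str
    language: str
    snr_db: float                              # precomputed by the automated layer
    reviewer_approved: Optional[bool] = None   # None means not yet reviewed

def qa_status(clip, min_snr_db=15.0):
    """Apply the layers in order: automated SNR gate, then human review."""
    if clip.snr_db < min_snr_db:
        return Status.REJECTED_AUTO
    if clip.reviewer_approved is None:
        return Status.PENDING_REVIEW
    return Status.ACCEPTED if clip.reviewer_approved else Status.REJECTED_REVIEW

def summarize(clips, min_snr_db=15.0):
    """Count clips per QA status, e.g. for a collection dashboard."""
    return Counter(qa_status(c, min_snr_db) for c in clips)

clips = [
    Clip("c1", "Dholuo", snr_db=8.0),
    Clip("c2", "Kikuyu", snr_db=22.0),
    Clip("c3", "Somali", snr_db=30.0, reviewer_approved=True),
    Clip("c4", "Maasai", snr_db=30.0, reviewer_approved=False),
]
for status, count in summarize(clips).items():
    print(status.value, count)
```

Running the cheap automated layer first keeps reviewer time for clips that are at least acoustically usable, which matters when contributors record on varied devices and networks.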

Topics

AfriVoices-KE
Multilingual Speech Dataset
Kenyan Languages
Automatic Speech Recognition
Text-to-Speech

Best for: AI Scientist, NLP Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.