WAXAL: A large-scale open resource for African language speech technology
Summary
WAXAL is a new, large-scale, open-access speech dataset released by Google Research on March 6, 2026, designed to support African language speech technology. It covers 27 Sub-Saharan African languages spoken by over 100 million people across more than 26 countries. The dataset includes approximately 1,846 hours of transcribed natural speech for Automatic Speech Recognition (ASR) and over 565 hours of high-fidelity recordings for Text-to-Speech (TTS). Released under a Creative Commons CC-BY-4.0 license, WAXAL aims to bridge the digital divide by providing crucial data for low-resource languages, enabling the development of robust, inclusive voice-enabled technologies tailored to Africa's linguistic diversity. The project involved multi-year collaboration with African academic and community organizations.
Key takeaway
For AI and NLP engineers building speech technology for linguistically diverse populations, WAXAL is a permissively licensed resource that directly addresses data scarcity. Teams can use it to build more accurate and inclusive ASR and TTS systems for 27 Sub-Saharan African languages, expanding their models' linguistic coverage and improving performance in low-resource settings.
Key insights
WAXAL provides a large, open-access dataset for 27 African languages, fostering inclusive speech technology development.
Principles
- Open access accelerates research.
- Community collaboration ensures relevance.
- Natural, unscripted speech data supports more robust ASR.
Method
WAXAL-ASR uses image-prompted elicitation for natural, unscripted speech. WAXAL-TTS employs collaborative script drafting and studio recordings for high-fidelity audio.
In practice
- Use WAXAL-ASR to train or fine-tune ASR models.
- Use WAXAL-TTS to build synthetic voices.
- Explore image-prompted elicitation when collecting new speech data.
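Before training on a subset of a multilingual corpus like this, it often helps to check how many hours each language contributes. The sketch below is illustrative only: the manifest rows, field names (`language`, `subset`, `duration_seconds`), and language codes are hypothetical placeholders, not the actual WAXAL schema or data.

```python
from collections import defaultdict

# Hypothetical manifest rows. Field names and values are made up for
# illustration and do not reflect the real WAXAL file layout.
MANIFEST = [
    {"language": "lang_a", "subset": "asr", "duration_seconds": 5400.0},
    {"language": "lang_a", "subset": "tts", "duration_seconds": 1800.0},
    {"language": "lang_b", "subset": "asr", "duration_seconds": 7200.0},
    {"language": "lang_b", "subset": "asr", "duration_seconds": 900.0},
]


def hours_by_language(rows, subset):
    """Sum recording time in hours per language for one subset."""
    totals = defaultdict(float)
    for row in rows:
        if row["subset"] == subset:
            totals[row["language"]] += row["duration_seconds"] / 3600.0
    return dict(totals)


asr_hours = hours_by_language(MANIFEST, "asr")
for language, hours in sorted(asr_hours.items()):
    print(f"{language}: {hours:.2f} h")
```

A per-language tally like this can show which languages may need extra augmentation or sampling weight before training.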
Topics
- African Language Speech Technology
- Automatic Speech Recognition
- Text-to-Speech
- Open-Access Datasets
- Low-Resource Languages
Best for: AI Engineer, NLP Engineer, AI Scientist, AI Researcher, Machine Learning Engineer, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published on "The latest research from Google".