Giving Voice to the Constitution: Low-Resource Text-to-Speech for Quechua and Spanish Using a Bilingual Legal Corpus
Summary
The paper presents a unified pipeline for synthesizing high-quality Quechua and Spanish speech of the Peruvian Constitution, built on three state-of-the-art text-to-speech (TTS) architectures: XTTS v2, F5-TTS, and DiFlow-TTS. The models are trained on independent Spanish and Quechua speech datasets of differing sizes and recording conditions, and they use bilingual and multilingual TTS capabilities to improve synthesis quality in both languages. Cross-lingual transfer addresses the scarcity of Quechua data while preserving naturalness in Spanish. The project releases trained checkpoints, inference code, and synthesized audio for every constitutional article, providing a reusable resource for speech technology in indigenous and multilingual contexts. The broader goal is inclusive TTS for political and legal content in low-resource settings.
Key takeaway
For research scientists developing speech technologies for indigenous languages, this work shows that intelligible, high-quality speech can be generated for a low-resource language such as Quechua by transferring knowledge from a high-resource language such as Spanish. Prioritize architectural design over model scale, and consider DiFlow-TTS, which offered the best trade-off between model size and synthesis quality in this study, especially when data scarcity is the main constraint.
Key insights
Cross-lingual transfer in TTS effectively mitigates data scarcity for low-resource languages like Quechua.
Principles
- Cross-lingual learning outperforms model scaling in low-resource TTS.
- Architectural design is critical for efficient prosodic transfer.
Method
The authors train XTTS v2, F5-TTS, and DiFlow-TTS on curated Quechua (40 hours) and Spanish (218 hours) corpora, applying duration-based filtering and morphological normalization to the Quechua data. They evaluate the systems with UTMOS (predicted naturalness), SIM-O (speaker similarity), WER (intelligibility), and RMSE_F0 and RMSE_E (pitch and energy error).
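Two of these metrics are simple enough to sketch directly. The code below is a minimal illustration, not the paper's evaluation code: WER is computed as word-level edit distance divided by the number of reference words, and RMSE_F0 / RMSE_E are computed as root-mean-square error between two contours. It assumes the reference and synthesized contours are already frame-aligned and of equal length (real evaluations often align them first, for example with DTW), and the `voiced_only` option reflects the common convention of scoring F0 only on frames where both contours are voiced. UTMOS and SIM-O require pretrained models and are not sketched.

```python
import math
from collections.abc import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the first i-1 reference words and first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[len(hyp)] / len(ref)


def contour_rmse(
    reference: Sequence[float],
    synthesized: Sequence[float],
    voiced_only: bool = False,
) -> float:
    """RMSE between two frame-aligned contours (e.g. F0 in Hz, or frame energy)."""
    if len(reference) != len(synthesized):
        raise ValueError("contours must be frame-aligned and of equal length")
    pairs = list(zip(reference, synthesized))
    if voiced_only:
        # Assumed convention: F0 == 0 marks an unvoiced frame; skip those frames
        pairs = [(r, s) for r, s in pairs if r > 0 and s > 0]
    if not pairs:
        raise ValueError("no frames left to compare")
    return math.sqrt(sum((r - s) ** 2 for r, s in pairs) / len(pairs))
```

For example, `word_error_rate("el estado es unitario", "el estado es unitaria")` counts one substitution over four reference words, giving 0.25. For RMSE_F0 one would pass pitch contours with `voiced_only=True`; for RMSE_E, energy contours with the default.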
In practice
- Consider DiFlow-TTS when both model size and synthesis quality matter in low-resource TTS.
- Employ bilingual training for data-scarce languages.
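The data-preparation side of bilingual training can be pictured as building one language-tagged manifest from both corpora after duration-based filtering. The sketch below is hypothetical: the `Clip` record, the 1 to 15 second bounds, and the `quechua_repeat` oversampling knob are illustrative assumptions, not values or options reported by the paper. The language tags use ISO 639-1 codes ("qu" for Quechua, "es" for Spanish) so that a multilingual TTS model could be conditioned on language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    path: str
    text: str
    duration_s: float
    language: str  # ISO 639-1 tag: "qu" (Quechua) or "es" (Spanish)


# Hypothetical bounds; the paper's actual filtering thresholds are not stated here.
MIN_DURATION_S = 1.0
MAX_DURATION_S = 15.0


def keep_clip(clip: Clip, min_s: float = MIN_DURATION_S, max_s: float = MAX_DURATION_S) -> bool:
    """Duration-based filter: drop clips outside [min_s, max_s] or with empty transcripts."""
    return min_s <= clip.duration_s <= max_s and bool(clip.text.strip())


def build_bilingual_manifest(
    quechua: list[Clip], spanish: list[Clip], quechua_repeat: int = 1
) -> list[Clip]:
    """Merge both corpora into one language-tagged training list after filtering.

    quechua_repeat is a hypothetical oversampling factor for the smaller
    Quechua corpus (about 40 h versus about 218 h of Spanish).
    """
    if quechua_repeat < 1:
        raise ValueError("quechua_repeat must be at least 1")
    kept_qu = [c for c in quechua if keep_clip(c)]
    kept_es = [c for c in spanish if keep_clip(c)]
    return kept_qu * quechua_repeat + kept_es


def hours_per_language(manifest: list[Clip]) -> dict[str, float]:
    """Total audio hours per language tag, useful for checking the balance."""
    totals: dict[str, float] = {}
    for clip in manifest:
        totals[clip.language] = totals.get(clip.language, 0.0) + clip.duration_s / 3600
    return totals
```

Oversampling by repetition is only one way to rebalance the corpora; weighted sampling during training is a common alternative.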
Topics
- Low-Resource Text-to-Speech
- Quechua Language
- Spanish Language
- Peruvian Constitution
- Cross-lingual Transfer
Best for: Research Scientist, AI Scientist, NLP Engineer, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.