Giving Voice to the Constitution: Low-Resource Text-to-Speech for Quechua and Spanish Using a Bilingual Legal Corpus

2026-04-16 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Speech Technology · Depth: Expert, long

Summary

A unified pipeline has been developed to synthesize high-quality Quechua and Spanish speech for the Peruvian Constitution, utilizing three state-of-the-art text-to-speech (TTS) architectures: XTTS v2, F5-TTS, and DiFlow-TTS. The models were trained on independent Spanish and Quechua speech datasets of varying sizes and recording conditions, leveraging bilingual and multilingual TTS capabilities to enhance synthesis quality in both languages. This framework addresses data scarcity in Quechua through cross-lingual transfer while maintaining naturalness in Spanish. The project releases trained checkpoints, inference code, and synthesized audio for each constitutional article, providing a reusable resource for speech technologies in indigenous and multilingual contexts. This initiative aims to develop inclusive TTS systems for political and legal content in low-resource settings.

Key takeaway

For research scientists developing speech technologies for indigenous languages, this work demonstrates that high-quality, intelligible speech can be generated for low-resource languages like Quechua by leveraging cross-lingual transfer from high-resource languages such as Spanish. You should prioritize architectural design over model scale and consider DiFlow-TTS for its superior performance in balancing model size and synthesis quality, especially when data scarcity is a primary concern.

Key insights

Cross-lingual transfer in TTS effectively mitigates data scarcity for low-resource languages like Quechua.

Principles

Cross-lingual learning outperforms model scaling in low-resource TTS.
Architectural design is critical for efficient prosodic transfer.

Method

XTTS v2, F5-TTS, and DiFlow-TTS are trained on curated Quechua (40 hours) and Spanish (218 hours) corpora, with duration-based filtering and morphological normalization applied for Quechua. Evaluation uses UTMOS (predicted naturalness), SIM-O (speaker similarity), WER (word error rate, a proxy for intelligibility), and the root-mean-square error of pitch and energy (RMSE F0 and RMSE E).
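The duration filter and two of the objective metrics named above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' pipeline: the 1 to 15 second bounds, the `duration_s` field, and the function names are all hypothetical, and the pitch error assumes the two F0 contours are already frame-aligned, with unvoiced frames marked as 0.

```python
import math
from typing import Dict, List, Sequence


def filter_by_duration(clips: List[Dict], min_s: float = 1.0,
                       max_s: float = 15.0) -> List[Dict]:
    """Keep clips whose duration lies in [min_s, max_s] seconds.

    The bounds are illustrative assumptions, not values from the paper.
    """
    return [c for c in clips if min_s <= c["duration_s"] <= max_s]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def rmse_voiced(ref_f0: Sequence[float], syn_f0: Sequence[float]) -> float:
    """RMSE over frames that are voiced (F0 > 0) in both contours.

    Assumes the contours are already time-aligned frame by frame.
    """
    pairs = [(a, b) for a, b in zip(ref_f0, syn_f0) if a > 0 and b > 0]
    if not pairs:
        raise ValueError("no frames are voiced in both contours")
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs) / len(pairs))
```

In a real evaluation the hypothesis text would come from an ASR system run on the synthesized audio, and the F0 contours from a pitch tracker. Both steps are outside this sketch.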

In practice

Use DiFlow-TTS for optimal quality in low-resource TTS.
Employ bilingual training for data-scarce languages.
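One common way to keep the smaller language from being swamped in bilingual training is to sample batches by language with exponent-smoothed probabilities. The summary does not say whether the authors balance the corpora this way, so the sketch below is an assumption on our part. The function name and the default `alpha` are hypothetical; only the 40 and 218 hour figures come from the source.

```python
from typing import Dict


def sampling_probs(hours: Dict[str, float], alpha: float = 0.5) -> Dict[str, float]:
    """Return per-language sampling probabilities p_i proportional to q_i ** alpha.

    q_i is the language's share of total hours. alpha = 1 reproduces the raw
    data proportions; alpha = 0 samples every language uniformly.
    """
    total = sum(hours.values())
    scaled = {lang: (h / total) ** alpha for lang, h in hours.items()}
    norm = sum(scaled.values())
    return {lang: s / norm for lang, s in scaled.items()}


# Corpus sizes reported in the summary (hours of speech).
probs = sampling_probs({"quechua": 40.0, "spanish": 218.0})
```

With `alpha = 0.5`, Quechua's share of sampled batches rises from about 0.16 (its share of the raw hours) to about 0.30. The cost is that Quechua utterances are repeated more often per epoch.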

Topics

Low-Resource Text-to-Speech
Quechua Language
Spanish Language
Peruvian Constitution
Cross-lingual Transfer

Best for: Research Scientist, AI Scientist, NLP Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.