Certas Palavras: A 1980s-90s Brazilian Radio Corpus to Test TTS Models in Noisy Multi-Speaker Dialogues

· Source: Paper Index on ACL Anthology · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, quick

Summary

Certas Palavras (CP) is a new Brazilian Portuguese speech corpus comprising 70 hours of spontaneous, multi-speaker radio dialogues recorded in the 1980s and 1990s. The dataset addresses the scarcity of real-world, noisy speech data for training robust text-to-speech (TTS) systems; existing Brazilian Portuguese datasets often feature clean, scripted, or studio-recorded audio. CP includes extensive manual annotations of conversational dynamics, such as orality markers, filled pauses, and hesitations, alongside non-verbal phenomena like musical interference, noise, and segmental corrections inherent to its analog source. Baseline YourTTS and F5-TTS models trained on a 9-hour single-speaker subset produced intelligible synthesized speech with moderate Word Error Rate (WER) and Character Error Rate (CER). Subjective evaluations, however, showed a significant gap in naturalness: synthesized speech received lower Mean Opinion Scores (MOS) and showed higher inter-rater variability than ground-truth audio, which positions CP as a challenging benchmark for TTS robustness.
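For readers less familiar with the intelligibility metrics named above, the following is a minimal sketch of how WER and CER are commonly computed: the Levenshtein edit distance between a reference transcript and a hypothesis (for TTS, usually an ASR transcript of the synthesized audio), divided by the reference length in words or characters. The function names and the example sentences are illustrative assumptions and are not taken from the paper, which may apply different text normalization or tooling.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Illustrative sentences only (not drawn from the corpus).
reference = "bom dia a todos os ouvintes"
hypothesis = "bom dia todos os ouvinte"
print(f"WER = {wer(reference, hypothesis):.3f}")
print(f"CER = {cer(reference, hypothesis):.3f}")
```

In this toy pair, one word is dropped and one is altered, so the word-level score is much higher than the character-level one; this is why the two metrics are usually reported together.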

Key takeaway

For research scientists developing robust text-to-speech systems, Certas Palavras provides a critical benchmark for evaluating model performance on real-world, noisy, spontaneous dialogue. Consider integrating this dataset into your training and evaluation pipelines to assess how well your models generalize beyond clean, scripted speech, particularly for Brazilian Portuguese applications. Doing so can show where naturalness and robustness need improvement under challenging acoustic conditions.
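When adding a subjective evaluation to such a pipeline, naturalness is typically summarized as a Mean Opinion Score together with a measure of how much listeners disagree. The sketch below assumes ratings on a 1 to 5 scale, uses the sample standard deviation as a simple proxy for inter-rater variability, and uses a normal-approximation 95% interval; the function name `mos_summary` and the ratings are hypothetical and do not reproduce the paper's protocol or results.

```python
import math
import statistics


def mos_summary(ratings):
    """Summarize listener ratings: mean (MOS), sample std, and an approximate 95% CI."""
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two ratings to estimate variability")
    mean = statistics.mean(ratings)
    std = statistics.stdev(ratings)           # sample standard deviation
    half_width = 1.96 * std / math.sqrt(n)    # normal approximation (assumption)
    return {"mos": mean, "std": std, "ci95": (mean - half_width, mean + half_width)}


# Made-up ratings for illustration only.
ground_truth_ratings = [5, 4, 5, 4, 5, 4]
synthesized_ratings = [4, 2, 3, 1, 4, 2]

for name, ratings in [("ground truth", ground_truth_ratings),
                      ("synthesized", synthesized_ratings)]:
    s = mos_summary(ratings)
    print(f"{name}: MOS={s['mos']:.2f} std={s['std']:.2f} "
          f"CI95=({s['ci95'][0]:.2f}, {s['ci95'][1]:.2f})")
```

A lower mean combined with a wider spread for synthesized audio is the pattern the summary describes; with few raters, a t-based interval or a bootstrap would be a more careful choice than the normal approximation used here.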

Key insights

Certas Palavras offers a challenging Brazilian Portuguese dataset for robust TTS training in noisy, spontaneous environments.

Method

The dataset was created by manually annotating 70 hours of 1980s-90s Brazilian radio dialogues for conversational dynamics and non-verbal phenomena like noise and musical interference.

Topics

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.