Multilingual Emotional Voice Synthesis for Brazilian Portuguese: A Comparative Analysis of Fine-Tuning Approaches
Summary
A study investigated multilingual emotional voice synthesis for Brazilian Portuguese, a field with limited prior research. Researchers compared five approaches for adding emotional control to Portuguese-English multilingual synthesis: the base YourTTS model, fine-tuning with emotional data, conditioning through textual tokens, and two configurations of the VECL-TTS architecture that use emotional embeddings. Fine-tuning started from a pre-trained YourTTS model and used 14.4 hours of emotional speech data, including English (RAVDESS, Emotional Speech Dataset) and Brazilian Portuguese (VERBO) corpora. Evaluation combined objective metrics, such as emotional and speaker embedding similarity, with subjective assessments from ten participants. The findings indicate that simpler methods can achieve perceptual quality on par with or better than more intricate architectures.
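The embedding-similarity metrics mentioned above can be illustrated with a short sketch. It assumes the similarity is a cosine similarity between a reference embedding and the embedding of the synthesized audio, which is a common choice but is not confirmed by this summary. The vectors are toy values standing in for encoder outputs, and the function name is hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for emotion-encoder outputs (not real data).
reference_emotion = [0.2, 0.7, 0.1]
synthesized_emotion = [0.25, 0.65, 0.05]

score = cosine_similarity(reference_emotion, synthesized_emotion)
print(f"emotion embedding similarity: {score:.3f}")
```

The same function would apply to speaker embeddings. A score near 1 means the two embeddings point in nearly the same direction, which does not by itself guarantee that listeners perceive the same emotion or speaker.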
Key takeaway
Research scientists building multilingual emotional voice synthesis systems for under-resourced languages such as Brazilian Portuguese should start by fine-tuning a pre-trained model. In this study, simpler fine-tuning approaches matched or exceeded the perceptual quality of more complex architectures, which makes them an efficient route to usable emotional transfer when resources are limited. Account for the trade-off between emotional control and preservation of vocal identity when designing the model.
Key insights
Simpler fine-tuning methods can achieve high-quality multilingual emotional voice synthesis for Brazilian Portuguese.
Principles
- Emotional control competes with vocal identity.
- Objective metrics may diverge from human perception.
Method
A pre-trained YourTTS model is fine-tuned on combined English and Brazilian Portuguese emotional datasets (RAVDESS, Emotional Speech Dataset, VERBO) to enable multilingual emotional voice synthesis.
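One way to picture the combined training set is a single manifest in which every utterance carries a language and an emotion label. The sketch below is only an illustration under that assumption: the `Utterance` fields, the file paths, the label names, and the durations are all hypothetical placeholders, not the study's actual data layout.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Utterance:
    path: str          # hypothetical audio file path
    text: str
    language: str      # e.g. "en" or "pt-br"
    emotion: str       # hypothetical label name
    duration_s: float


def merge_corpora(*corpora: Iterable[Utterance]) -> List[Utterance]:
    """Concatenate several corpora into one training manifest."""
    return [utt for corpus in corpora for utt in corpus]


def hours_by_language(utterances: Iterable[Utterance]) -> Dict[str, float]:
    """Sum audio duration per language, in hours."""
    totals: Dict[str, float] = defaultdict(float)
    for utt in utterances:
        totals[utt.language] += utt.duration_s / 3600.0
    return dict(totals)


# Placeholder entries only; real corpora hold many short clips.
english = [Utterance("en/0001.wav", "Hello there.", "en", "happy", 3600.0)]
portuguese = [Utterance("pt/0001.wav", "Bom dia.", "pt-br", "sad", 1800.0)]

manifest = merge_corpora(english, portuguese)
print(hours_by_language(manifest))
```

Tracking hours per language in this way makes any imbalance between the English and Brazilian Portuguese portions visible before fine-tuning begins.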
In practice
- Use fine-tuned YourTTS when overall quality is the priority.
- Use textual token conditioning when emotional similarity is the priority.
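Textual token conditioning can be sketched as prepending an emotion marker to the input text before it reaches the synthesis model. The token format (`<happy>`), the label set, and the function name below are assumptions made for illustration. The summary does not specify how the study encoded its tokens.

```python
from typing import Dict

# Hypothetical emotion labels and token spellings.
EMOTION_TOKENS: Dict[str, str] = {
    "neutral": "<neutral>",
    "happy": "<happy>",
    "sad": "<sad>",
    "angry": "<angry>",
}


def add_emotion_token(text: str, emotion: str) -> str:
    """Prefix the input text with the token for the requested emotion."""
    try:
        token = EMOTION_TOKENS[emotion]
    except KeyError:
        raise ValueError(f"unknown emotion label: {emotion!r}") from None
    return f"{token} {text.strip()}"


conditioned = add_emotion_token("Bom dia, tudo bem?", "happy")
print(conditioned)
```

In a setup like this, the tokens would also need to be added to the model's text vocabulary so that each one is treated as a single symbol during fine-tuning rather than spelled out character by character.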
Topics
- Multilingual Voice Synthesis
- Emotional Speech Synthesis
- Brazilian Portuguese
- YourTTS
- Fine-tuning
Best for: Research Scientist, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.