Multilingual Emotional Voice Synthesis for Brazilian Portuguese: A Comparative Analysis of Fine-Tuning Approaches

2026-04-12 · Source: Paper Index on ACL Anthology · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A study investigated multilingual emotional voice synthesis for Brazilian Portuguese, a field with limited prior research. The researchers compared five approaches to integrating emotional control into Portuguese-English multilingual synthesis: the base YourTTS model, fine-tuning with emotional data, conditioning through textual tokens, and two configurations of the VECL-TTS architecture that use emotional embeddings. Fine-tuning started from a pre-trained YourTTS model and used 14.4 hours of emotional speech data, including English (RAVDESS, Emotional Speech Dataset) and Brazilian Portuguese (VERBO) datasets. Evaluation combined objective metrics, such as emotional and speaker embedding similarity, with subjective assessments from ten participants. The findings indicate that simpler methods can achieve perceptual quality on par with or better than architecturally more complex ones.
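To make the objective metrics concrete, the following is a minimal sketch of an embedding similarity score. It assumes cosine similarity, which is a common choice for comparing speaker or emotion embeddings, but the summary does not confirm which measure the study used. The function name and the toy vectors are hypothetical; in a real evaluation the embeddings would come from pretrained speaker and emotion encoders.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors standing in for encoder outputs (assumption).
reference_speaker = [0.2, 0.9, 0.1]
synthesized_speaker = [0.25, 0.85, 0.05]

# Values closer to 1.0 mean the synthesized voice is closer to the reference.
print(cosine_similarity(reference_speaker, synthesized_speaker))
```

The same function could score emotional similarity by swapping in emotion embeddings, which would let the two metrics be reported side by side.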

Key takeaway

Research scientists developing multilingual emotional voice synthesis systems for under-resourced languages such as Brazilian Portuguese should prioritize fine-tuning pre-trained models. Simpler fine-tuning approaches can yield perceptual quality comparable or superior to that of more complex architectural methods, offering an efficient path to viable emotional transfer with limited resources. Model design should also weigh the trade-off between emotional control and preservation of vocal identity.

Key insights

Simpler fine-tuning methods can achieve high-quality multilingual emotional voice synthesis for Brazilian Portuguese.

Principles

Emotional control competes with vocal identity.
Objective metrics may diverge from human perception.

Method

Fine-tuning a pre-trained YourTTS model with combined English and Brazilian Portuguese emotional datasets (RAVDESS, Emotional Speech Dataset, VERBO) to enable multilingual emotional voice synthesis.

In practice

Use YourTTS with fine-tuning for overall quality.
Employ textual token conditioning for emotional similarity.
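As a rough illustration of textual token conditioning, the sketch below prepends an emotion tag to the input text before it reaches the synthesizer. The tag format (`<happy>`), the emotion label set, and the function name are all assumptions made for illustration; the summary does not specify how the study encoded its tokens or which emotions it covered.

```python
# Hypothetical emotion inventory; the study's actual label set is not given here.
EMOTIONS = ("neutral", "happy", "sad", "angry")


def add_emotion_token(text: str, emotion: str) -> str:
    """Prefix the text with an emotion tag such as '<happy>' (assumed format)."""
    label = emotion.strip().lower()
    if label not in EMOTIONS:
        raise ValueError(f"unsupported emotion: {emotion!r}")
    return f"<{label}> {text.strip()}"


# The tagged string would then be passed to the TTS front end in place of the raw text.
print(add_emotion_token("Bom dia, tudo bem?", "happy"))
```

With this style of conditioning, the model architecture stays unchanged and the emotion signal travels through the text input, which is part of what makes the approach simpler than adding a dedicated emotion embedding pathway.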

Topics

Multilingual Voice Synthesis
Emotional Speech Synthesis
Brazilian Portuguese
YourTTS
Fine-tuning

Best for: Research Scientist, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.