Multilingual Multi-Speaker Unit Vocoders: A Systematic Analysis of Discrete Speech Representations
Summary
This systematic analysis of multilingual multi-speaker unit vocoders, built on a BigVGAN-based architecture, investigates the impact of discrete speech representations across four Indian languages: Bengali, Hindi, Tamil, and Telugu. The study explores how cluster size (the number of discrete units, ranging from 500 to 10k) and conditioning strategy (speaker, language, or combined) influence speech generation quality. Findings indicate that cluster size primarily determines intelligibility: larger unit inventories improve phonetic discriminability and reduce Word Error Rate (WER). Explicit speaker conditioning with ECAPA-TDNN embeddings is indispensable for preserving speaker identity, increasing speaker similarity by 4-5x. Language supervision, via an auxiliary Language Identification (LID) objective, offers gains mainly at smaller cluster sizes, where units are ambiguous; its benefit diminishes with larger, more discriminative unit inventories. Smaller inventories also show significant cross-lingual phoneme sharing, which larger inventories progressively separate.
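To make "cluster size" concrete, here is a minimal sketch of how continuous frame features could be turned into discrete units by nearest-centroid assignment. It assumes the centroids come from a clustering step such as k-means. The function names, the 2-dimensional toy features, and the three centroids are hypothetical and are not taken from the paper. A real inventory would hold 500 to 10k centroids over much higher-dimensional features.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Sequence[float]],
             centroids: Sequence[Sequence[float]]) -> List[int]:
    """Map each frame to the index of its nearest centroid (its unit ID)."""
    units = []
    for frame in frames:
        distances = [squared_distance(frame, c) for c in centroids]
        units.append(distances.index(min(distances)))
    return units


# Toy data (hypothetical): 3 centroids, 4 frames of 2-dimensional features.
centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [3.5, 4.2], [0.4, 0.4]]
units = quantize(frames, centroids)
print(units)  # [0, 1, 2, 0]
```

With more centroids, each unit covers a smaller region of feature space, which is one way to read the reported gain in phonetic discriminability at larger cluster sizes.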
Key takeaway
For machine learning engineers developing multilingual speech synthesis or audio LLM systems: prioritize explicit speaker conditioning using continuous embeddings such as ECAPA-TDNN to prevent speaker identity collapse. Select the unit cluster size carefully as well: larger inventories (e.g., 10k) improve phonetic resolution and intelligibility, while language supervision is most beneficial for smaller, more ambiguous unit sets. Together, these choices support speaker preservation and linguistic clarity across languages.
Key insights
Unit vocoder performance in multilingual settings hinges on cluster size for intelligibility and explicit speaker conditioning for identity preservation.
Principles
- Larger unit inventories (more clusters) enhance phonetic discriminability.
- Explicit speaker conditioning prevents identity collapse.
- Cross-lingual phoneme sharing decreases as the number of clusters grows.
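One rough way to quantify the cross-lingual sharing mentioned in the last principle is to count how many units are used by more than one language. The sketch below is an illustrative proxy only. The function name `shared_unit_fraction`, the language codes, and the toy unit sequences are all hypothetical, and this is not claimed to be the measure used in the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def shared_unit_fraction(units_by_language: Dict[str, Iterable[int]]) -> float:
    """Fraction of observed units that occur in more than one language."""
    languages_per_unit: Dict[int, Set[str]] = defaultdict(set)
    for language, units in units_by_language.items():
        for unit in units:
            languages_per_unit[unit].add(language)
    if not languages_per_unit:
        return 0.0
    shared = sum(1 for langs in languages_per_unit.values() if len(langs) > 1)
    return shared / len(languages_per_unit)


# Toy unit sequences (hypothetical), keyed by language code.
small_inventory = {"hi": [0, 1, 2, 1], "ta": [1, 2, 3]}
large_inventory = {"hi": [0, 1, 2], "ta": [3, 4, 5]}

print(shared_unit_fraction(small_inventory))  # 0.5
print(shared_unit_fraction(large_inventory))  # 0.0
```

In this toy setup the small inventory reuses units 1 and 2 across both languages, while the large inventory keeps them apart, mirroring the trend described above.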
Method
The study extends BigVGAN to take discrete units as input, with optional ECAPA-TDNN speaker embeddings and language embeddings supervised by an auxiliary LID classifier. The model is trained with adversarial, feature matching, and ℓ₁ mel spectrogram losses.
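The training objective named above can be sketched as a weighted sum of its terms. The summary does not give the loss weights or say how the LID term enters the objective, so the weights below (`w_fm`, `w_mel`, `w_lid`) are placeholders and adding the LID loss to the generator objective is an assumption. The adversarial and feature matching values are passed in as precomputed numbers because they depend on discriminators not shown here.

```python
from typing import Sequence


def l1_mel_loss(pred_mel: Sequence[Sequence[float]],
                target_mel: Sequence[Sequence[float]]) -> float:
    """Mean absolute error between two mel spectrograms of equal shape."""
    total = 0.0
    count = 0
    for pred_frame, target_frame in zip(pred_mel, target_mel):
        for p, t in zip(pred_frame, target_frame):
            total += abs(p - t)
            count += 1
    if count == 0:
        raise ValueError("mel spectrograms must be non-empty")
    return total / count


def generator_loss(adv: float, fm: float, mel: float, lid: float = 0.0, *,
                   w_fm: float = 2.0, w_mel: float = 45.0,
                   w_lid: float = 1.0) -> float:
    """Weighted sum of the loss terms; all weights are illustrative placeholders."""
    return adv + w_fm * fm + w_mel * mel + w_lid * lid


# Toy spectrograms (hypothetical): 2 frames x 2 mel bins.
pred = [[0.0, 1.0], [2.0, 3.0]]
target = [[0.5, 1.0], [2.0, 2.0]]
mel = l1_mel_loss(pred, target)  # 0.375
total = generator_loss(adv=1.0, fm=0.5, mel=mel, lid=0.2)
print(mel, total)
```

Setting `lid=0.0` corresponds to a configuration without language supervision, which makes it easy to compare the conditioning variants the study describes.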
In practice
- Use ECAPA-TDNN for robust speaker conditioning.
- Increase cluster size for better intelligibility.
- Apply language supervision for ambiguous units.
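When comparing configurations along these lines, the two quantities to track are intelligibility (WER) and speaker similarity. Below is a minimal sketch of both. It assumes WER is computed from word-level edit distance against a reference transcript, and that speaker similarity is the cosine similarity between two speaker embedding vectors. The summary does not specify the similarity measure, so cosine is an assumption. The transcripts and vectors are toy values.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# One substitution and one deletion against a 6-word reference.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
similarity = cosine_similarity([1.0, 0.0], [1.0, 0.0])
print(wer, similarity)
```

In a real evaluation the hypothesis would come from an ASR system run on the generated audio, and the embeddings from a speaker encoder applied to the reference and generated speech.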
Topics
- Multilingual Speech Synthesis
- Unit Vocoders
- BigVGAN
- Discrete Speech Representations
- Speaker Conditioning
- Language Identification
- k-means Clustering
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.