Multilingual Multi-Speaker Unit Vocoders: A Systematic Analysis of Discrete Speech Representations

2026-06-08 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Speech Technology · Depth: Expert, long

Summary

A systematic analysis of multilingual multi-speaker unit vocoders, built on a BigVGAN-based architecture, investigates the impact of discrete speech representations across four Indian languages: Bengali, Hindi, Tamil, and Telugu. The study explores how cluster size, meaning the number of k-means clusters in the unit inventory (500 to 10k), and conditioning strategy (speaker, language, or combined) influence speech generation quality. Findings indicate that cluster size primarily determines intelligibility: larger inventories improve phonetic discriminability and reduce Word Error Rate (WER). Explicit speaker conditioning, using ECAPA-TDNN embeddings, is indispensable for preserving speaker identity, increasing speaker similarity by 4-5x. Language supervision, via an auxiliary Language Identification (LID) objective, offers gains mainly at smaller inventory sizes where units are ambiguous, and its effectiveness diminishes with larger, more discriminative unit inventories. Smaller inventories also show significant cross-lingual phoneme sharing, which larger inventories progressively separate.
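
Word Error Rate, the intelligibility measure cited in the summary, is conventionally the word-level edit distance between a reference transcript and the recognizer's transcript of the synthesized audio, divided by the reference length. The sketch below is a minimal illustration of that standard definition. The function name and the whitespace tokenization are assumptions made here, not details from the paper, which may use a different ASR system and text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared verbatim;
    no casing or punctuation normalization is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.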

Key takeaway

Machine Learning Engineers building multilingual speech synthesis or Audio LLM systems should prioritize explicit speaker conditioning with continuous embeddings such as ECAPA-TDNN to prevent speaker identity collapse. Unit inventory size also needs deliberate selection: larger inventories (e.g., 10k clusters) improve phonetic resolution and intelligibility, while language supervision helps most with smaller, more ambiguous unit sets. Together, these choices support speaker preservation and linguistic clarity across languages.

Key insights

Unit vocoder performance in multilingual settings hinges on cluster size for intelligibility and explicit speaker conditioning for identity preservation.

Principles

Larger unit inventories (more clusters) enhance phonetic discriminability.
Explicit speaker conditioning prevents identity collapse.
Cross-lingual phoneme sharing decreases as the number of clusters grows.
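
One simple way to make cross-lingual sharing concrete is to count how many discrete units are used by more than one language. The sketch below is a hypothetical measure written for illustration. The source does not say how sharing was quantified, and the function name and input layout are assumptions.

```python
from collections import defaultdict


def shared_unit_fraction(units_by_language):
    """Fraction of used units that occur in more than one language.

    units_by_language maps a language code to an iterable of unit IDs
    observed in that language's data (hypothetical input format).
    """
    languages_per_unit = defaultdict(set)
    for language, unit_ids in units_by_language.items():
        for unit in unit_ids:
            languages_per_unit[unit].add(language)

    if not languages_per_unit:
        return 0.0
    shared = sum(1 for langs in languages_per_unit.values() if len(langs) > 1)
    return shared / len(languages_per_unit)
```

Under this measure, the trend described above would show up as a value that falls as the inventory grows from 500 toward 10k units.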

Method

Extend BigVGAN with discrete unit input, optional ECAPA-TDNN speaker embeddings, and language embeddings with an auxiliary LID classifier, trained with adversarial, feature matching, and ℓ₁ mel spectrogram losses.
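
The training objective described above can be written as a weighted sum of its parts. The formulation below is a sketch under stated assumptions: the weights \lambda are unspecified placeholders, the LID term is assumed to be a cross-entropy loss, and the symbols u (unit sequence), s (speaker embedding), l (language embedding), and \phi (mel spectrogram transform) are notation introduced here rather than taken from the paper.

```latex
\mathcal{L}_{G}
  = \mathcal{L}_{\mathrm{adv}}(G)
  + \lambda_{\mathrm{fm}} \, \mathcal{L}_{\mathrm{fm}}(G)
  + \lambda_{\mathrm{mel}} \, \bigl\lVert \phi(x) - \phi\bigl(G(u, s, l)\bigr) \bigr\rVert_{1}
  + \lambda_{\mathrm{lid}} \, \mathcal{L}_{\mathrm{lid}}
```

Here x is the ground-truth waveform and G(u, s, l) is the waveform generated from the units and the optional conditioning inputs. When language conditioning is disabled, the last term would be dropped.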

In practice

Use ECAPA-TDNN for robust speaker conditioning.
Increase the number of clusters for better intelligibility.
Apply language supervision for ambiguous units.
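
Speaker similarity between a reference recording and synthesized speech is commonly scored as the cosine similarity of their speaker embeddings. The sketch below assumes that metric. The source reports speaker similarity gains but does not state how the score was computed, and the toy vectors in the usage note stand in for real ECAPA-TDNN embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

For example, `cosine_similarity([1.0, 0.0], [0.0, 1.0])` is 0.0 for orthogonal vectors, while two identical embeddings score 1.0. Averaging this score over an evaluation set gives a single speaker similarity figure per system.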

Topics

Multilingual Speech Synthesis
Unit Vocoders
BigVGAN
Discrete Speech Representations
Speaker Conditioning
Language Identification
k-means Clustering

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.