Closing the Quality Gap in Low-Resource Text-to-Speech: LoRA Fine-Tuning of VoxCPM2 for Khmer and Korean
Summary
Researchers investigated the quality gap in text-to-speech (TTS) for low-resource languages like Khmer and Korean, using VoxCPM2, a 2.4B-parameter, tokenizer-free model. They applied a single Low-Rank Adaptation (LoRA) adapter, trained on a shared 26-hour language-tagged corpus, to both the MiniCPM-4 language model backbone and the flow-matching diffusion decoder. Across the configurations tested, the adapter trained only 0.19% to 3.03% of the model's parameters. For Khmer, native-speaker listening tests showed a significant Mean Opinion Score (MOS) increase from 3.85 to 4.23 with the best adapter, at rank 64. The same adapter yielded no quality gain for Korean, a language the base model already handled well, and even degraded it at higher ranks. This suggests LoRA adaptation is most effective where the base model is genuinely weak. Automatic loss and human ratings also disagreed: loss was lowest at rank 128, while MOS peaked at rank 64.
Key takeaway
For Machine Learning Engineers developing text-to-speech systems for low-resource languages, consider LoRA fine-tuning on models like VoxCPM2 to bridge quality gaps. Prioritize languages where the base model performs poorly: for languages it already handles well, adaptation brings little benefit and can even degrade quality. Crucially, rely on human evaluation metrics like Mean Opinion Score (MOS) to choose the LoRA rank, as automatic loss may not align with perceived quality.
Key insights
LoRA fine-tuning significantly improved Khmer TTS quality, but its effectiveness depends on how weak the base model is in the target language.
Principles
- LoRA adaptation targets base model weaknesses.
- Human ratings can diverge from automatic loss.
- Zero-initializing the LoRA update preserves the base model's behavior at the start of training.
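
The zero-initialization principle can be illustrated with a minimal sketch of a single LoRA-adapted linear layer. It is not VoxCPM2's code: the layer sizes, the `alpha` scaling value, and the helper names `matmul` and `lora_forward` are assumptions made for illustration. The sketch follows the standard LoRA recipe, where the down-projection `a` is randomly initialized and the up-projection `b` starts at zero, so the low-rank update contributes nothing until training changes `b`.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, w, a, b, alpha, rank):
    """Compute x @ W + (alpha / rank) * (x @ A @ B) with W kept frozen."""
    base = matmul(x, w)
    update = matmul(matmul(x, a), b)
    scale = alpha / rank
    return [
        [bv + scale * uv for bv, uv in zip(base_row, update_row)]
        for base_row, update_row in zip(base, update)
    ]

random.seed(0)
d_in, d_out, rank = 8, 6, 2  # toy sizes, not the real model's dimensions

# Frozen base weight, standing in for one pretrained projection.
w = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]
# Down-projection: small random init.
a = [[random.gauss(0, 0.01) for _ in range(rank)] for _ in range(d_in)]
# Up-projection: zero init, so the adapter starts as a no-op.
b = [[0.0] * d_out for _ in range(rank)]

x = [[random.gauss(0, 1) for _ in range(d_in)]]

base_out = matmul(x, w)
adapted_out = lora_forward(x, w, a, b, alpha=4.0, rank=rank)
print(adapted_out == base_out)  # the adapted layer matches the base layer before training
```

Because the update is exactly zero at initialization, the first training step starts from the base model's outputs rather than from a perturbed model.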
Method
Adapt VoxCPM2 with a single LoRA adapter, trained on a shared, language-tagged corpus and applied to both the language model backbone and the diffusion decoder.
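
One way to picture a shared, language-tagged corpus is the sketch below. The summary does not say how the tags are encoded, so the tag strings `<km>` and `<ko>`, the record fields, the audio paths, and the helper names `tag_example` and `build_shared_corpus` are all hypothetical. The point is only that examples from both languages land in one shuffled training set, each carrying an explicit language marker.

```python
import random

# Hypothetical tag strings; the actual tagging scheme is not given in the summary.
LANG_TAGS = {"khmer": "<km>", "korean": "<ko>"}

def tag_example(language, text, audio_path):
    """Prefix the transcript with its language tag and keep the audio reference."""
    return {
        "language": language,
        "text": f"{LANG_TAGS[language]} {text}",
        "audio": audio_path,
    }

def build_shared_corpus(per_language, seed=0):
    """Merge per-language (text, audio_path) pairs into one shuffled corpus."""
    corpus = [
        tag_example(language, text, audio_path)
        for language, items in per_language.items()
        for text, audio_path in items
    ]
    random.Random(seed).shuffle(corpus)  # interleave languages deterministically
    return corpus

# Placeholder transcripts and paths, for illustration only.
per_language = {
    "khmer": [("សួស្តី", "km/0001.wav"), ("អរគុណ", "km/0002.wav")],
    "korean": [("안녕하세요", "ko/0001.wav"), ("감사합니다", "ko/0002.wav")],
}

corpus = build_shared_corpus(per_language)
for example in corpus:
    print(example["text"], example["audio"])
```

A single adapter trained on such a corpus sees both languages in every epoch, with the tag available as a conditioning signal.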
In practice
- Use LoRA for low-resource TTS quality gaps.
- Evaluate LoRA ranks with human listening tests.
- Combine languages in a shared, language-tagged adaptation corpus.
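
Choosing a rank from listening tests rather than from loss can be sketched as follows. Every number in this example is a made-up placeholder, not data from the study: the ratings, the loss values, and the inclusion of rank 32 are chosen only to mirror the kind of disagreement the summary reports. The confidence interval uses a normal approximation, which is an assumption and is rough for small listener panels.

```python
from math import sqrt
from statistics import mean, stdev

def mos_with_ci(ratings, z=1.96):
    """Return the mean opinion score and the half-width of an approximate 95% interval."""
    half_width = z * stdev(ratings) / sqrt(len(ratings))
    return mean(ratings), half_width

def best_rank_by_mos(ratings_by_rank):
    """Pick the rank whose listener ratings have the highest mean."""
    return max(ratings_by_rank, key=lambda rank: mean(ratings_by_rank[rank]))

def best_rank_by_loss(loss_by_rank):
    """Pick the rank with the lowest automatic loss."""
    return min(loss_by_rank, key=loss_by_rank.get)

# Placeholder 1-5 ratings per rank (illustrative only, not the study's data).
ratings_by_rank = {
    32: [4, 4, 3, 4, 5, 4],
    64: [5, 4, 4, 5, 4, 4],
    128: [4, 3, 4, 4, 4, 3],
}
# Placeholder validation losses (illustrative only).
loss_by_rank = {32: 0.9, 64: 0.8, 128: 0.7}

for rank, ratings in ratings_by_rank.items():
    score, half_width = mos_with_ci(ratings)
    print(f"rank {rank}: MOS {score:.2f} +/- {half_width:.2f}")

print("best by MOS:", best_rank_by_mos(ratings_by_rank))
print("best by loss:", best_rank_by_loss(loss_by_rank))
```

When the two selections disagree, as in these placeholder values, the listening-test result is the one that reflects perceived quality.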
Topics
- Text-to-Speech
- Low-Resource NLP
- LoRA Fine-tuning
- VoxCPM2
- Khmer Language
- Mean Opinion Score
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.