Closing the Quality Gap in Low-Resource Text-to-Speech: LoRA Fine-Tuning of VoxCPM2 for Khmer and Korean

2026-06-25 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Advanced, quick

Summary

Researchers investigated the quality gap in low-resource text-to-speech (TTS), with Khmer and Korean as test languages, using VoxCPM2, a 2.4B-parameter, tokenizer-free model. They applied a single Low-Rank Adaptation (LoRA) adapter, trained on a shared 26-hour language-tagged corpus, to both the MiniCPM-4 language model backbone and the flow-matching diffusion decoder. Depending on rank, the adapters train only 0.19% to 3.03% of the model's parameters. For Khmer, native-speaker listening tests showed a significant Mean Opinion Score (MOS) increase from 3.85 to 4.23 with the best adapter, at rank 64. The same adapter yielded no quality gain for Korean, a language the base model already handled well, and degraded it at higher ranks. The results suggest that LoRA adaptation helps most where the base model is genuinely weak. They also show that automatic loss and human ratings can disagree: loss was lowest at rank 128, while MOS peaked at rank 64.

Key takeaway

For machine learning engineers building text-to-speech systems for low-resource languages, LoRA fine-tuning of a model like VoxCPM2 can close quality gaps. Prioritize languages where the base model performs poorly; for languages it already handles well, adaptation offers little benefit and can degrade quality at higher ranks. Choose the LoRA rank with human evaluation such as Mean Opinion Score (MOS), since automatic loss may not align with perceived quality.

Key insights

LoRA fine-tuning significantly improves low-resource TTS quality, but effectiveness depends on base model weakness.

Principles

LoRA adaptation targets base model weaknesses.
Human ratings can diverge from automatic loss.
Zero-initialization preserves original model state.
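
The zero-initialization principle can be illustrated with a minimal, dependency-free sketch of a LoRA-adapted linear layer. This is not the paper's code: the dimensions, the scale of 1.0, and all function names are assumptions chosen for illustration. It follows the standard LoRA formulation, in which a frozen weight W receives a low-rank update B·A and B starts at zero, so the adapted layer reproduces the base layer exactly before any training step.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def init_lora(d_out, d_in, rank, rng):
    """A gets a small random init; B is all zeros, so B @ A is zero at the start."""
    A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
    B = [[0.0] * rank for _ in range(d_out)]
    return A, B

def lora_forward(x, W, A, B, scale):
    """y = x W^T + scale * (x A^T) B^T, with W frozen and only A, B trainable."""
    base = matmul(x, transpose(W))
    update = matmul(matmul(x, transpose(A)), transpose(B))
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]

def lora_param_count(d_out, d_in, rank):
    """Trainable parameters of one adapted layer: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

# Hypothetical toy sizes, far smaller than any real TTS model.
rng = random.Random(0)
d_out, d_in, rank = 6, 10, 4
W = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]
x = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(3)]
A, B = init_lora(d_out, d_in, rank, rng)

base_out = matmul(x, transpose(W))
lora_out = lora_forward(x, W, A, B, scale=1.0)
same_at_init = all(
    abs(p - q) < 1e-12
    for p_row, q_row in zip(base_out, lora_out)
    for p, q in zip(p_row, q_row)
)
print(same_at_init, lora_param_count(d_out, d_in, rank))
```

Because the per-layer count is rank × (d_in + d_out), the number of trainable parameters grows linearly with rank, which is why small ranks touch only a small fraction of the model.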

Method

Adapt VoxCPM2 with a single LoRA adapter, trained on a shared, language-tagged corpus and applied to both the LM backbone and the diffusion decoder.
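
As a rough sketch of what a shared, language-tagged corpus might look like in code, the example below merges per-language utterance lists into one shuffled training list. The summary does not describe the actual tag format or data layout, so the `<km>` / `<ko>` prefixes, the file names, and the record fields are all hypothetical.

```python
import random

def tag_utterance(text, lang):
    """Prefix a transcript with a language tag such as '<km>' (hypothetical format)."""
    return f"<{lang}> {text}"

def build_shared_corpus(per_language, seed=0):
    """Merge per-language (audio_path, text) pairs into one shuffled, tagged list."""
    merged = [
        {"audio": audio, "text": tag_utterance(text, lang), "lang": lang}
        for lang, items in per_language.items()
        for audio, text in items
    ]
    # Shuffle so each training batch can mix both languages.
    random.Random(seed).shuffle(merged)
    return merged

# Placeholder entries; the paths and transcripts are illustrative only.
per_language = {
    "km": [("km_0001.wav", "សួស្តី"), ("km_0002.wav", "សួស្តី")],
    "ko": [("ko_0001.wav", "안녕하세요"), ("ko_0002.wav", "안녕하세요")],
}
corpus = build_shared_corpus(per_language)
for record in corpus:
    print(record["lang"], record["text"])
```

A single adapter trained on such a merged list sees both languages under one set of weights, with the tag as the only explicit language signal.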

In practice

Use LoRA for low-resource TTS quality gaps.
Evaluate LoRA ranks with human listening tests.
Combine languages in a shared adaptation corpus.
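
The point about evaluating ranks with listening tests can be sketched as a small selection routine that compares the choice made by MOS with the choice made by validation loss. The ratings and loss values below are made-up placeholders, not data from the study, and the adapter names are hypothetical. The confidence interval uses a simple normal approximation, which is an assumption rather than the paper's stated procedure.

```python
import math
import statistics

def mos_summary(scores):
    """Return (mean, 95% CI half-width) using a normal approximation."""
    mean = statistics.mean(scores)
    half_width = 1.96 * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, half_width

# Made-up listener ratings on a 1-5 scale for two hypothetical adapters.
ratings = {
    "adapter_a": [4, 5, 4, 4, 5, 4],
    "adapter_b": [4, 4, 3, 4, 4, 4],
}
# Made-up validation losses for the same two adapters.
val_loss = {"adapter_a": 0.52, "adapter_b": 0.48}

summary = {name: mos_summary(scores) for name, scores in ratings.items()}
best_by_mos = max(summary, key=lambda name: summary[name][0])
best_by_loss = min(val_loss, key=val_loss.get)

for name, (mean, half_width) in summary.items():
    print(f"{name}: MOS {mean:.2f} +/- {half_width:.2f}, loss {val_loss[name]:.2f}")
print("picked by MOS:", best_by_mos, "| picked by loss:", best_by_loss)
```

In this toy setup the two criteria pick different adapters, which mirrors the kind of mismatch the summary reports between loss and human ratings.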

Topics

Text-to-Speech
Low-Resource NLP
LoRA Fine-tuning
VoxCPM2
Khmer Language
Mean Opinion Score

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.