KIT's Submission to Cross-Lingual Voice Cloning in IWSLT 2026
Summary
KIT's submission to the IWSLT 2026 Cross-Lingual Voice Cloning track introduces methods to improve multilingual text-to-speech (TTS) systems, specifically addressing accent leakage and mispronunciation of domain-specific terms. Building on the FishAudio-S2-Pro model, the team implemented language tag prompting, reinforcement learning (RL) fine-tuning using Group Relative Policy Optimization (GRPO), and reference-conditioned lexical matching. The system was evaluated on the ACL 60/60 dataset, cloning English reference speech into French, Arabic, and Chinese. Language prompting provides the largest gains, reducing English pronunciation bias and improving target language identification. Lexical matching consistently improves the pronunciation of domain-specific terms, while RL fine-tuning further refines pronunciation consistency and stabilizes overall performance without degrading speaker similarity or perceptual quality.
Key takeaway
For NLP engineers developing cross-lingual TTS systems, this research offers concrete strategies to improve output quality. Integrate explicit language tag prompting, especially native-script tags, to reduce accent leakage and improve target language consistency. Consider applying reinforcement learning fine-tuning to stabilize performance, and implement reference-conditioned lexical matching for accurate pronunciation of domain-specific terms. Together, these methods can improve intelligibility and naturalness in multilingual speech generation.
Key insights
Explicit language tags and lexical matching significantly improve cross-lingual voice cloning performance and reduce accent leakage.
Principles
- Language tags reduce cross-lingual pronunciation drift.
- Lexical matching improves domain-specific term pronunciation.
- RL fine-tuning stabilizes cross-lingual TTS performance.
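The first principle can be made concrete with a small prompt-building sketch. The summary does not give the tag strings or the prompt layout used in the submission, so the bracketed prefix format, the tag tables, and the function name `add_language_tag` below are all illustrative assumptions, not the authors' format.

```python
# Hypothetical tag tables: the real tag strings used in the submission
# are not given in the summary, so these are assumptions for illustration.
ENGLISH_TAGS = {"fr": "French", "ar": "Arabic", "zh": "Chinese"}
NATIVE_TAGS = {"fr": "Français", "ar": "العربية", "zh": "中文"}


def add_language_tag(text: str, lang: str, native_script: bool = True) -> str:
    """Prefix the target text with a bracketed language tag.

    The "[tag] text" layout is an assumed prompt format, not a documented one.
    """
    table = NATIVE_TAGS if native_script else ENGLISH_TAGS
    if lang not in table:
        raise ValueError(f"unsupported language code: {lang}")
    return f"[{table[lang]}] {text}"
```

Calling `add_language_tag("Bonjour à tous.", "fr")` would yield a prompt tagged in native script, while passing `native_script=False` would fall back to the English tag name, which makes it easy to compare the two conditioning styles.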
Method
Build on FishAudio-S2-Pro: first apply language tag prompting, then RL fine-tuning with GRPO using CER (character error rate) and SSIM rewards, and finally reference-conditioned lexical matching for domain terms.
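The reward side of the GRPO step can be sketched as follows. This is a minimal illustration under several assumptions: SSIM is read here as a speaker-similarity score in [0, 1], the two rewards are combined as a simple weighted sum with a hypothetical `sim_weight`, and the advantage uses the population standard deviation. None of these choices is confirmed by the summary; only the group-relative normalization itself is the standard GRPO idea.

```python
from statistics import mean, pstdev


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def reward(reference: str, hypothesis: str, speaker_sim: float,
           sim_weight: float = 1.0) -> float:
    """Assumed reward: lower CER and higher speaker similarity are both better."""
    return -cer(reference, hypothesis) + sim_weight * speaker_sim


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In use, one would sample several syntheses for the same prompt, transcribe each with an ASR model, score speaker similarity against the reference audio, and pass the resulting rewards to `group_relative_advantages`. Samples that beat their group average receive positive advantages, so no separate value model is needed.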
In practice
- Use native-script language tags for stronger conditioning.
- Employ RL fine-tuning (GRPO) to stabilize pronunciation consistency without degrading speaker similarity.
- Implement lexical matching for domain-specific terms.
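The summary does not describe how reference-conditioned lexical matching works internally. One plausible reading, offered purely as an illustration, is that domain terms spoken in the English reference are located in the target text so they can receive special pronunciation handling. The function `match_reference_terms`, the glossary input, and the matching rule below are all hypothetical.

```python
import re


def match_reference_terms(reference_transcript: str, target_text: str,
                          glossary: list[str]) -> list[tuple[int, int, str]]:
    """Find glossary terms that occur in the reference AND verbatim in the target.

    Returns (start, end, matched_text) spans in the target text, sorted by
    position. This matching rule is an assumption, not the authors' method.
    """
    ref_lower = reference_transcript.lower()
    # Keep only terms the reference speaker actually said.
    in_reference = [term for term in glossary if term.lower() in ref_lower]
    matches = []
    for term in in_reference:
        # Whole-term, case-insensitive match; lookarounds avoid partial words.
        pattern = re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE)
        for m in pattern.finditer(target_text):
            matches.append((m.start(), m.end(), m.group(0)))
    return sorted(matches)
```

For an English reference mentioning "BERT" and a French target sentence that keeps "BERT" untranslated, the function would return the span of that term in the target, which a downstream step could then route to a pronunciation consistent with the reference.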
Topics
- Cross-Lingual Voice Cloning
- Text-to-Speech
- Reinforcement Learning
- Language Tagging
- Lexical Matching
- FishAudio-S2-Pro
- IWSLT 2026
Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.