KIT's Submission to Cross-Lingual Voice Cloning in IWSLT 2026

2026-06-08 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Speech Technology · Depth: Expert, long

Summary

KIT's submission to the IWSLT 2026 Cross-Lingual Voice Cloning track targets two failure modes of multilingual text-to-speech (TTS) systems: accent leakage and mispronunciation of domain-specific terms. Building on the FishAudio-S2-Pro model, the team adds language tag prompting, reinforcement learning (RL) fine-tuning with Group Relative Policy Optimization (GRPO), and reference-conditioned lexical matching. The systems are evaluated on the ACL 60/60 dataset, cloning from English reference speech into French, Arabic, and Chinese. Results indicate that language prompting provides the largest gains, significantly reducing English pronunciation bias and improving target language identification. Lexical matching consistently improves the pronunciation of domain-specific terms, while RL fine-tuning further refines pronunciation consistency and stabilizes overall performance without degrading speaker similarity or perceptual quality.

Key takeaway

For NLP engineers building cross-lingual TTS systems, the paper points to three practical steps. Add explicit language tags to the prompt, preferably in the target language's native script, to reduce accent leakage and keep the output in the target language. Apply RL fine-tuning to stabilize performance. Use reference-conditioned lexical matching to get domain-specific terms pronounced correctly. In the reported results, these methods improve pronunciation and target-language consistency without degrading speaker similarity or perceptual quality.

Key insights

Explicit language tags and lexical matching significantly improve cross-lingual voice cloning performance and reduce accent leakage.

Principles

Language tags reduce cross-lingual pronunciation drift.
Lexical matching improves domain-specific term pronunciation.
RL fine-tuning stabilizes cross-lingual TTS performance.
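
The summary does not spell out how reference-conditioned lexical matching works. One plausible reading, offered only as a sketch, is to choose the reference utterance whose transcript shares the most Latin-script terms with the target text, so that the reference audio already contains those terms. The functions `latin_terms` and `pick_reference` and the minimum-length filter are hypothetical and not taken from the paper.

```python
import re
from typing import Optional

# Latin-script tokens; in Chinese or Arabic text these are usually code-switched terms.
TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9-]*")


def latin_terms(text: str, min_len: int = 3) -> set[str]:
    """Lower-cased Latin-script tokens of at least `min_len` characters."""
    return {t.lower() for t in TOKEN.findall(text) if len(t) >= min_len}


def pick_reference(
    target_text: str, references: list[tuple[str, str]]
) -> tuple[Optional[str], set[str]]:
    """Return the id of the reference whose transcript shares the most terms
    with the target text, plus the shared terms.

    `references` is a list of (reference_id, transcript) pairs.
    Returns (None, set()) when no reference shares any term.
    """
    target = latin_terms(target_text)
    best_id, best_shared = None, set()
    for ref_id, transcript in references:
        shared = target & latin_terms(transcript)
        if len(shared) > len(best_shared):
            best_id, best_shared = ref_id, shared
    return best_id, best_shared


references = [
    ("a", "We evaluate on the benchmark."),
    ("b", "We fine-tune BERT with a Transformer encoder."),
]
best, shared = pick_reference("我们使用 BERT 和 Transformer 模型。", references)
```

Plain token overlap is the simplest possible matcher. A real system would likely need a curated term list or fuzzy matching to handle inflected or transliterated terms.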

Method

Start from FishAudio-S2-Pro and add three components in sequence: language tag prompting, then RL fine-tuning with GRPO using character error rate (CER) and SSIM rewards, and finally reference-conditioned lexical matching for domain terms.
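
To make the RL step concrete, here is a minimal sketch of a GRPO-style reward and advantage computation. It assumes the reward combines a negated CER (computed from an ASR transcript of the generated audio) with a similarity score, reading SSIM as a speaker-similarity score in [0, 1]; that reading, the weights `w_cer` and `w_sim`, and all function names are assumptions, not details confirmed by the source. The group-relative advantage itself follows the standard GRPO recipe: normalize each sample's reward by the mean and standard deviation of its group.

```python
import statistics


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def reward(target_text: str, asr_transcript: str, similarity: float,
           w_cer: float = 1.0, w_sim: float = 1.0) -> float:
    """Hypothetical scalar reward: lower CER and higher similarity are better."""
    return -w_cer * cer(target_text, asr_transcript) + w_sim * similarity


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mean) / (std + eps) for r in rewards]


# One group: several synthesized candidates for the same target sentence.
target = "bonjour"
candidates = [("bonjour", 0.80), ("bonjur", 0.85), ("banjo", 0.90)]
rewards = [reward(target, hyp, sim) for hyp, sim in candidates]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group, no separate value network is needed; candidates that beat their siblings on the combined reward are reinforced.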

In practice

Use native-script language tags for stronger conditioning.
Apply GRPO-based RL fine-tuning to stabilize pronunciation consistency.
Implement lexical matching for domain-specific terms.
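
As an illustration of language tag prompting, the sketch below prepends a language tag to the text before synthesis, using either a native-script or an English tag. The bracketed tag format, the tag strings, and the `tag_text` helper are hypothetical; the format the model actually expects is not given in this summary.

```python
# Hypothetical tag tables for the three target languages in the evaluation.
NATIVE_TAGS = {"fr": "Français", "ar": "العربية", "zh": "中文"}
ENGLISH_TAGS = {"fr": "French", "ar": "Arabic", "zh": "Chinese"}


def tag_text(text: str, lang: str, native: bool = True) -> str:
    """Prefix `text` with a language tag so the TTS model is told the target language.

    The "[tag] text" layout is an assumed format chosen for illustration.
    """
    tags = NATIVE_TAGS if native else ENGLISH_TAGS
    if lang not in tags:
        raise KeyError(f"no language tag defined for {lang!r}")
    return f"[{tags[lang]}] {text}"


prompt_native = tag_text("Bonjour à tous.", "fr")
prompt_english = tag_text("Bonjour à tous.", "fr", native=False)
```

Keeping both tables makes it easy to compare native-script and English tags on the same test set, which is the comparison behind the recommendation above.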

Topics

Cross-Lingual Voice Cloning
Text-to-Speech
Reinforcement Learning
Language Tagging
Lexical Matching
FishAudio-S2-Pro
IWSLT 2026

Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.