A Guide to Voice Cloning on Voxtral with a Missing Encoder
Summary
Mistral recently released Voxtral-4B-TTS, a 4-billion-parameter text-to-speech model that reportedly outperforms ElevenLabs v2.5 Flash in internal tests. The model combines an autoregressive 3B LLM backbone with an audio autoencoder, Voxtral Codec, which generates 37 discrete tokens for each 80 ms audio frame and so enables native audio streaming. Mistral initially announced voice cloning capabilities and published model weights, but it omitted the encoder weights for the audio autoencoder, which limits users to pre-defined voices. This analysis details the Voxtral TTS architecture, investigates the audio autoencoder's function, and explores a method to reconstruct audio codes for voice cloning despite the missing encoder, using coordinate descent and gradient-based optimization with additional STFT and speaker diarization losses.
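As a quick sanity check on the figures above, the token rate follows from simple arithmetic. The sketch below uses only the two numbers stated in the summary (37 tokens per frame, 80 ms per frame); the helper name `tokens_for` is hypothetical and is here purely for illustration.

```python
import math

TOKENS_PER_FRAME = 37   # discrete tokens per audio frame (from the summary)
FRAME_MS = 80           # duration of one audio frame in milliseconds

frames_per_second = 1000 / FRAME_MS                        # 12.5 frames/s
tokens_per_second = TOKENS_PER_FRAME * frames_per_second   # 462.5 tokens/s


def tokens_for(seconds: float) -> int:
    """Tokens needed to cover `seconds` of audio, rounding up to whole frames."""
    return math.ceil(seconds * frames_per_second) * TOKENS_PER_FRAME


print(frames_per_second)   # 12.5
print(tokens_per_second)   # 462.5
print(tokens_for(10))      # 4625
```

Because each frame is a fixed, small group of tokens, audio can be decoded frame by frame as tokens arrive, which is what makes streaming natural for this design.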
Key takeaway
For ML engineers building custom TTS solutions, the key constraint in Voxtral-4B-TTS is the missing encoder: without it, a reference recording cannot be converted directly into the audio codes the model expects. You can still reconstruct those codes with gradient-based optimization, adding spectral and speaker embedding losses to the objective. The result approximates voice cloning and extends the model beyond its pre-defined voices.
Key insights
Voxtral-4B-TTS offers high-quality TTS, but its voice cloning is limited by missing encoder weights.
Principles
- Discrete audio tokens enable native streaming.
- Semantic tokens do not directly represent words.
- Overfitting can be an objective for code reconstruction.
Method
Reconstruct audio codes by initializing an `nn.Parameter` that holds the learnable codes, applying a straight-through estimator so gradients can pass through the discrete token selection, and training against the target audio with L1, Short-Time Fourier Transform (STFT), and speaker diarization losses.
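The method above can be sketched in PyTorch as follows. This is a toy illustration under stated assumptions, not the article's implementation: the `decoder` is a small hypothetical stand-in for the frozen Voxtral Codec decoder, all sizes are made up, and the speaker loss is left out because it would need a pretrained speaker model. Only the overall pattern (learnable `nn.Parameter`, straight-through one-hot, L1 plus STFT loss) reflects the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
NUM_FRAMES, CODEBOOK_SIZE, SAMPLES_PER_FRAME = 8, 16, 64  # toy sizes, not Voxtral's

# Hypothetical stand-in for the frozen codec decoder: one-hot codes -> waveform.
decoder = nn.Sequential(
    nn.Linear(CODEBOOK_SIZE, 32), nn.Tanh(), nn.Linear(32, SAMPLES_PER_FRAME)
)
for p in decoder.parameters():
    p.requires_grad_(False)


def straight_through_one_hot(logits):
    """Forward pass uses the hard one-hot; backward pass uses the softmax gradient."""
    soft = logits.softmax(dim=-1)
    hard = F.one_hot(soft.argmax(dim=-1), logits.shape[-1]).to(soft.dtype)
    return hard + soft - soft.detach()


def decode(logits):
    return decoder(straight_through_one_hot(logits)).reshape(-1)


def stft_magnitude(wave, n_fft=64):
    window = torch.hann_window(n_fft)
    spec = torch.stft(wave, n_fft=n_fft, hop_length=n_fft // 4,
                      window=window, return_complex=True)
    return spec.abs()


def reconstruction_loss(pred, target):
    # L1 on the waveform plus L1 on STFT magnitudes (speaker loss omitted here).
    return F.l1_loss(pred, target) + F.l1_loss(stft_magnitude(pred), stft_magnitude(target))


# Toy target: audio decoded from a hidden code sequence we then try to recover.
with torch.no_grad():
    true_codes = torch.randint(0, CODEBOOK_SIZE, (NUM_FRAMES,))
    target = decoder(F.one_hot(true_codes, CODEBOOK_SIZE).float()).reshape(-1)

logits = nn.Parameter(torch.zeros(NUM_FRAMES, CODEBOOK_SIZE))  # learnable codes
optimizer = torch.optim.Adam([logits], lr=0.1)

best_loss, recovered_codes = float("inf"), None
for step in range(200):
    optimizer.zero_grad()
    loss = reconstruction_loss(decode(logits), target)
    if step == 0:
        initial_loss = loss.item()
    if loss.item() < best_loss:  # keep the best discrete candidate seen so far
        best_loss, recovered_codes = loss.item(), logits.argmax(dim=-1).clone()
    loss.backward()
    optimizer.step()
```

The loop deliberately overfits to a single target clip, which matches the principle that overfitting is the objective here. Tracking the best discrete candidate is a design choice for this sketch, since a straight-through loss is not guaranteed to decrease at every step.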
In practice
- Use coordinate descent to extract codes from reference voice embeddings.
- Implement straight-through estimators for discrete token optimization.
- Incorporate STFT and speaker diarization losses for better audio reconstruction.
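To make the first bullet concrete, here is a minimal pure-Python sketch of coordinate descent over discrete codes. It rests on an assumption that is not confirmed by this summary: that a frame's embedding is the sum of one vector per codebook. The `tables`, sizes, and function names are all hypothetical.

```python
import random

random.seed(0)
NUM_CODEBOOKS, CODEBOOK_SIZE, DIM = 4, 8, 6  # toy sizes, not Voxtral's

# Hypothetical embedding tables: tables[k][c] is the vector for code c in codebook k.
tables = [
    [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(CODEBOOK_SIZE)]
    for _ in range(NUM_CODEBOOKS)
]


def embed(codes):
    """Sum the selected vector from every codebook (assumed combination rule)."""
    return [sum(tables[k][c][d] for k, c in enumerate(codes)) for d in range(DIM)]


def squared_error(codes, target):
    return sum((a - b) ** 2 for a, b in zip(embed(codes), target))


def coordinate_descent(target, sweeps=10):
    """Optimize one codebook index at a time while holding the others fixed."""
    codes = [0] * NUM_CODEBOOKS
    for _ in range(sweeps):
        changed = False
        for k in range(NUM_CODEBOOKS):
            best = min(
                range(CODEBOOK_SIZE),
                key=lambda c: squared_error(codes[:k] + [c] + codes[k + 1:], target),
            )
            if best != codes[k]:
                codes[k] = best
                changed = True
        if not changed:  # no coordinate improved: a local optimum was reached
            break
    return codes


# Toy reference embedding built from a hidden code assignment.
true_codes = [random.randrange(CODEBOOK_SIZE) for _ in range(NUM_CODEBOOKS)]
target = embed(true_codes)

initial_error = squared_error([0] * NUM_CODEBOOKS, target)
found_codes = coordinate_descent(target)
final_error = squared_error(found_codes, target)
```

Each update can only keep or lower the error, so the search converges, but it may stop at a local optimum rather than the original codes.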
Topics
- Voxtral-4B-TTS
- Voice Cloning
- Audio Autoencoder
- Missing Encoder
- Straight-Through Estimator
Best for: AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.