A Guide to Voice Cloning on Voxtral with a Missing Encoder

2026-04-10 · Source: Towards Data Science · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Speech Technology · Depth: Expert, long

Summary

Mistral recently released Voxtral-4B-TTS, a 4-billion-parameter text-to-speech model that reportedly outperforms ElevenLabs v2.5 Flash in internal tests. The model pairs an autoregressive 3B LLM backbone with an audio autoencoder, Voxtral Codec, which uses 37 discrete tokens for each 80 ms audio frame and so enables native audio streaming. Mistral announced voice cloning and published the model weights, but omitted the encoder weights of the audio autoencoder, which limits users to the predefined voices. This analysis details the Voxtral TTS architecture, examines what the audio autoencoder does, and explores how to reconstruct audio codes for voice cloning without the encoder, using coordinate descent and gradient-based optimization with additional STFT and speaker diarization losses.
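The two figures in the summary (37 tokens per frame, 80 ms per frame) fix the token rate the backbone has to sustain for streaming. A minimal arithmetic sketch follows; the function names are illustrative and not part of any Voxtral API.

```python
# Token-rate arithmetic from the two figures given in the summary.
FRAME_MS = 80          # duration of one audio frame, in milliseconds
TOKENS_PER_FRAME = 37  # discrete tokens emitted per frame


def frames_per_second(frame_ms: int = FRAME_MS) -> float:
    """Number of audio frames needed to cover one second."""
    return 1000 / frame_ms


def tokens_per_second(frame_ms: int = FRAME_MS,
                      tokens_per_frame: int = TOKENS_PER_FRAME) -> float:
    """Discrete tokens the model must produce per second of audio."""
    return frames_per_second(frame_ms) * tokens_per_frame


def tokens_for_clip(seconds: float,
                    frame_ms: int = FRAME_MS,
                    tokens_per_frame: int = TOKENS_PER_FRAME) -> int:
    """Tokens needed to cover a clip, rounding up to whole frames."""
    clip_ms = round(seconds * 1000)
    frames = -(-clip_ms // frame_ms)  # ceiling division on integers
    return frames * tokens_per_frame


if __name__ == "__main__":
    print(frames_per_second())   # 12.5
    print(tokens_per_second())   # 462.5
    print(tokens_for_clip(10))   # 4625
```

A 10-second reference clip therefore corresponds to 125 frames and 4,625 discrete codes, which is the size of the search space any code-reconstruction method has to cover.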

Key takeaway

For ML engineers building custom TTS solutions, the practical point is that Voxtral-4B-TTS's missing encoder weights block direct voice cloning but do not rule it out. You can still reconstruct audio codes for a reference clip with gradient-based optimization, adding spectral and speaker embedding losses. This approximates voice cloning and extends the model beyond its predefined voices, potentially enabling custom voice integration.

Key insights

Voxtral-4B-TTS offers high-quality TTS, but its voice cloning is limited by missing encoder weights.

Principles

Discrete audio tokens enable native streaming.
Semantic tokens do not directly represent words.
Overfitting can be an objective for code reconstruction.

Method

Reconstruct audio codes by initializing a learnable `nn.Parameter` tensor of code logits, applying straight-through estimators for the discrete tokens, and training against the frozen decoder with L1, Short-Time Fourier Transform, and speaker diarization losses.
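The method can be sketched in PyTorch as below. This is a toy illustration under stated assumptions, not the article's implementation: `ToyDecoder` is a hypothetical stand-in for the frozen Voxtral Codec decoder, all sizes are made up, and the speaker loss is omitted because it needs a pretrained speaker model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy sizes, chosen for illustration only.
NUM_FRAMES, CODEBOOK_SIZE, CODE_DIM, SAMPLES_PER_FRAME = 8, 16, 4, 64


class ToyDecoder(nn.Module):
    """Hypothetical frozen decoder: one-hot codes -> waveform samples."""

    def __init__(self):
        super().__init__()
        self.codebook = nn.Embedding(CODEBOOK_SIZE, CODE_DIM)
        self.to_audio = nn.Linear(CODE_DIM, SAMPLES_PER_FRAME)

    def forward(self, one_hot):                 # (frames, codebook_size)
        emb = one_hot @ self.codebook.weight    # (frames, code_dim)
        return torch.tanh(self.to_audio(emb)).reshape(-1)


def straight_through_one_hot(logits):
    """Forward pass: hard one-hot. Backward pass: gradient of the softmax."""
    soft = F.softmax(logits, dim=-1)
    hard = F.one_hot(soft.argmax(dim=-1), logits.shape[-1]).to(soft.dtype)
    return hard + soft - soft.detach()


def stft_loss(pred, target, n_fft=64, hop=16):
    """L1 distance between STFT magnitudes of two waveforms."""
    window = torch.hann_window(n_fft)

    def magnitude(x):
        return torch.stft(x, n_fft, hop_length=hop, window=window,
                          return_complex=True).abs()

    return F.l1_loss(magnitude(pred), magnitude(target))


# Freeze the decoder: only the code logits are trained.
decoder = ToyDecoder().eval()
for p in decoder.parameters():
    p.requires_grad_(False)

# Reference audio, synthesized here from known codes so the toy is self-contained.
true_codes = torch.randint(0, CODEBOOK_SIZE, (NUM_FRAMES,))
target = decoder(F.one_hot(true_codes, CODEBOOK_SIZE).float())

# Learnable logits over the codebook for every frame.
logits = nn.Parameter(0.01 * torch.randn(NUM_FRAMES, CODEBOOK_SIZE))
optimizer = torch.optim.Adam([logits], lr=0.1)

losses = []
for step in range(200):
    optimizer.zero_grad()
    audio = decoder(straight_through_one_hot(logits))
    loss = F.l1_loss(audio, target) + stft_loss(audio, target)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

recovered = logits.argmax(dim=-1)  # discrete codes to hand to the TTS model
```

Fitting a single clip as closely as possible is the goal here, so there is no validation split or regularization. Straight-through optimization is noisy, and the recovered codes are not guaranteed to match the ones that produced the reference audio.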

In practice

Use Coordinate Descent to extract codes from reference voice embeddings.
Implement straight-through estimators for discrete token optimization.
Incorporate STFT and speaker diarization losses for better audio reconstruction.
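The coordinate descent step in the list above can be illustrated with a small NumPy sketch. It assumes, purely for illustration, that a frame embedding is the sum of one entry from each of several codebooks; the codebooks, sizes, and function names are hypothetical and do not reflect Voxtral's actual layout.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen for illustration only.
NUM_CODEBOOKS, CODEBOOK_SIZE, DIM = 4, 8, 16

# Hypothetical codebooks: shape (codebook, entry, embedding dimension).
codebooks = rng.normal(size=(NUM_CODEBOOKS, CODEBOOK_SIZE, DIM))


def embed(codes):
    """Assumed embedding model: sum of the selected entry of each codebook."""
    return codebooks[np.arange(NUM_CODEBOOKS), codes].sum(axis=0)


def coordinate_descent(target, sweeps=5):
    """Update one codebook index at a time, holding the others fixed."""
    codes = np.zeros(NUM_CODEBOOKS, dtype=int)
    errors = [float(np.linalg.norm(embed(codes) - target))]
    for _ in range(sweeps):
        for k in range(NUM_CODEBOOKS):
            # Contribution of every codebook except k.
            partial = embed(codes) - codebooks[k, codes[k]]
            # Distance to the target for each candidate entry of codebook k.
            dists = np.linalg.norm(partial + codebooks[k] - target, axis=1)
            codes[k] = int(dists.argmin())
            errors.append(float(dists.min()))
    return codes, errors


# A reference embedding built from known codes, so the toy is self-contained.
true_codes = rng.integers(0, CODEBOOK_SIZE, size=NUM_CODEBOOKS)
target = embed(true_codes)

codes, errors = coordinate_descent(target)
print("recovered codes:", codes)
print("final error:", errors[-1])
```

Each update keeps the current entry as a candidate, so the error never increases from one step to the next. The procedure can still stall in a local minimum, which is one reason to combine it with the gradient-based losses listed above.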

Topics

Voxtral-4B-TTS
Voice Cloning
Audio Autoencoder
Missing Encoder
Straight-Through Estimator

Best for: AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.