ZONOS2 Technical Report
Summary
ZONOS2 8B is a new text-to-speech (TTS) model that achieves state-of-the-art naturalness, prosody, and voice cloning fidelity. It improves on its predecessor, Zonos-v0.1, by scaling from 1.6 billion to 8 billion total parameters (900 million active) through a novel Mixture-of-Experts (MoE) backbone, which also improves inference latency and throughput. The training corpus was expanded from 200,000 hours to over 6 million hours using a new data processing pipeline. The post-training and conditioning recipes were also simplified to further boost naturalness and voice cloning. Evaluated on quality, speaker similarity, word error rate (WER), and the ZTTS1-Eval benchmark, ZONOS2 8B performs competitively with other state-of-the-art systems while maintaining good streaming latency. The model weights and example inference code are released under an Apache 2.0 license on GitHub and Hugging Face.
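As a rough illustration of the WER metric mentioned above, the following sketch computes a word-level edit distance between a reference text and a hypothesis transcript, divided by the reference length. For TTS, the hypothesis is typically an ASR transcript of the synthesized audio; the summary does not describe the report's actual evaluation setup, so the function name `word_error_rate` and the lowercase-and-split normalization are assumptions made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words.

    Illustrative only: normalization here is just lowercasing and whitespace
    splitting, which is an assumption and not the report's procedure.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Lower is better: a WER of 0 means the transcript matches the reference word for word.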
Key takeaway
For machine learning engineers evaluating text-to-speech systems, ZONOS2 8B combines state-of-the-art naturalness and voice cloning fidelity with efficient inference: its Mixture-of-Experts architecture keeps only 900 million of its 8 billion parameters active. The weights and example inference code are released under Apache 2.0, so the model is worth considering for applications that need high-quality, scalable speech synthesis with good streaming latency.
Key insights
ZONOS2 8B achieves state-of-the-art TTS by scaling parameters and training data and by simplifying its post-training and conditioning recipes.
Principles
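- MoE backbones improve inference latency and throughput for large models by keeping only a fraction of the parameters active.
- Scaling the training corpus (here, from 200,000 to over 6 million hours) improves TTS quality.
- Simplified post-training and conditioning recipes boost naturalness and voice cloning.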
Method
The model scales from 1.6B to 8B total parameters (900M active) using a novel Mixture-of-Experts backbone, trains on over 6M hours of data (up from 200,000), and simplifies the post-training and conditioning recipes.
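To make the "total versus active parameters" distinction concrete, here is a minimal sketch of top-k expert routing, the general mechanism behind most MoE layers. It is not the ZONOS2 architecture: the summary gives only the 8B total and 900M active figures, so the expert count, `TOP_K`, the layer width, and all function names below are hypothetical choices for illustration.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(dim, rng):
    """Hypothetical expert: a single dim x dim linear map with random weights."""
    weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]

    def expert(x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

    return expert


def moe_forward(x, router_weights, experts, top_k):
    """Route one token vector to its top_k experts and mix their outputs."""
    # The router scores every expert, but only the top_k are evaluated.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    gates = softmax([logits[i] for i in chosen])  # renormalize over chosen experts

    output = [0.0] * len(x)
    for gate, index in zip(gates, chosen):
        expert_out = experts[index](x)
        output = [o + gate * y for o, y in zip(output, expert_out)]
    return output, chosen


# All sizes below are assumptions for the sketch, not values from the report.
DIM, NUM_EXPERTS, TOP_K = 4, 8, 2
rng = random.Random(0)
experts = [make_expert(DIM, rng) for _ in range(NUM_EXPERTS)]
router_weights = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

token = [0.5, -1.0, 0.25, 2.0]
output, chosen = moe_forward(token, router_weights, experts, TOP_K)
print(f"experts evaluated for this token: {len(chosen)} of {NUM_EXPERTS}")

# Active share implied by the figures in the summary (900M active of 8B total).
active_fraction = 0.9e9 / 8e9
print(f"active parameter fraction: {active_fraction:.2%}")
```

Because each token passes through only the selected experts, per-token compute tracks the active parameter count rather than the total, which is the usual reason an MoE backbone can help latency and throughput at a given model size.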
In practice
- Use an MoE backbone for efficient, large-scale TTS.
- Use ZONOS2 8B for high-fidelity voice cloning.
- Integrate the Apache 2.0 licensed weights and example inference code into TTS applications.
Topics
- Text-to-Speech
- Mixture-of-Experts
- Voice Cloning
- Model Scaling
- Speech Synthesis
- Open-Source Models
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.