ZONOS2 Technical Report
Summary
ZONOS2 8B is a new text-to-speech (TTS) model that the report presents as state of the art in naturalness, prosody, and voice cloning fidelity. It scales from Zonos-v0.1's 1.6B parameters to 8B total parameters (900M active) using a novel mixture-of-experts (MoE) backbone, which improves inference latency and throughput. A new data processing pipeline expanded the training corpus from 200K to over 6M hours. Simplified post-training and conditioning recipes further improve naturalness and voice cloning. ZONOS2 8B was evaluated on quality, speaker similarity, word error rate (WER), and the ZTTS1-Eval benchmark, where it is competitive with other state-of-the-art systems while maintaining good streaming latency. The model weights and example inference code are released under an Apache 2.0 license on GitHub and Hugging Face.
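WER appears among the evaluation metrics above. For readers less familiar with it, here is a minimal sketch of word error rate as a word-level edit distance divided by the reference length. The function name and the lowercase-and-split normalization are assumptions made for illustration. The summary does not describe which ASR system or text normalization the report used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

In a TTS evaluation the hypothesis would typically be an ASR transcript of the synthesized audio and the reference the input text, so a lower WER suggests more intelligible speech.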
Key takeaway
For AI scientists and machine learning engineers evaluating TTS systems, ZONOS2 8B is an open-weight option worth considering. It pairs the reported naturalness, prosody, and voice cloning fidelity with the inference efficiency of its MoE architecture, which makes it a strong candidate for integration. Its weights and example code are Apache 2.0 licensed, so they can be used directly in speech synthesis projects where both quality and latency matter.
Key insights
ZONOS2 8B achieves SOTA TTS performance through MoE scaling and massive data expansion.
Principles
- Scaling with MoE improves inference efficiency.
- Large, clean data enhances TTS quality.
- Simplified recipes boost naturalness and fidelity.
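The first principle rests on sparse activation: a router picks a few experts per input, so only a fraction of the total parameters do work on any one step. The sketch below shows generic top-k gating in plain Python. It is not the ZONOS2 backbone: the summary does not state the router design, the number of experts, or the value of k. The names `top_k_route` and `moe_forward` are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))

def moe_forward(x, experts, router_logits, k=2):
    """Run only the k selected experts and mix their outputs by router weight."""
    routes = top_k_route(router_logits, k)
    output = [0.0] * len(x)
    for index, weight in routes:
        expert_out = experts[index](x)  # the other experts are never evaluated
        output = [o + weight * y for o, y in zip(output, expert_out)]
    return output, [index for index, _ in routes]

# Toy experts: each one scales the input by a fixed factor.
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
output, used = moe_forward([1.0, 2.0], experts, [0.1, 2.0, -1.0, 2.0], k=2)
print(used, output)  # [1, 3] [3.0, 6.0]
```

Here only two of the four toy experts run, which is the same mechanism that lets a model hold many total parameters while activating a smaller subset per step.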
Method
ZONOS2 8B was developed by scaling from 1.6B to 8B parameters using a novel MoE backbone, expanding training data to 6M hours via a new pipeline, and simplifying post-training recipes.
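To put the scaling figures in proportion, the short calculation below uses only the numbers quoted in the summary (1.6B and 8B parameters, 900M active, 200K and 6M hours). The helper name `scaling_ratios` is hypothetical. Because the corpus is described as "over 6M hours", the data growth factor is a lower bound.

```python
def scaling_ratios(prev_params, total_params, active_params, prev_hours, new_hours):
    """Return simple ratios describing how the model and its corpus grew."""
    return {
        # Share of the total parameters that are active at inference time.
        "active_fraction": active_params / total_params,
        # Growth in total parameter count over the previous model.
        "param_growth": total_params / prev_params,
        # Growth in training audio hours (lower bound if new_hours is a floor).
        "data_growth": new_hours / prev_hours,
    }

ratios = scaling_ratios(
    prev_params=1.6e9,     # Zonos-v0.1
    total_params=8e9,      # ZONOS2 8B, total
    active_params=900e6,   # ZONOS2 8B, active
    prev_hours=200_000,
    new_hours=6_000_000,   # "over 6M hours"
)
for name, value in ratios.items():
    print(f"{name}: {value:.4g}")
```

By these figures, total parameters grew 5x and training data at least 30x, while roughly 11% of the parameters are active at a time.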
In practice
- Utilize MoE for large-scale TTS models.
- Prioritize the data pipeline for quality gains.
- Simplify post-training for fidelity improvements.
Topics
- Text-to-Speech
- ZONOS2
- Mixture-of-Experts
- Voice Cloning
- Model Scaling
- Apache 2.0 License
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.