fishaudio / fish-speech
Summary
Fish Audio S2 is a text-to-speech (TTS) system developed by Fish Audio, trained on over 10 million hours of audio across approximately 50 languages. It combines reinforcement learning alignment with a Dual-Autoregressive architecture to produce natural, emotionally expressive speech. S2 supports fine-grained inline control of prosody and emotion through natural-language tags such as "[laugh]" or "[super happy]", and it natively supports multi-speaker and multi-turn generation. The S2-Pro model, with 4 billion parameters, is available on HuggingFace. Reported benchmarks: a word error rate (WER) of 0.54% on Chinese and 0.99% on English on Seed-TTS Eval, a posterior mean of 0.515 on the Audio Turing Test, and an 81.88% win rate on EmergentTTS-Eval. For production streaming with SGLang, S2 reaches a Real-Time Factor (RTF) of 0.195 and a throughput of 3,000+ acoustic tokens/s on an NVIDIA H200 GPU.
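To make the RTF figure concrete, here is a small arithmetic sketch. It assumes the common definition of Real-Time Factor as generation time divided by the duration of the audio produced; the summary does not spell out the definition, and the function names are illustrative only.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF under the assumed definition: time spent generating / audio duration."""
    return generation_seconds / audio_seconds


def generation_time(audio_seconds: float, rtf: float) -> float:
    """Time needed to synthesize a clip of the given length at a given RTF."""
    return audio_seconds * rtf


if __name__ == "__main__":
    reported_rtf = 0.195  # figure quoted in the summary (NVIDIA H200, SGLang)
    clip_seconds = 10.0   # hypothetical clip length
    print(generation_time(clip_seconds, reported_rtf))
```

Under this definition, an RTF below 1 means synthesis runs faster than playback: at 0.195, a 10-second clip would take roughly 1.95 seconds to generate.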
Key takeaway
For AI engineers building speech synthesis applications, Fish Audio S2 generates natural, emotionally expressive, multilingual speech. Natural-language tags give fine-grained control over delivery, and streaming through SGLang keeps inference latency low. Consider S2-Pro when benchmark-leading quality and fast voice cloning from minimal reference audio matter.
Key insights
Fish Audio S2 is a multilingual TTS system offering fine-grained control and high fidelity via a Dual-AR architecture and RL alignment.
Principles
- Dual-Autoregressive architecture optimizes inference efficiency and audio fidelity.
- Reinforcement learning alignment improves semantic accuracy and instruction adherence.
- Natural language tags enable granular prosody and emotion control.
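The two-level decoding idea behind a Dual-Autoregressive design can be sketched as nested loops: a slow model emits one semantic token per audio frame, and a fast model then fills in that frame's residual codebook tokens. The following is a toy illustration only. The stub functions, codebook count, and vocabulary size are hypothetical placeholders, not Fish Audio's implementation.

```python
from typing import List

NUM_RESIDUAL_CODEBOOKS = 3  # hypothetical; the real count is not given here
VOCAB_SIZE = 16             # hypothetical codebook size


def slow_step(text_ids: List[int], frames: List[List[int]]) -> int:
    """Stand-in for the Slow AR: one semantic token per frame,
    conditioned on the text and on the semantic tokens of past frames."""
    past = sum(frame[0] for frame in frames)
    return (sum(text_ids) + past + len(frames)) % VOCAB_SIZE


def fast_step(semantic: int, residuals: List[int]) -> int:
    """Stand-in for the Fast AR: next residual token for the current frame,
    conditioned on the semantic token and the residuals emitted so far."""
    return (semantic * 7 + sum(residuals) + len(residuals) + 1) % VOCAB_SIZE


def generate(text_ids: List[int], num_frames: int) -> List[List[int]]:
    """Return one row per frame: [semantic, residual_1, ..., residual_K]."""
    frames: List[List[int]] = []
    for _ in range(num_frames):  # outer loop runs along time
        semantic = slow_step(text_ids, frames)
        residuals: List[int] = []
        for _ in range(NUM_RESIDUAL_CODEBOOKS):  # inner loop runs along codebook depth
            residuals.append(fast_step(semantic, residuals))
        frames.append([semantic] + residuals)
    return frames


if __name__ == "__main__":
    for row in generate([1, 2, 3], num_frames=4):
        print(row)
```

The point of the split is that the expensive model runs once per frame while a cheaper model handles the per-codebook detail, which is one plausible reading of the efficiency claim above.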
Method
S2 employs a Dual-Autoregressive architecture with a Slow AR for semantic codebooks and a Fast AR for residual codebooks, enhanced by Group Relative Policy Optimization (GRPO) for post-training alignment.
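The core of GRPO is that each sampled output is scored relative to the other outputs sampled for the same prompt, rather than against a learned value function. A minimal sketch of that group-relative advantage follows. The reward values are made up, the use of population standard deviation is an assumption, and how S2 defines its rewards is not described here.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own sampling group:
    A_i = (r_i - mean(r)) / (std(r) + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for four candidate utterances of one prompt.
    rewards = [0.2, 0.9, 0.5, 0.4]
    print(group_relative_advantages(rewards))
```

Candidates that beat their group's average receive a positive advantage and are reinforced; those below average are pushed down. If every candidate scores the same, all advantages are zero and the group contributes no gradient signal.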
In practice
- Use S2 for high-quality multilingual TTS without phoneme preprocessing.
- Employ natural-language tags for precise emotional and prosodic control.
- Leverage SGLang for efficient, low-latency production streaming.
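When preparing tagged prompts, it can help to check on the client side which spans are tags and which are spoken text. The helper below is a hypothetical pre-processing sketch, not part of the fish-speech API. It assumes tags are written in square brackets, as in the "[laugh]" and "[super happy]" examples; the model itself is given the tagged text directly.

```python
import re
from typing import List, Tuple

# Matches a bracketed tag such as [laugh] or [super happy].
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_tags(text: str) -> List[Tuple[str, str]]:
    """Split a prompt into ordered ("tag", ...) and ("text", ...) segments."""
    parts: List[Tuple[str, str]] = []
    pos = 0
    for match in TAG_PATTERN.finditer(text):
        plain = text[pos:match.start()].strip()
        if plain:
            parts.append(("text", plain))
        parts.append(("tag", match.group(1).strip()))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        parts.append(("text", tail))
    return parts


if __name__ == "__main__":
    prompt = "[super happy] We shipped it! [laugh] Finally."
    for kind, value in split_tags(prompt):
        print(kind, repr(value))
```

A check like this makes it easy to log which tags a prompt uses or to catch an unclosed bracket before sending the text for synthesis.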
Topics
- Text-to-Speech
- Dual-Autoregressive Architecture
- Reinforcement Learning
- Multilingual Speech Synthesis
- Voice Cloning
Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by GitHub Trending: All languages.