fishaudio / fish-speech

2023-10-10 · Source: Github Trending: All languages · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Advanced, short

Summary

Fish Audio S2 is a text-to-speech (TTS) system developed by Fish Audio, trained on over 10 million hours of audio across approximately 50 languages. It uses reinforcement learning alignment and a Dual-Autoregressive (Dual-AR) architecture to produce natural, emotionally expressive speech. S2 supports fine-grained inline control of prosody and emotion through natural-language tags such as "[laugh]" or "[super happy]", and it natively supports multi-speaker and multi-turn generation. The S2-Pro model, with 4 billion parameters, is available on HuggingFace. Reported benchmarks: a word error rate (WER) of 0.54% (Chinese) and 0.99% (English) on Seed-TTS Eval, a posterior mean of 0.515 on the Audio Turing Test, and an 81.88% win rate on EmergentTTS-Eval. For production streaming, S2 runs on SGLang, with a reported Real-Time Factor (RTF) of 0.195 and throughput of 3,000+ acoustic tokens/s on an NVIDIA H200 GPU.
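To make the two headline metrics concrete, here is a minimal sketch of how RTF and WER are commonly defined. It is not the Seed-TTS Eval harness: the function names are hypothetical, and the whitespace tokenization is an assumption that suits English but would need a segmenter for Chinese.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()  # assumption: whitespace-separated words
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, an RTF of 0.195 corresponds to roughly 1.95 seconds of compute for 10 seconds of audio, and a WER of 0.99% corresponds to about one word error per hundred reference words.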

Key takeaway

For AI engineers building speech synthesis applications, Fish Audio S2 generates natural, emotionally expressive speech across approximately 50 languages. Inline natural-language tags give fine-grained control over prosody and emotion, and streaming through SGLang keeps inference latency low (reported RTF of 0.195 on an NVIDIA H200 GPU). Consider evaluating S2-Pro, the 4-billion-parameter model, for its benchmark results and for voice cloning from minimal reference audio.

Key insights

Fish Audio S2 is a multilingual TTS system offering fine-grained control and high fidelity via a Dual-AR architecture and RL alignment.

Principles

Dual-Autoregressive architecture splits generation between a Slow AR and a Fast AR to balance inference efficiency and audio fidelity.
Reinforcement learning alignment improves semantic accuracy and instruction adherence.
Natural language tags enable granular prosody and emotion control.
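The reinforcement learning alignment named above is identified in this summary as Group Relative Policy Optimization (GRPO). As a rough illustration of the "group relative" part only, the sketch below normalizes rewards within a group of samples generated for the same prompt. The reward values are made up, the function name is hypothetical, and the use of the population standard deviation is an assumption; the source does not describe S2's reward model or training loop.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Score each sample relative to the other samples in its group.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    that beat the group average get a positive advantage and the rest get a
    negative one. No separate value network is needed for this baseline.
    """
    if len(rewards) < 2:
        raise ValueError("a group needs at least two samples")
    group_mean = mean(rewards)
    group_std = pstdev(rewards)  # assumption: population standard deviation
    return [(r - group_mean) / (group_std + eps) for r in rewards]


# Hypothetical rewards for three candidate utterances of the same prompt.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

In a full GRPO setup these advantages would weight a clipped policy-gradient objective; that part is omitted here.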

Method

S2 employs a Dual-Autoregressive architecture with a Slow AR for semantic codebooks and a Fast AR for residual codebooks, enhanced by Group Relative Policy Optimization (GRPO) for post-training alignment.
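The split between the Slow AR and the Fast AR can be pictured as two nested loops. The toy sketch below assumes the Slow AR advances one audio frame at a time and the Fast AR fills in the residual codebooks within that frame. Everything in it is a stand-in: the two step functions are simple arithmetic rather than neural networks, and the codebook count and size are hypothetical values, not S2's actual configuration.

```python
from typing import List

NUM_RESIDUAL_CODEBOOKS = 3  # hypothetical; the source does not give the count
CODEBOOK_SIZE = 16          # hypothetical vocabulary size per codebook


def slow_ar_step(text_ids: List[int], frames: List[List[int]]) -> int:
    """Stand-in for the Slow AR: next semantic token from text and past frames."""
    history = sum(text_ids) + sum(frame[0] for frame in frames) + len(frames)
    return history % CODEBOOK_SIZE


def fast_ar_step(frame_so_far: List[int]) -> int:
    """Stand-in for the Fast AR: next residual token from this frame's tokens."""
    return (sum(frame_so_far) * 7 + len(frame_so_far)) % CODEBOOK_SIZE


def generate(text_ids: List[int], num_frames: int) -> List[List[int]]:
    """Return one list per frame: [semantic token, residual tokens...]."""
    frames: List[List[int]] = []
    for _ in range(num_frames):
        # Outer loop: one semantic token per frame, conditioned on history.
        frame = [slow_ar_step(text_ids, frames)]
        # Inner loop: residual tokens, conditioned only on the current frame.
        for _ in range(NUM_RESIDUAL_CODEBOOKS):
            frame.append(fast_ar_step(frame))
        frames.append(frame)
    return frames


frames = generate(text_ids=[1, 2, 3], num_frames=4)
```

The point of the structure is that the expensive, long-context model runs once per frame, while the cheaper model handles the short per-frame sequence of residual tokens.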

In practice

Use S2 for high-quality multilingual TTS without phoneme preprocessing.
Employ natural-language tags for precise emotional and prosodic control.
Leverage SGLang for efficient, low-latency production streaming.
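The natural-language tags mentioned above are written inline with the text to be spoken. As a small client-side illustration, the sketch below splits a tagged string into tag and text segments, which can help when previewing or validating input before sending it to a TTS backend. This is not S2's own parser or API; the function name and the bracket-matching rule are assumptions based on the "[laugh]" and "[super happy]" examples in the summary.

```python
import re
from typing import List, Tuple

# Assumption: a tag is any non-empty run of characters between square brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_tags(text: str) -> List[Tuple[str, str]]:
    """Split tagged text into ("tag", name) and ("text", content) segments."""
    segments: List[Tuple[str, str]] = []
    pos = 0
    for match in TAG_PATTERN.finditer(text):
        # Plain text between the previous tag and this one.
        chunk = text[pos:match.start()].strip()
        if chunk:
            segments.append(("text", chunk))
        segments.append(("tag", match.group(1).strip()))
        pos = match.end()
    # Any plain text after the last tag.
    tail = text[pos:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


segments = split_tags("[super happy] We shipped it! [laugh] Finally.")
```

Each segment keeps its position in the sentence, which matches the idea that a tag affects the speech that follows it.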

Topics

Text-to-Speech
Dual-Autoregressive Architecture
Reinforcement Learning
Multilingual Speech Synthesis
Voice Cloning

Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Github Trending: All languages.