UniVoice: A Unified Model for Speech and Singing Voice Generation
Summary
UniVoice is a unified speech and singing voice generation framework based on conditional flow matching and a Diffusion Transformer (DiT) backbone. Developed by Giant Network and the Shanghai Conservatory of Music, it addresses the challenge of generating both natural speech and controllable singing from symbolic inputs. The model factorizes conditioning into content, melody, and timbre. For speech, it uses a learned null melody token and lets the model infer prosody, while for singing, MIDI note sequences control the melody explicitly. Trained on 30k hours of speech and 35k hours of singing data, the 0.3B-parameter model achieves a speech PER of 5.26%, comparable to F5-TTS (5.21%), and a singing PER of 16.22%, outperforming Vevo1.5 (24.72%). UniVoice also introduces UniSinging-Eval, a benchmark covering 12 musical styles.
Key takeaway
For AI scientists and ML engineers building voice synthesis systems, UniVoice shows that a single 0.3B-parameter model can handle both speech and singing. Factorized conditioning and a learned null melody token let one shared backbone serve both tasks, reaching a speech PER comparable to F5-TTS (5.26% vs. 5.21%) and a lower singing PER than Vevo1.5 (16.22% vs. 24.72%). The architecture is worth considering for systems that must keep a consistent vocal identity across speech and singing without maintaining separate models.
Key insights
UniVoice unifies speech and singing generation by factorizing conditioning and using a null melody token for speech.
Principles
- Factorized conditioning reduces negative gradient correlation.
- A learned null token gives melody-absent (speech) inputs a trainable representation.
- Shared backbones enable positive transfer across modalities.
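The first principle refers to gradient correlation between tasks. This summary does not say how the paper measures it; one common diagnostic, shown here as a minimal sketch, is the cosine similarity between the speech and singing gradients on the shared parameters. The function name and the toy gradient values are hypothetical and not taken from the paper.

```python
import math


def cosine_similarity(g_a, g_b):
    """Cosine of the angle between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g_a, g_b))
    norm_a = math.sqrt(sum(a * a for a in g_a))
    norm_b = math.sqrt(sum(b * b for b in g_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # undefined direction; treat as uncorrelated
    return dot / (norm_a * norm_b)


# Toy gradients on shared parameters (illustrative values only).
speech_grad = [1.0, 0.0, -1.0]
singing_conflicting = [-1.0, 0.0, 1.0]  # points the opposite way
singing_aligned = [2.0, 0.0, -2.0]      # points the same way

conflict = cosine_similarity(speech_grad, singing_conflicting)
agreement = cosine_similarity(speech_grad, singing_aligned)
```

A value near -1 means the two tasks pull the shared weights in opposite directions, while a value near +1 means an update for one task also helps the other.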
Method
UniVoice uses conditional flow matching with a Diffusion Transformer (DiT) backbone. It factorizes the input conditions into content, melody, and timbre, and substitutes a learned null melody token when the input is speech. The conditions modulate the shared backbone through adaptive layer normalization (AdaLN).
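The pieces named in this section can be sketched in plain Python. This is an illustrative toy under stated assumptions, not the authors' implementation: the embedding width, the pitch featurization, combining the three streams by summation, the elementwise AdaLN projection, and the straight-line flow path are all assumptions chosen for brevity, and every name below is hypothetical.

```python
import math

DIM = 4  # hypothetical embedding width

# Stand-in for the learned null melody token (fixed values for illustration).
NULL_MELODY = [0.1, -0.2, 0.05, 0.0]


def melody_embedding(midi_notes):
    """Null token when no notes are given (speech), else a toy pitch embedding."""
    if not midi_notes:
        return list(NULL_MELODY)
    mean_pitch = sum(midi_notes) / len(midi_notes)
    return [math.sin(mean_pitch / 128.0 * (i + 1)) for i in range(DIM)]


def build_condition(content, melody, timbre):
    """Combine the three factorized streams; summation is an assumption."""
    return [c + m + t for c, m, t in zip(content, melody, timbre)]


def adaln(x, cond, w_scale, w_shift):
    """Adaptive LayerNorm: normalize x, then scale and shift it from cond."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    normed = [(v - mean) / math.sqrt(var + 1e-5) for v in x]
    scale = [w * c for w, c in zip(w_scale, cond)]  # toy elementwise projection
    shift = [w * c for w, c in zip(w_shift, cond)]
    return [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]


def cfm_pair(x0, x1, t):
    """Straight-line path: the point x_t and its regression target x1 - x0."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


content = [0.5, 0.1, -0.3, 0.2]
timbre = [0.0, 0.4, 0.1, -0.1]
speech_cond = build_condition(content, melody_embedding([]), timbre)
singing_cond = build_condition(content, melody_embedding([60, 62, 64]), timbre)

# Noise sample x0, data sample x1, and a time t in [0, 1].
x_t, target = cfm_pair([0.0, 0.0], [2.0, 4.0], 0.5)
```

Speech and singing pass through the same `build_condition` and `adaln` code; only the melody stream differs, which is the point of the null token design. During training, a network would take `x_t`, `t`, and the condition, and regress toward `target`.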
In practice
- Achieve consistent vocal identity across speech and singing.
- Perform zero-shot voice cloning for both modalities.
- Evaluate unified models with the UniSinging-Eval benchmark.
Topics
- Speech Synthesis
- Singing Voice Synthesis
- Conditional Flow Matching
- Diffusion Transformer
- Factorized Conditioning
- Zero-shot Voice Cloning
- UniSinging-Eval
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.