UniVoice: A Unified Model for Speech and Singing Voice Generation

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation) · Depth: Expert, extended

Summary

UniVoice is a unified speech and singing voice generation framework based on conditional flow matching and a Diffusion Transformer (DiT) backbone. Developed by Giant Network and the Shanghai Conservatory of Music, it targets a single model that generates both natural speech and controllable singing from symbolic inputs. The model factorizes conditioning into content, melody, and timbre. For speech, a learned null melody token takes the place of the melody input, so the model infers prosody on its own; for singing, MIDI note sequences control the melody explicitly. Trained on 30k hours of speech and 35k hours of singing data, the 0.3B-parameter model reaches a speech PER of 5.26%, comparable to F5-TTS (5.21%), and a singing PER of 16.22%, well below Vevo1.5 (24.72%; lower is better). UniVoice also introduces UniSinging-Eval, a benchmark covering 12 musical styles.

Key takeaway

For AI scientists and ML engineers working on voice synthesis, UniVoice shows that one model can cover both speech and singing. Factorized conditioning and the null melody token let the same 0.3B-parameter backbone handle speech, where prosody is inferred, and singing, where MIDI specifies the melody. The reported result is speech quality on par with F5-TTS (PER 5.26% vs. 5.21%) and a lower singing PER than Vevo1.5 (16.22% vs. 24.72%). The architecture is worth considering for systems that need a consistent vocal identity across both modalities without maintaining two separate models.

Key insights

UniVoice unifies speech and singing generation by factorizing conditioning and using a null melody token for speech.
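The null melody token idea can be illustrated with a minimal sketch. It is not the authors' code: the embedding width, the sinusoidal pitch embedding, the frame-level alignment, and the names melody_condition and embed_midi_note are all assumptions made for illustration. The only point it shows is that speech and singing share one melody input slot, filled either by a single shared vector or by MIDI-derived vectors.

```python
import math
import random

MELODY_DIM = 4  # hypothetical embedding width, chosen only for illustration

# Stand-in for a learned parameter: in a real model this vector would be
# trained with the rest of the network rather than drawn once at random.
_rng = random.Random(0)
NULL_MELODY = [_rng.gauss(0.0, 0.02) for _ in range(MELODY_DIM)]


def embed_midi_note(pitch):
    """Toy MIDI-pitch embedding (hypothetical): fixed sinusoidal features."""
    return [math.sin(pitch * (i + 1) / 12.0) for i in range(MELODY_DIM)]


def melody_condition(num_frames, midi_pitches=None):
    """Return one melody vector per frame.

    Speech: no MIDI is given, so every frame receives the shared null
    melody token and the model is left to infer prosody.
    Singing: each frame receives the embedding of its MIDI pitch.
    """
    if midi_pitches is None:
        return [list(NULL_MELODY) for _ in range(num_frames)]
    if len(midi_pitches) != num_frames:
        raise ValueError("expected one MIDI pitch per frame")
    return [embed_midi_note(p) for p in midi_pitches]


speech_melody = melody_condition(3)                  # null token on every frame
singing_melody = melody_condition(3, [60, 62, 64])   # C4, D4, E4
```

Because both branches return vectors of the same shape, the downstream backbone does not need a separate code path for speech and singing.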

Method

UniVoice uses conditional flow matching with a Diffusion Transformer (DiT) backbone. It factorizes the input conditions into content, melody, and timbre, substitutes a learned null melody token for the melody input when generating speech, and modulates the shared backbone through adaptive layer normalization (AdaLN).
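The two mechanisms named above can be sketched in a few lines of plain Python. This is a hedged illustration, not the paper's implementation: the summary does not state which probability path UniVoice uses, so the sketch assumes the common linear path between noise and data, and it uses the DiT-style form of AdaLN in which a scale and shift modulate a normalized hidden vector. In a real model the scale and shift would be predicted from the conditioning signals; here they are passed in directly, and all function names are hypothetical.

```python
import math
import random


def cfm_training_pair(x1, t, rng):
    """Build (x_t, target_velocity) for one sample on the linear path.

    x1 is a data vector (for example one frame of acoustic features),
    x0 is Gaussian noise, and x_t = (1 - t) * x0 + t * x1, so the
    velocity the network should regress is x1 - x0.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    return x_t, target


def cfm_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocity."""
    diffs = [(p - q) ** 2 for p, q in zip(predicted_velocity, target_velocity)]
    return sum(diffs) / len(diffs)


def layer_norm(h, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(h) / len(h)
    var = sum((v - mean) ** 2 for v in h) / len(h)
    return [(v - mean) / math.sqrt(var + eps) for v in h]


def adaln(h, scale, shift):
    """Adaptive LayerNorm: normalize h, then apply a scale and shift
    that a real model would derive from the conditioning signals."""
    return [n * (1.0 + s) + b for n, s, b in zip(layer_norm(h), scale, shift)]


rng = random.Random(0)
x_t, target = cfm_training_pair([1.0, 2.0, 3.0], t=0.5, rng=rng)
modulated = adaln(x_t, scale=[0.1, 0.1, 0.1], shift=[0.0, 0.0, 0.0])
```

During training, a network would receive x_t, t, and the conditioning, and cfm_loss would compare its output against target. At inference, the learned velocity field is integrated from noise at t = 0 to a sample at t = 1.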

Topics

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.