UniVoice: A Unified Model for Speech and Singing Voice Generation

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

UniVoice is a unified framework for speech and singing voice generation, built on conditional flow matching with a Diffusion Transformer (DiT) backbone. Developed by Giant Network and the Shanghai Conservatory of Music, it targets a single model that generates both natural speech and controllable singing from symbolic inputs. The model factorizes conditioning into content, melody, and timbre. For speech, a learned null melody token stands in for the missing melody so that the model infers prosody itself; for singing, MIDI note sequences control the melody explicitly. Trained on 30k hours of speech and 35k hours of singing data, the 0.3B-parameter model achieves a speech PER of 5.26%, comparable to F5-TTS (5.21%), and a singing PER of 16.22%, outperforming Vevo1.5 (24.72%). UniVoice also introduces UniSinging-Eval, a benchmark covering 12 musical styles.
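To put the quoted error rates side by side, the short calculation below converts them into relative differences. It uses only the four PER figures stated in the summary; the helper name `relative_change` is made up for this illustration.

```python
# Error rates quoted in the summary above, in percent.
speech_per = {"UniVoice": 5.26, "F5-TTS": 5.21}
singing_per = {"UniVoice": 16.22, "Vevo1.5": 24.72}


def relative_change(ours: float, baseline: float) -> float:
    """Relative change of `ours` versus `baseline`, in percent (negative = lower error)."""
    return (ours - baseline) / baseline * 100.0


speech_delta = relative_change(speech_per["UniVoice"], speech_per["F5-TTS"])
singing_delta = relative_change(singing_per["UniVoice"], singing_per["Vevo1.5"])

print(f"speech:  {speech_delta:+.1f}% relative to F5-TTS")
print(f"singing: {singing_delta:+.1f}% relative to Vevo1.5")
```

On these figures the speech gap is 0.05 percentage points (about 1% relative), while the singing error is 8.50 points lower than Vevo1.5, roughly a one-third relative reduction.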

Key takeaway

For AI scientists and ML engineers building voice synthesis systems, UniVoice shows that a single model can cover both speech and singing. Its factorized conditioning and null melody token reconcile the two tasks' differing conditioning needs (speech supplies no melody, singing requires one), allowing one 0.3B-parameter model to reach speech quality comparable to F5-TTS and a lower singing error rate than Vevo1.5. Consider this architecture when a system must keep a consistent vocal identity across speech and singing: one shared model takes the place of separate speech and singing models, which reduces overall model complexity.

Key insights

UniVoice unifies speech and singing generation by factorizing conditioning and using a null melody token for speech.
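A minimal sketch of that idea, under stated assumptions: the three factors are kept as separate streams, and the melody stream is filled either from MIDI notes (singing) or from one shared null vector (speech). The embedding width, the random stand-in vectors, and the names `melody_stream` and `build_condition` are all hypothetical; the source does not describe how UniVoice implements this.

```python
import random

DIM = 4  # toy embedding width (hypothetical)
rng = random.Random(0)


def rand_vec():
    """Random stand-in for a trained embedding vector."""
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


# Hypothetical embedding tables; in a real model these would be trained parameters.
PITCH_EMB = {pitch: rand_vec() for pitch in range(128)}  # one vector per MIDI pitch
NULL_MELODY = rand_vec()  # single learned vector meaning "no melody given"


def melody_stream(num_frames, midi_notes=None):
    """Return one melody vector per frame.

    midi_notes is a list of (pitch, n_frames) pairs for singing, or None for speech.
    """
    if midi_notes is None:
        # Speech: every frame carries the same null melody token.
        return [NULL_MELODY] * num_frames
    frames = []
    for pitch, n_frames in midi_notes:
        frames.extend([PITCH_EMB[pitch]] * n_frames)
    if len(frames) != num_frames:
        raise ValueError("note durations must add up to num_frames")
    return frames


def build_condition(content, melody, timbre):
    """Keep content, melody, and timbre as separate factors rather than one fused vector."""
    if len(content) != len(melody):
        raise ValueError("content and melody must have one vector per frame")
    return {"content": content, "melody": melody, "timbre": timbre}


NUM_FRAMES = 6
content = [rand_vec() for _ in range(NUM_FRAMES)]  # stand-in for encoded text or lyrics
timbre = rand_vec()  # stand-in for a speaker embedding

speech_cond = build_condition(content, melody_stream(NUM_FRAMES), timbre)
singing_cond = build_condition(
    content, melody_stream(NUM_FRAMES, [(60, 2), (62, 4)]), timbre
)
```

Because only the melody stream differs between the two calls, the same timbre vector can be reused for speech and singing, which is the property the summary ties to consistent vocal identity.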

Principles

Factorized conditioning reduces negative gradient correlation.
A learned null token gives melody-absent (speech) input its own optimized representation.
Shared backbones enable positive transfer across modalities.
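The first principle refers to gradients from the two tasks pulling shared parameters in opposing directions. One common way to quantify this, used here as an assumption since the source does not say how the correlation is measured, is the cosine similarity between the per-task gradient vectors: a negative value means an update that helps one task hurts the other. The gradient values below are invented toy numbers.

```python
import math


def cosine_similarity(g_a, g_b):
    """Cosine of the angle between two gradient vectors of equal length."""
    dot = sum(a * b for a, b in zip(g_a, g_b))
    norm_a = math.sqrt(sum(a * a for a in g_a))
    norm_b = math.sqrt(sum(b * b for b in g_b))
    return dot / (norm_a * norm_b)


# Toy gradients on the same shared parameters (made-up values).
g_speech = [0.5, -0.2, 0.1]
g_singing_conflicting = [-0.4, 0.3, 0.0]  # points away from the speech gradient
g_singing_aligned = [0.4, -0.1, 0.2]      # points the same way as the speech gradient

conflict = cosine_similarity(g_speech, g_singing_conflicting)
aligned = cosine_similarity(g_speech, g_singing_aligned)

print(f"conflicting tasks: {conflict:+.2f}")
print(f"aligned tasks:     {aligned:+.2f}")
```

Under this reading, "reducing negative gradient correlation" means moving the shared parameters' task gradients from the first case toward the second.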

Method

UniVoice uses conditional flow matching with a Diffusion Transformer (DiT) backbone. It factorizes the input conditions into content, melody, and timbre, and employs a learned null melody token for speech. The conditioning modulates the shared backbone through adaptive layer normalization (AdaLN).
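The two building blocks named above can be sketched in a few lines. This is an illustration under assumptions, not the paper's implementation: it uses the linear interpolation path that is common in flow matching (the source does not state which path UniVoice uses), and the DiT-style AdaLN form `norm(x) * (1 + scale) + shift`. All function names and numbers are hypothetical, and `scale` and `shift` are passed in directly where a real model would predict them from the conditioning.

```python
import math
import random


def cfm_training_pair(x0, x1, t):
    """Linear path from noise x0 to data x1: returns the point x_t and its target velocity."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_v = [b - a for a, b in zip(x0, x1)]  # constant velocity along a straight path
    return x_t, target_v


def mse(pred, target):
    """Mean squared error, the regression loss applied to the predicted velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaln(x, scale, shift):
    """Adaptive layer norm: normalize, then apply condition-dependent scale and shift."""
    return [n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]


rng = random.Random(0)
x1 = [0.3, -1.2, 0.8, 0.1]                  # stand-in for one frame of acoustic features
x0 = [rng.gauss(0.0, 1.0) for _ in x1]      # Gaussian noise sample
t = 0.25                                    # flow time in [0, 1]

x_t, target_v = cfm_training_pair(x0, x1, t)

# A real model would predict the velocity from (x_t, t, conditions).
# A deliberately wrong all-zero prediction shows a non-zero loss.
loss = mse([0.0] * len(x1), target_v)

# Hypothetical modulation values; zeros would reduce adaln to plain layer norm.
hidden = adaln(x_t, scale=[0.1, 0.0, -0.2, 0.3], shift=[0.0, 0.5, 0.0, -0.1])

print(f"loss for a zero prediction: {loss:.3f}")
print("modulated hidden state:", [round(h, 3) for h in hidden])
```

At inference time, a model trained this way would integrate the predicted velocity from `t = 0` (noise) to `t = 1` (data) with an ODE solver; that step is omitted here.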

In practice

Achieve consistent vocal identity across speech and singing.
Perform zero-shot voice cloning for both modalities.
Evaluate unified models with the UniSinging-Eval benchmark.

Topics

Speech Synthesis
Singing Voice Synthesis
Conditional Flow Matching
Diffusion Transformer
Factorized Conditioning
Zero-shot Voice Cloning
UniSinging-Eval

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.