UniVoice: A Unified Model for Speech and Singing Voice Generation
Summary
UniVoice is a unified model for both speech and singing voice generation, covering the distinct requirements of text-to-speech (TTS) and singing voice synthesis (SVS). Traditional models struggle with the mismatch between speech's flexible, language-driven prosody and singing's need for explicit melody control and rhythmic alignment. UniVoice addresses this with a conditional flow matching framework that factorizes the input conditions into content, melody, and timbre. Each condition is processed by a modality-appropriate encoder and fed into a shared Diffusion Transformer (DiT) backbone. A key innovation is a learned null melody token for speech, which lets the model infer prosody from context, while MIDI note sequences provide precise melody control for singing. Trained on 30k hours of speech and 35k hours of singing data, UniVoice achieves a speech PER of 5.26% (lower is better), comparable to dedicated TTS systems such as F5-TTS (5.21%) and CosyVoice3 (5.30%). For singing generation, it achieves a PER of 16.22%, outperforming the unified baseline Vevo1.5 (24.72%).
Key takeaway
For machine learning engineers building unified voice generation systems, UniVoice shows that a single model can handle the differing requirements of speech and singing. Consider factorizing the conditions and using a null melody token to get both flexible speech prosody and precise singing control from one model. This removes the need to maintain separate TTS and SVS pipelines while keeping performance competitive in both domains.
Key insights
UniVoice unifies speech and singing voice generation through conditional factorization and a null melody token for speech prosody inference.
Principles
- Condition factorization enables unified vocal synthesis.
- A learned null token can stand in for a condition a modality lacks, leaving the model to infer it from context.
- Shared Diffusion Transformers support diverse vocal tasks.
Method
UniVoice employs conditional flow matching, factorizing conditions into content, melody, and timbre. Modality-specific encoders feed a shared Diffusion Transformer (DiT) backbone. Speech uses a learned null melody token, while singing uses MIDI note sequences.
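The condition factorization described above can be sketched in a few lines of plain Python. This is a minimal illustration, not UniVoice's implementation: the names `MelodyConditioner` and `build_condition`, the embedding width `DIM`, the one-note-per-frame layout, and the concatenation of the three conditions are all assumptions made for the example, and the random vectors stand in for parameters that a real model would learn.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional

DIM = 4  # hypothetical embedding width, chosen only for illustration

Vector = List[float]


def random_vector(rng: random.Random, dim: int = DIM) -> Vector:
    """Stand-in for a learned embedding."""
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


@dataclass
class MelodyConditioner:
    """Maps MIDI notes to vectors, or emits one shared null vector for speech."""
    seed: int = 0
    note_table: List[Vector] = field(init=False)
    null_token: Vector = field(init=False)

    def __post_init__(self) -> None:
        rng = random.Random(self.seed)
        # One row per MIDI pitch (0..127); a real model would train these.
        self.note_table = [random_vector(rng) for _ in range(128)]
        # Stand-in for the learned null melody token used for speech.
        self.null_token = random_vector(rng)

    def encode(self, midi_notes: Optional[List[int]], num_frames: int) -> List[Vector]:
        if midi_notes is None:
            # Speech: no explicit melody, so every frame gets the null token.
            return [list(self.null_token) for _ in range(num_frames)]
        if len(midi_notes) != num_frames:
            raise ValueError("expected one MIDI note per frame")
        # Singing: look up one embedding per frame-level MIDI note.
        return [list(self.note_table[n]) for n in midi_notes]


def build_condition(content: List[Vector], melody: List[Vector], timbre: Vector) -> List[Vector]:
    """Concatenate per-frame content and melody with an utterance-level timbre vector."""
    return [c + m + list(timbre) for c, m in zip(content, melody)]


# Usage: the same content and timbre, conditioned as speech and as singing.
rng = random.Random(1)
frames = 3
content = [random_vector(rng) for _ in range(frames)]
timbre = random_vector(rng)
conditioner = MelodyConditioner()

speech_cond = build_condition(content, conditioner.encode(None, frames), timbre)
singing_cond = build_condition(content, conditioner.encode([60, 62, 64], frames), timbre)
```

In this sketch the backbone would see the same tensor layout for both modalities; only the melody slice differs, which is the property that lets a single shared network serve speech and singing.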
In practice
- Factorize conditions for multi-modal generative models.
- Implement null tokens for flexible condition marginalization.
- Apply Diffusion Transformers to unified audio generation.
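For readers applying these points, the training objective behind conditional flow matching can be sketched as follows. The summary does not state which probability path UniVoice uses, so this example assumes the common straight-line (optimal-transport) path, where the network regresses the constant velocity from noise to data. The function names and the `model(x_t, t, cond)` signature are hypothetical, and the stand-in model below is not a DiT.

```python
import random
from typing import Callable, List, Optional, Sequence

Vector = List[float]
# Hypothetical signature: (noisy sample, time in [0, 1], condition) -> predicted velocity
VelocityModel = Callable[[Vector, float, Optional[Sequence[float]]], Vector]


def interpolate(x0: Vector, x1: Vector, t: float) -> Vector:
    """Point on the straight line from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: Vector, x1: Vector) -> Vector:
    """Velocity of the straight-line path; it is constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(
    model: VelocityModel,
    x1: Vector,
    cond: Optional[Sequence[float]],
    rng: random.Random,
) -> float:
    """Single-sample flow matching loss under the assumed linear path."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
    t = rng.random()                         # time drawn uniformly from [0, 1)
    x_t = interpolate(x0, x1, t)
    predicted = model(x_t, t, cond)
    target = target_velocity(x0, x1)
    # Mean squared error between predicted and target velocity.
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(x1)


def zero_model(x_t: Vector, t: float, cond: Optional[Sequence[float]]) -> Vector:
    """Untrained stand-in that always predicts zero velocity."""
    return [0.0] * len(x_t)


# Usage: one loss evaluation on a toy three-dimensional "acoustic frame".
loss = flow_matching_loss(zero_model, [0.5, -0.2, 0.1], cond=None, rng=random.Random(0))
```

In a full system the `cond` argument would carry the factorized content, melody, and timbre conditions, and the loss would be averaged over a batch and minimized with a gradient-based optimizer.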
Topics
- UniVoice
- Speech Generation
- Singing Voice Synthesis
- Conditional Flow Matching
- Diffusion Transformers
- Null Melody Token
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.