UniVoice: A Unified Model for Speech and Singing Voice Generation

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Speech & Audio Generation · Depth: Expert, quick

Summary

UniVoice is a unified model for both speech and singing voice generation that addresses the distinct requirements of text-to-speech (TTS) and singing voice synthesis (SVS). A single model has to reconcile speech's flexible, language-driven prosody with singing's need for explicit melody control and rhythmic alignment, a mismatch that traditional models struggle with. UniVoice handles this with a conditional flow matching framework that factorizes the input conditions into content, melody, and timbre. Each condition is processed by a modality-appropriate encoder and fed into a shared Diffusion Transformer (DiT) backbone. For speech, a learned null melody token stands in for the melody condition, so the model infers prosody from context; for singing, MIDI note sequences provide precise melody control. Trained on 30k hours of speech and 35k hours of singing data, UniVoice reaches a speech PER of 5.26%, comparable to dedicated TTS systems such as F5-TTS (5.21%) and CosyVoice3 (5.30%). On singing generation it reaches a PER of 16.22%, outperforming the unified baseline Vevo1.5 (24.72%).

Key takeaway

For Machine Learning Engineers developing unified voice generation systems, UniVoice demonstrates a robust approach to handling disparate modality requirements. You should consider implementing conditional factorization and null melody tokens to achieve both flexible speech prosody and precise singing control within a single model. This method allows you to avoid maintaining separate TTS and SVS pipelines, streamlining development and deployment while achieving competitive performance across both domains.
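
To make the null melody token idea concrete, here is a minimal sketch of how factorized conditions might be assembled for the two modalities. It is an illustration under stated assumptions, not UniVoice's implementation: the names `Conditions`, `build_conditions`, `NULL_MELODY_ID`, and `MIDI_OFFSET` are hypothetical, the null token is represented as a reserved ID (a real model would map it to a trainable embedding), and the frame-level melody layout is an assumption the source does not specify.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence

# Hypothetical reserved ID standing in for the learned null melody token.
# In a real model this ID would index a trainable embedding.
NULL_MELODY_ID = 0
# Hypothetical offset so MIDI pitches 0..127 never collide with the null ID.
MIDI_OFFSET = 1


@dataclass
class Conditions:
    content: List[str]    # phoneme or text tokens
    melody: List[int]     # one melody ID per output frame
    timbre: List[float]   # speaker or singer embedding


def build_conditions(
    content: Sequence[str],
    timbre: Sequence[float],
    num_frames: int,
    midi_per_frame: Optional[Sequence[int]] = None,
) -> Conditions:
    """Assemble the three condition streams for speech or singing."""
    if midi_per_frame is None:
        # Speech: no explicit melody, so every frame carries the null token
        # and prosody is left for the model to infer from context.
        melody = [NULL_MELODY_ID] * num_frames
    else:
        # Singing: explicit melody control from MIDI note numbers.
        if len(midi_per_frame) != num_frames:
            raise ValueError("midi_per_frame must have num_frames entries")
        if any(not 0 <= pitch <= 127 for pitch in midi_per_frame):
            raise ValueError("MIDI note numbers must be in 0..127")
        melody = [pitch + MIDI_OFFSET for pitch in midi_per_frame]
    return Conditions(list(content), melody, list(timbre))


# Usage: the same function serves both modalities.
speech = build_conditions(["HH", "AH"], [0.1, 0.2], num_frames=4)
singing = build_conditions(["L", "AA"], [0.1, 0.2], num_frames=3,
                           midi_per_frame=[60, 60, 62])
```

Because both calls return the same structure, a shared backbone can consume either one without a modality switch; only the melody stream differs.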

Key insights

UniVoice unifies speech and singing voice generation through conditional factorization and a null melody token for speech prosody inference.

Method

UniVoice employs conditional flow matching, factorizing conditions into content, melody, and timbre. Modality-specific encoders feed a shared Diffusion Transformer (DiT) backbone. Speech uses a learned null melody token, while singing uses MIDI note sequences.
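
The flow matching part of the method can be sketched as follows. This is a minimal illustration assuming the common linear (optimal-transport) probability path, where a noisy sample is interpolated between Gaussian noise and the data and the network regresses the constant velocity between them; the source does not state which path UniVoice uses. The function names `cfm_training_pair` and `mse` are hypothetical, and plain Python lists stand in for acoustic feature tensors.

```python
import random
from typing import List, Tuple


def cfm_training_pair(
    x1: List[float], t: float, rng: random.Random
) -> Tuple[List[float], List[float]]:
    """Build one flow matching training example for a data sample x1.

    Assumes the linear path x_t = (1 - t) * x0 + t * x1 with x0 ~ N(0, I),
    whose target velocity is x1 - x0 at every t.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]                  # noise sample
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]   # point on the path
    v_target = [b - a for a, b in zip(x0, x1)]              # velocity to regress
    return x_t, v_target


def mse(pred: List[float], target: List[float]) -> float:
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


# Usage: in training, a network would receive x_t, t, and the content,
# melody, and timbre conditions, and its output would be scored with mse.
rng = random.Random(0)
x1 = [1.0, -2.0, 0.5]          # stand-in for one frame of acoustic features
x_t, v_target = cfm_training_pair(x1, t=0.3, rng=rng)
```

Under this path, following the target velocity from `x_t` for the remaining time `1 - t` lands exactly on the data sample, which is what lets a sampler integrate from noise to audio features at inference.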

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.