UniSAE: Unified Speech Attribute Editing on Speaker, Emotion and Low-Level Content via Discrete Phonetic Posteriorgram Modelling
Summary
UniSAE is a novel unified speech attribute editing framework designed to modify specific portions of an utterance while preserving the rest. Unlike existing approaches that treat content, speaker, and emotion editing as separate tasks, UniSAE supports composable editing of these attributes from sub-phoneme to word level within a single architecture. It introduces a Discrete Phonetic PosteriorGram (DPPG) representation, which factorizes speech content into discrete tokens encoding phoneme identity, pronunciation variants, and duration, enabling direct phoneme- and sub-phoneme-level editing. For higher-level modifications, an autoregressive content transformer predicts edited DPPG sequences for word-level content editing. The framework then uses a diffusion-based acoustic decoder, conditioned on disentangled speaker and emotion representations, to render the edited sequences into speech. Experimental results demonstrate UniSAE's capability for precise speaker and emotion control, content editing at multiple granularities, and joint modification of all three attributes.
Key takeaway
For audio and speech processing engineers building speech synthesis or editing systems, UniSAE replaces separate task-specific models with a single framework. It lets you control speaker identity, emotional tone, and content from sub-phoneme to word level, and combine those edits in one pass. Consider discrete phonetic posteriorgram modeling if your application needs flexible, composable speech attribute modifications.
Key insights
UniSAE unifies speaker, emotion, and content editing from sub-phoneme to word level using Discrete Phonetic PosteriorGram (DPPG) modeling.
Principles
- Factorize speech content into discrete tokens.
- Unify attribute editing within one architecture.
- Use disentangled representations for control.
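The first principle can be pictured with a small data structure. The sketch below is illustrative only: the summary says a DPPG token encodes phoneme identity, pronunciation variant, and duration, but the class `DPPGToken`, its field types, the frame-based duration unit, and the helper `edit_token` are all assumptions made for this example, not the paper's actual encoding.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class DPPGToken:
    """Hypothetical DPPG token; field names and types are assumptions."""
    phoneme: str   # phoneme identity, e.g. "AE"
    variant: int   # index of a pronunciation variant (assumed integer code)
    duration: int  # duration in frames (assumed unit)


def edit_token(
    tokens: List[DPPGToken],
    index: int,
    phoneme: Optional[str] = None,
    variant: Optional[int] = None,
    duration: Optional[int] = None,
) -> List[DPPGToken]:
    """Return a new sequence with one token changed; all others are kept as-is."""
    old = tokens[index]
    new = replace(
        old,
        phoneme=old.phoneme if phoneme is None else phoneme,
        variant=old.variant if variant is None else variant,
        duration=old.duration if duration is None else duration,
    )
    return tokens[:index] + [new] + tokens[index + 1:]


# A three-token toy utterance; the values are made up for illustration.
utterance = [DPPGToken("K", 0, 5), DPPGToken("AE", 0, 9), DPPGToken("T", 0, 6)]

# Phoneme-level edit: swap the vowel and shorten it, leaving neighbours untouched.
edited = edit_token(utterance, 1, phoneme="AH", duration=7)
```

Because each attribute of the content is a separate discrete field, an edit touches only the field it targets, which is the property that makes phoneme- and sub-phoneme-level edits direct.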
Method
UniSAE edits DPPG tokens directly for phoneme- and sub-phoneme-level changes, uses an autoregressive content transformer to predict edited DPPG sequences for word-level changes, and renders the result with a diffusion-based acoustic decoder conditioned on disentangled speaker and emotion representations.
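The routing described above can be sketched as a small planner. This is a rough sketch, not UniSAE's interface: `EditRequest`, `plan_stages`, and the stage labels are hypothetical names, and falling back to the source speaker and emotion when no target is given is an assumption drawn from the summary's statement that unedited parts are preserved.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EditRequest:
    """Hypothetical edit request; every field name here is an assumption."""
    content_level: Optional[str] = None   # "sub-phoneme", "phoneme", "word", or None
    target_speaker: Optional[str] = None  # None keeps the source speaker (assumed)
    target_emotion: Optional[str] = None  # None keeps the source emotion (assumed)


def plan_stages(req: EditRequest) -> List[str]:
    """List which components a request would pass through, in order."""
    stages: List[str] = []
    if req.content_level in ("sub-phoneme", "phoneme"):
        # Fine-grained edits operate on the DPPG tokens themselves.
        stages.append("direct_dppg_edit")
    elif req.content_level == "word":
        # Word-level edits need a newly predicted DPPG sequence.
        stages.append("content_transformer")
    elif req.content_level is not None:
        raise ValueError(f"unknown content level: {req.content_level!r}")

    # The decoder always runs, conditioned on speaker and emotion.
    speaker = req.target_speaker or "source"
    emotion = req.target_emotion or "source"
    stages.append(f"diffusion_decoder[speaker={speaker}, emotion={emotion}]")
    return stages


# Joint edit: replace a word and change the emotion, keeping the speaker.
joint = plan_stages(EditRequest(content_level="word", target_emotion="happy"))
```

The point of the sketch is the composition: content edits decide how the DPPG sequence is produced, while speaker and emotion edits only change the decoder's conditioning, so the three can be combined freely.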
In practice
- Edit speaker identity in speech.
- Modify emotional tone of utterances.
- Adjust specific phonemes or words.
Topics
- Speech Attribute Editing
- Discrete Phonetic Posteriorgram
- Speaker Control
- Emotion Control
- Content Editing
- Diffusion Models
- Autoregressive Transformers
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.