UniSAE: Unified Speech Attribute Editing on Speaker, Emotion and Low-Level Content via Discrete Phonetic Posteriorgram Modelling

Source: Artificial Intelligence · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Speech Processing · Depth: Expert, quick

Summary

UniSAE is a unified speech attribute editing framework that modifies specific portions of an utterance while preserving the rest. Unlike existing approaches that treat content, speaker, and emotion editing as separate tasks, UniSAE supports composable editing of these attributes from sub-phoneme to word level within a single architecture. It introduces a Discrete Phonetic PosteriorGram (DPPG) representation, which factorizes speech content into discrete tokens encoding phoneme identity, pronunciation variants, and duration, so that phoneme- and sub-phoneme-level edits can be made directly on the tokens. For word-level content editing, an autoregressive content transformer predicts the edited DPPG sequence. A diffusion-based acoustic decoder, conditioned on disentangled speaker and emotion representations, then renders the edited sequence into speech. The reported experiments show precise speaker and emotion control, content editing at multiple granularities, and joint modification of all three attributes.
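
To make the idea of discrete tokens carrying identity and duration concrete, here is a minimal Python sketch. It is not the paper's DPPG construction, which this summary does not specify. It assumes the simplest possible discretization: take the most probable phoneme in each frame of a posteriorgram, then merge consecutive identical decisions so that the run length becomes the duration. The names `PhoneToken` and `discretize`, the three-phoneme inventory, and the toy posteriors are all hypothetical, and pronunciation variants are left out.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class PhoneToken:
    phoneme: str  # phoneme identity
    frames: int   # duration, counted in frames


def discretize(posteriors, phoneme_set):
    """Collapse a frame-level posteriorgram into (phoneme, duration) tokens.

    Assumption for illustration only: a hard argmax per frame followed by
    run-length encoding. The actual DPPG tokenizer may differ.
    """
    # Hard decision per frame: index of the most probable phoneme.
    best = [max(range(len(row)), key=row.__getitem__) for row in posteriors]
    # Merge consecutive identical decisions; the run length is the duration.
    return [
        PhoneToken(phoneme_set[idx], sum(1 for _ in run))
        for idx, run in groupby(best)
    ]


# Toy input: 6 frames over a hypothetical 3-phoneme inventory.
phoneme_set = ["k", "ae", "t"]
posteriors = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.6, 0.3],
    [0.1, 0.2, 0.7],
]
tokens = discretize(posteriors, phoneme_set)
```

Once content is held in this form, identity and duration are separate fields of each token, which is what makes it possible to edit one without touching the other.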

Key takeaway

For audio and speech processing engineers building speech synthesis or editing systems, UniSAE replaces separate task-specific models with a single framework. It lets you control speaker identity, emotional tone, and content from sub-phoneme to word level, and to combine those edits in one pass. Consider discrete phonetic posteriorgram modeling if your application needs flexible, composable speech attribute modification.

Key insights

UniSAE unifies speaker, emotion, and content editing from sub-phoneme to word level using Discrete Phonetic PosteriorGram (DPPG) modeling.

Method

UniSAE uses DPPG for sub-phoneme editing, an autoregressive content transformer for word-level DPPG sequences, and a diffusion-based acoustic decoder conditioned on disentangled speaker/emotion representations.
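
The first stage above, direct editing of the token sequence, can be illustrated with a short Python sketch. It assumes each token is a plain record of phoneme, variant index, and frame count; `Token`, `splice`, and `stretch` are hypothetical names, not UniSAE's API. The word-level content transformer and the diffusion decoder are not modelled here: in the described pipeline the former would supply the replacement tokens for a word-level edit, and the latter would render the edited sequence with the chosen speaker and emotion conditioning.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Token:
    phoneme: str  # phoneme identity
    variant: int  # pronunciation-variant index (hypothetical encoding)
    frames: int   # duration in frames


def splice(tokens, start, end, new_tokens):
    """Replace tokens[start:end]; everything outside the span is preserved."""
    return tokens[:start] + list(new_tokens) + tokens[end:]


def stretch(tokens, index, factor):
    """Sub-phoneme edit: rescale one token's duration, keeping its identity."""
    tok = tokens[index]
    new_frames = max(1, round(tok.frames * factor))
    return splice(tokens, index, index + 1, [replace(tok, frames=new_frames)])


original = [Token("k", 0, 2), Token("ae", 0, 3), Token("t", 0, 1)]

# Phoneme-level edit: swap the vowel, keep the surrounding consonants.
edited = splice(original, 1, 2, [Token("ah", 0, 3)])

# Duration edit on the new vowel only.
longer = stretch(edited, 1, 2.0)
```

Both helpers return new lists rather than mutating their input, so the unedited sequence stays available for comparison against the edited one.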

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.