AST: Adaptive, Seamless, and Training-Free Precise Speech Editing
Summary
AST, an Adaptive, Seamless, and Training-free framework, enables precise text-based speech editing by leveraging pre-trained autoregressive Text-to-Speech (TTS) models without requiring task-specific training. It addresses the trade-off between editing quality and consistency by introducing Latent Recomposition, which selectively stitches preserved source speech segments with newly synthesized target content in the latent space. To prevent artifacts at edit boundaries, AST incorporates Adaptive Weak Fact Guidance (AWFG), dynamically modulating a mel-space guidance signal to enforce structural constraints only where necessary. The framework also supports precise style editing for specific speech segments. The authors introduce LibriSpeech-Edit, a new public benchmark dataset, and Word-level Dynamic Time Warping (WDTW), a novel metric for evaluating temporal consistency in unedited regions. Experiments show AST improves consistency and reduces Word Error Rate by nearly 70% compared to previous baselines, achieving state-of-the-art speaker preservation and temporal fidelity.
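The summary names WDTW but does not define it, so the following is only a rough illustration of the underlying idea: plain dynamic time warping applied to per-word durations from the unedited region. The `(word, start_ms, end_ms)` span format, the helper names, and the absolute-difference cost are assumptions made for this sketch, not the paper's definition.

```python
from typing import List, Sequence, Tuple

# Hypothetical span format: (word, start_ms, end_ms) from any forced aligner.
WordSpan = Tuple[str, float, float]


def word_durations(spans: Sequence[WordSpan]) -> List[float]:
    """Duration of each word, in the same unit as the timestamps."""
    return [end - start for _, start, end in spans]


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic DTW cost between two 1-D sequences with |x - y| as local cost."""
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # cost[i][j] = cheapest cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] also matched an earlier element of b
                cost[i][j - 1],      # b[j-1] also matched an earlier element of a
                cost[i - 1][j - 1],  # one-to-one step
            )
    return cost[n][m]


# Words outside the edit, timed in the source and in the edited output.
source_words = [("the", 0, 200), ("quick", 200, 520), ("fox", 520, 900)]
edited_words = [("the", 0, 210), ("quick", 210, 520), ("fox", 520, 900)]
score = dtw_distance(word_durations(source_words), word_durations(edited_words))
print(score)  # lower means the unedited words kept their timing more closely
```

A score of zero here means the unedited words have identical durations before and after editing; how the paper normalizes or aggregates such a score is not stated in this summary.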
Key takeaway
For research scientists developing speech editing systems, AST demonstrates that latent-space manipulation in pre-trained AM-FM TTS models (autoregressive modeling combined with flow matching) can reach state-of-the-art speaker preservation and temporal fidelity without costly task-specific training. Consider adapting existing generative models through inversion and adaptive guidance before training a dedicated editor; doing so may reduce development time and data requirements.
Key insights
AST enables training-free, precise speech editing by manipulating latent spaces of pre-trained autoregressive TTS models.
Principles
- Latent recomposition preserves acoustic context.
- Adaptive guidance prevents boundary artifacts.
- Flow-matching models offer continuous latent manipulation.
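To make the third principle concrete, here is a minimal sketch of why a flow-matching model lends itself to inversion: generation integrates an ODE from noise to data, and running the same ODE backward maps data to an approximate latent. The toy velocity field stands in for a learned network, and the convention of noise at t = 0 and data at t = 1 is an assumption; none of this is taken from AST's implementation.

```python
from typing import Callable, List

Vector = List[float]
Velocity = Callable[[Vector, float], Vector]


def euler_integrate(x: Vector, velocity: Velocity,
                    t_start: float, t_end: float, steps: int) -> Vector:
    """Fixed-step Euler integration of dx/dt = velocity(x, t).

    Passing t_end < t_start makes the step size negative, so the same
    routine integrates backward in time (used here to sketch inversion).
    """
    if steps <= 0:
        raise ValueError("steps must be positive")
    h = (t_end - t_start) / steps
    x = list(x)
    t = t_start
    for _ in range(steps):
        v = velocity(x, t)
        x = [xi + h * vi for xi, vi in zip(x, v)]
        t += h
    return x


def toy_velocity(x: Vector, t: float) -> Vector:
    """Hypothetical stand-in for a learned velocity network.

    It pulls every coordinate toward 1.0 and ignores t.
    """
    return [1.0 - xi for xi in x]


noise = [0.0, -0.5]
data = euler_integrate(noise, toy_velocity, 0.0, 1.0, steps=1000)       # generation
recovered = euler_integrate(data, toy_velocity, 1.0, 0.0, steps=1000)   # inversion
max_error = max(abs(r - n) for r, n in zip(recovered, noise))
print(max_error)  # small but non-zero: Euler inversion is only approximate
```

The round trip is not exact because each Euler step carries a discretization error; more steps or a higher-order solver would shrink the gap.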
Method
AST inverts source speech into latent space, performs word-level alignment for Latent Recomposition, and uses Adaptive Weak Fact Guidance (AWFG) during flow-matching generation to synthesize edited speech with preserved context.
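The stitching step described above can be sketched as follows, assuming the source has already been inverted into a sequence of latent frames and a forced aligner has mapped each word to a frame range. The `WordSpan` type, the `recompose` function, and the way new frames are supplied are all hypothetical names chosen for illustration; the summary does not describe AST's actual data structures.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Frame = List[float]


@dataclass
class WordSpan:
    word: str
    start: int  # first latent frame of the word (inclusive)
    end: int    # one past the last latent frame (exclusive)


def recompose(source_latents: Sequence[Frame], spans: Sequence[WordSpan],
              edit_start: int, edit_end: int,
              new_latents: Sequence[Frame]) -> Tuple[List[Frame], List[bool]]:
    """Replace words spans[edit_start:edit_end] with new_latents.

    Returns the stitched latent sequence and a mask that is True for
    frames copied from the source and False for newly supplied frames.
    An empty word range (edit_start == edit_end) acts as an insertion.
    """
    if not 0 <= edit_start <= edit_end <= len(spans):
        raise ValueError("edit range must lie within the aligned words")
    n_frames = len(source_latents)
    lo = spans[edit_start].start if edit_start < len(spans) else n_frames
    hi = spans[edit_end - 1].end if edit_end > edit_start else lo
    prefix = [list(f) for f in source_latents[:lo]]
    suffix = [list(f) for f in source_latents[hi:]]
    fresh = [list(f) for f in new_latents]
    latents = prefix + fresh + suffix
    keep_mask = [True] * len(prefix) + [False] * len(fresh) + [True] * len(suffix)
    return latents, keep_mask


# Six source frames, two per word; replace the middle word with three new frames.
source = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
words = [WordSpan("the", 0, 2), WordSpan("cat", 2, 4), WordSpan("sat", 4, 6)]
stitched, mask = recompose(source, words, 1, 2, [[9.0], [9.0], [9.0]])
print(stitched)
print(mask)
```

Returning the mask alongside the latents is a design choice of this sketch: later stages can use it to tell which frames should stay close to the source and which are free to be generated.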
In practice
- Use AST for precise content and style speech editing.
- Apply AWFG to smooth transitions in generative models.
- Utilize LibriSpeech-Edit for speech editing benchmarks.
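For the second item above, one way to picture guidance that acts "only where necessary" is a per-frame correction that is zero on edited frames and grows with how far a preserved frame has drifted from its reference. The weighting rule, the `max_weight` and `tolerance` parameters, and the function names below are assumptions invented for this sketch; the summary does not give AWFG's actual formula.

```python
import math
from typing import List, Sequence

Frame = List[float]


def adaptive_weights(current: Sequence[Frame], reference: Sequence[Frame],
                     keep_mask: Sequence[bool],
                     max_weight: float = 0.5, tolerance: float = 1.0) -> List[float]:
    """Per-frame guidance weight (hypothetical rule, not the paper's).

    Edited frames get weight 0. Preserved frames get a weight that rises
    linearly with their L2 deviation from the reference and saturates at
    max_weight once the deviation reaches tolerance.
    """
    weights = []
    for cur, ref, keep in zip(current, reference, keep_mask):
        if not keep:
            weights.append(0.0)  # leave edited frames to the generative model
            continue
        err = math.sqrt(sum((c - r) ** 2 for c, r in zip(cur, ref)))
        weights.append(max_weight * min(1.0, err / tolerance))
    return weights


def apply_guidance(current: Sequence[Frame], reference: Sequence[Frame],
                   keep_mask: Sequence[bool], **kwargs: float) -> List[Frame]:
    """Move each frame toward its reference by its adaptive weight."""
    weights = adaptive_weights(current, reference, keep_mask, **kwargs)
    return [
        [c + w * (r - c) for c, r in zip(cur, ref)]
        for cur, ref, w in zip(current, reference, weights)
    ]


# Frame 0 already matches, frame 1 has drifted, frame 2 is newly edited content.
current = [[0.0], [2.0], [5.0]]
reference = [[0.0], [0.0], [0.0]]  # reference values on edited frames are ignored
keep_mask = [True, True, False]
guided = apply_guidance(current, reference, keep_mask)
print(guided)
```

In a real sampler a step like this would be applied inside the generation loop, between solver updates; where exactly AST applies its mel-space signal is not specified in this summary.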
Topics
- Precise Speech Editing
- Latent Recomposition
- Adaptive Weak Fact Guidance
- Training-Free Framework
- LibriSpeech-Edit Dataset
Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.