AST: Adaptive, Seamless, and Training-Free Precise Speech Editing

2026-04-21 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

AST, an Adaptive, Seamless, and Training-free framework, enables precise text-based speech editing by leveraging pre-trained autoregressive Text-to-Speech (TTS) models without requiring task-specific training. It addresses the trade-off between editing quality and consistency by introducing Latent Recomposition, which selectively stitches preserved source speech segments with newly synthesized target content in the latent space. To prevent artifacts at edit boundaries, AST incorporates Adaptive Weak Fact Guidance (AWFG), dynamically modulating a mel-space guidance signal to enforce structural constraints only where necessary. The framework also supports precise style editing for specific speech segments. The authors introduce LibriSpeech-Edit, a new public benchmark dataset, and Word-level Dynamic Time Warping (WDTW), a novel metric for evaluating temporal consistency in unedited regions. Experiments show AST improves consistency and reduces Word Error Rate by nearly 70% compared to previous baselines, achieving state-of-the-art speaker preservation and temporal fidelity.

Key takeaway

For research scientists developing speech editing systems, AST demonstrates that latent-space manipulation of a pre-trained AM-FM TTS model can reach state-of-the-art performance without costly task-specific training. Consider adapting an existing generative model through inversion and adaptive guidance before training a dedicated editor; doing so may reduce development time and data requirements.

Key insights

AST enables training-free, precise speech editing by manipulating latent spaces of pre-trained autoregressive TTS models.

Principles

Latent recomposition preserves acoustic context.
Adaptive guidance prevents boundary artifacts.
Flow-matching models offer continuous latent manipulation.
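
The adaptive-guidance principle can be illustrated with a toy per-frame weighting. The summary only states that AWFG modulates a mel-space guidance signal so that constraints apply where needed; the error-proportional rule below, and the names `adaptive_guidance_weights`, `guided_step`, `base`, `tol`, and `step_size`, are assumptions made for illustration and not the paper's formulation.

```python
import numpy as np

def adaptive_guidance_weights(mel_pred, mel_ref, preserved, base=1.0, tol=0.1):
    """Per-frame weights: zero on edited frames, growing with mismatch on preserved ones.

    mel_pred, mel_ref : (T, n_mels) arrays (hypothetical current estimate and reference)
    preserved         : (T,) boolean mask, True where the source must be kept
    """
    err = np.abs(mel_pred - mel_ref).mean(axis=1)   # mean absolute error per frame
    weights = base * np.clip(err / tol, 0.0, 1.0)   # saturates at `base` once err >= tol
    return np.where(preserved, weights, 0.0)        # edited frames receive no guidance

def guided_step(mel_pred, mel_ref, preserved, step_size=0.5, base=1.0, tol=0.1):
    """Nudge the estimate toward the reference, scaled by the adaptive weights."""
    w = adaptive_guidance_weights(mel_pred, mel_ref, preserved, base=base, tol=tol)
    return mel_pred + step_size * w[:, None] * (mel_ref - mel_pred)
```

In this sketch, preserved frames that already match the reference receive almost no correction, while edited frames are left entirely to the generator.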

Method

AST inverts source speech into latent space, performs word-level alignment for Latent Recomposition, and uses Adaptive Weak Fact Guidance (AWFG) during flow-matching generation to synthesize edited speech with preserved context.
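
A minimal sketch of the stitching step, assuming the inverted latent is a frame-by-feature array and that word-level alignment yields one frame span per source word. The function name `recompose_latent`, the argument names, and the choice to initialise the edited segment with Gaussian noise are hypothetical; the paper's actual procedure may differ.

```python
import numpy as np

def recompose_latent(source_latent, word_spans, edit_range, new_num_frames, rng=None):
    """Stitch preserved source latent frames around a freshly sampled edited segment.

    source_latent  : (T, D) array, hypothetical inverted latent of the source utterance
    word_spans     : list of (start_frame, end_frame) per source word, end exclusive
    edit_range     : (first_word, last_word) indices of the words replaced, inclusive
    new_num_frames : assumed duration, in frames, of the replacement content
    """
    if rng is None:
        rng = np.random.default_rng(0)
    first, last = edit_range
    edit_start = word_spans[first][0]
    edit_end = word_spans[last][1]

    prefix = source_latent[:edit_start]   # preserved context before the edit
    suffix = source_latent[edit_end:]     # preserved context after the edit
    # Assumption: the edited region starts from noise and is synthesised later.
    fresh = rng.standard_normal((new_num_frames, source_latent.shape[1]))

    recomposed = np.concatenate([prefix, fresh, suffix], axis=0)
    preserved = np.ones(len(recomposed), dtype=bool)
    preserved[edit_start:edit_start + new_num_frames] = False
    return recomposed, preserved
```

The returned `preserved` mask marks which frames came from the source, which is the information a boundary-aware guidance step would need.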

In practice

Use AST for precise content and style speech editing.
Apply AWFG to smooth transitions in generative models.
Utilize LibriSpeech-Edit for speech editing benchmarks.
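
When benchmarking, temporal consistency in unedited regions can be scored word by word. The sketch below is one plausible reading of a word-level DTW score: it runs classic dynamic time warping on each matched unedited word and averages the results. The exact WDTW definition is given in the paper; the normalisation by `n + m`, the Euclidean frame cost, and the names `dtw_distance` and `word_level_dtw` are assumptions.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW cost between sequences a (n, D) and b (m, D), normalised by n + m."""
    n, m = len(a), len(b)
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)  # (n, m) frame distances
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return acc[n, m] / (n + m)

def word_level_dtw(src_feats, edit_feats, src_spans, edit_spans):
    """Average DTW distance over matched unedited words.

    src_spans / edit_spans : equal-length lists of (start_frame, end_frame), end
    exclusive, giving each unedited word's position in the source and edited audio.
    """
    scores = [
        dtw_distance(src_feats[s0:s1], edit_feats[e0:e1])
        for (s0, s1), (e0, e1) in zip(src_spans, edit_spans)
    ]
    return float(np.mean(scores))
```

A lower score means the unedited words in the edited audio track their source timing and features more closely.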

Topics

Precise Speech Editing
Latent Recomposition
Adaptive Weak Fact Guidance
Training-Free Framework
LibriSpeech-Edit Dataset

Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.