DirectAudioEdit: Inversion-Free Text-Guided Audio Editing via Diffusion Prediction Contrast

2026-06-08 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

DirectAudioEdit is a training-free, inversion-free method for text-guided audio editing with pretrained diffusion models. It combines diffusion prediction contrast, shared-noise re-noising, and a dynamic guidance schedule to construct a source-to-target editing path directly, avoiding the computational overhead and reconstruction errors of inversion-based techniques. On music and event-level benchmarks with AudioLDM2 and Tango2 backbones, it reduces macro-averaged FAD and KL by 15.9% and 15.8%, respectively, relative to DDPM inversion. It also reports up to a 64.5% editing speedup and an 85.0% performance improvement, while maintaining competitive target alignment and better source preservation.

Key takeaway

For AI engineers building text-guided audio editing, DirectAudioEdit improves both efficiency and quality over DDPM inversion: editing is up to 64.5% faster, and macro-averaged FAD is 15.9% lower, on music and event-level tasks. Consider this inversion-free approach where editing latency and compute cost matter in your application.

Key insights

DirectAudioEdit enables inversion-free, training-free audio editing in diffusion models by directly constructing target states via diffusion prediction contrast.

Principles

Inversion-free editing reduces computational overhead and reconstruction errors.
Because diffusion sampling paths are curved, applying inversion-free editing to diffusion models directly, without adaptation, is suboptimal.
Dynamic guidance balances editing strength and source preservation.

Method

DirectAudioEdit uses shared-noise re-noising to make source/target branches comparable, then applies diffusion prediction contrast for editing direction, controlled by a dynamic guidance schedule.
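
The three components named above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the update rule, the linear-beta noise schedule, the linear guidance_weight schedule, and every name and default (inversion_free_edit, t_start, step_size, toy_eps_model) are assumptions chosen for illustration, and toy_eps_model stands in for a pretrained backbone such as AudioLDM2 or Tango2.

```python
import numpy as np

def make_alpha_bar(num_train_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal level alpha_bar_t for a linear-beta DDPM schedule."""
    betas = np.linspace(beta_start, beta_end, num_train_steps)
    return np.cumprod(1.0 - betas)

def guidance_weight(i, num_steps, w_start=1.0, w_end=0.2):
    """Hypothetical dynamic schedule: linear interpolation over edit steps."""
    frac = i / max(num_steps - 1, 1)
    return (1.0 - frac) * w_start + frac * w_end

def inversion_free_edit(x_src, eps_model, src_prompt, tgt_prompt,
                        num_steps=20, t_start=600, step_size=0.05, seed=0):
    """Move a clean source latent toward the target prompt without inversion."""
    rng = np.random.default_rng(seed)
    alpha_bar = make_alpha_bar()
    x_src = np.array(x_src, dtype=np.float64)
    z = x_src.copy()  # the edited latent starts at the source
    timesteps = np.linspace(t_start, 0, num_steps).astype(int)
    for i, t in enumerate(timesteps):
        a = alpha_bar[t]
        signal, sigma = np.sqrt(a), np.sqrt(1.0 - a)
        # Shared-noise re-noising: both branches receive the same noise
        # sample, so their noisy states differ only through z - x_src.
        noise = rng.standard_normal(z.shape)
        xt_src = signal * x_src + sigma * noise
        xt_tgt = signal * z + sigma * noise
        # Prediction contrast: difference of the two noise predictions.
        contrast = (eps_model(xt_tgt, t, tgt_prompt)
                    - eps_model(xt_src, t, src_prompt))
        # sigma / signal rescales a noise-space difference to clean-latent
        # space; the guidance weight sets the editing strength at this step.
        w = guidance_weight(i, num_steps)
        z = z - step_size * w * (sigma / signal) * contrast
    return z

def toy_eps_model(x_t, t, prompt):
    """Stand-in denoiser: the prediction shifts with the prompt."""
    bias = 1.0 if prompt == "target" else 0.0
    return 0.1 * x_t + bias

if __name__ == "__main__":
    latent = np.linspace(-1.0, 1.0, 8)
    edited = inversion_free_edit(latent, toy_eps_model, "source", "target")
    print(np.round(edited - latent, 3))
```

In this sketch, identical source and target prompts give a zero contrast at every step, so the latent is returned unchanged. That is the sense in which sharing the noise makes the two branches comparable: any movement of the latent comes from the prompt difference rather than from sampling noise.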

In practice

Edit music while preserving global structure such as rhythm and timbre.
Perform event-level edits: add, remove, or replace specific sounds.
Achieve faster audio editing on NVIDIA RTX 3090 GPUs.

Topics

Text-guided Audio Editing
Diffusion Models
Inversion-free Editing
AudioLDM2
Tango2
Inference Efficiency
Audio Quality Metrics

Code references

haoheliu/audioldm_eval

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.