SwiftAudio: Data-Efficient Caption-Only Distillation for One-Step Text-to-Audio Diffusion-based Generation
Summary
SwiftAudio is a novel one-step text-to-audio (TTA) diffusion framework designed to overcome the high inference latency of iterative multi-step denoising models and the paired text-audio data dependency of existing one-step approaches. It achieves audio-free distillation from a pretrained diffusion teacher using only text captions. The framework adapts Variational Score Distillation (VSD) to the audio domain and incorporates a temporal smoothness regularization objective to ensure coherent latent audio representations. This design allows SwiftAudio to inherit the teacher's generative prior without requiring paired audio supervision, enabling effective training with approximately 45K captions. Experiments on AudioCaps and Clotho datasets demonstrate that SwiftAudio achieves state-of-the-art performance among strict one-step methods and substantially reduces the performance gap to multi-step diffusion systems.
Key takeaway
For machine learning engineers building text-to-audio generation systems, SwiftAudio shows a way to cut both inference latency and data requirements. Consider this one-step, caption-only distillation approach when paired text-audio data is scarce: it reports state-of-the-art results among strict one-step methods on AudioCaps and Clotho while narrowing the gap to slower multi-step systems.
Key insights
SwiftAudio enables one-step text-to-audio generation via audio-free distillation using only text captions, significantly reducing latency and data requirements.
Principles
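- Audio-free distillation trains a one-step TTA model from a pretrained diffusion teacher without any audio samples.
- A temporal smoothness regularizer keeps latent audio representations coherent over time.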
- Caption-only training reduces data dependency.
Method
SwiftAudio adapts Variational Score Distillation (VSD) to the audio domain and adds a temporal smoothness regularization objective. The one-step model is distilled from a pretrained diffusion teacher using only text captions, so it learns to produce coherent latent audio representations without paired audio supervision.
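To make the two ingredients concrete, here is a minimal sketch in plain Python. It is not SwiftAudio's implementation: the summary does not give the exact form of either term. The sketch assumes the general VSD update direction from earlier score-distillation work (teacher noise prediction minus an auxiliary network's prediction, scaled by a weight) and assumes the smoothness penalty is a mean squared difference between adjacent latent frames. The names `Latent`, `vsd_direction`, and `temporal_smoothness`, and the `[time_frames][channels]` layout, are hypothetical.

```python
from typing import List

# Hypothetical latent layout: a list of time frames, each a list of channel values.
Latent = List[List[float]]


def vsd_direction(teacher_eps: Latent, aux_eps: Latent, weight: float = 1.0) -> Latent:
    """VSD-style update direction: weight * (teacher prediction - auxiliary prediction).

    In a real system both predictions come from neural networks evaluated on a
    noised version of the one-step generator's output; here they are plain lists.
    """
    return [
        [weight * (t - a) for t, a in zip(t_row, a_row)]
        for t_row, a_row in zip(teacher_eps, aux_eps)
    ]


def temporal_smoothness(latent: Latent) -> float:
    """Assumed regularizer: mean squared difference between adjacent time frames."""
    if len(latent) < 2:
        return 0.0
    total = 0.0
    count = 0
    for prev, curr in zip(latent, latent[1:]):
        for a, b in zip(prev, curr):
            total += (b - a) ** 2
            count += 1
    return total / count


# A latent that changes steadily scores lower than one that jumps back and forth.
smooth_latent = [[0.0], [1.0], [2.0]]
jittery_latent = [[0.0], [2.0], [0.0]]
smooth_penalty = temporal_smoothness(smooth_latent)
jittery_penalty = temporal_smoothness(jittery_latent)
```

In a training loop of this kind, the smoothness penalty would typically be added to the distillation objective with a small weighting coefficient. That coefficient and the exact combination are further assumptions not stated in the summary.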
In practice
- Develop TTA models with limited paired audio data.
- Reduce inference latency for audio generation.
- Train TTA systems using only text captions.
Topics
- Text-to-Audio Generation
- Diffusion Models
- One-Step Inference
- Variational Score Distillation
- Data-Efficient Training
- AudioCaps
- Clotho
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.