Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors
Summary
ScenA is a method for reference-driven multi-speaker audio scene generation. It conditions a text-to-audio flow-matching foundation model on multiple reference voices and a free-form natural language prompt. Because the foundation model is pretrained on large-scale in-the-wild data, ScenA inherits the capacity to produce natural, non-studio audio, including background noise, room acoustics, overlapping dialogue, and spontaneous paralinguistic events, without requiring per-turn structural supervision. Training runs into a "Reference Shortcut" obstacle, which ScenA addresses with a high-noise-biased timestep distribution that forces the model to rely on the text prompt for speaker assignment. Evaluated on the CoVoMix2-Dialogue benchmark, ScenA outperforms existing multi-speaker systems in speaker-binding metrics and generates rich conversational audio.
Key takeaway
For AI scientists developing multi-speaker dialogue systems, ScenA shows that realistic conversational audio can be generated by conditioning a large-scale, in-the-wild pretrained text-to-audio model on reference voices and a free-form prompt. The approach generates rich, natural audio scenes, including background noise and overlapping speech, and outperforms structured speech-only pipelines. When adopting it, use high-noise-biased training to prevent reference shortcuts, so that speaker assignment is driven by the text.
Key insights
Leveraging in-the-wild pretrained foundation models enables realistic multi-speaker audio scene generation from text and reference voices.
Principles
- Conditioning on reference voices and free-form prompts enhances audio scene realism.
- High-noise-biased training prevents reference shortcuts in text-to-audio models.
- General-purpose audio models surpass speech-only pipelines for scene generation.
Method
ScenA concatenates reference latents into the model's token sequence and distinguishes them with identity-aware positional encodings. During training, it samples timesteps from a high-noise-biased distribution so that the model relies on the text prompt for speaker assignment.
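The sequence layout described above can be illustrated with a small sketch. The summary says only that reference latents are concatenated into the token sequence and distinguished by identity-aware positional encodings; it does not give the encoding itself. The `Token` class, the `build_sequence` helper, and the choice to tag each token with an (identity, position) pair are hypothetical, chosen only to make the idea concrete.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    latent: List[float]  # one latent frame (toy dimensionality)
    identity: int        # ASSUMPTION: 0 = scene being generated, k >= 1 = k-th reference voice
    position: int        # ASSUMPTION: frame index restarts within each segment


def build_sequence(target_latents, reference_latents):
    """Concatenate target and reference latents into one token sequence.

    Each token carries an (identity, position) tag that a positional
    encoding could consume to tell the segments apart.
    """
    sequence = [Token(lat, 0, i) for i, lat in enumerate(target_latents)]
    for k, ref in enumerate(reference_latents, start=1):
        sequence.extend(Token(lat, k, i) for i, lat in enumerate(ref))
    return sequence


# Toy example: a 4-frame target scene and two reference voices (3 and 2 frames).
target = [[0.0, 0.0] for _ in range(4)]
refs = [
    [[0.1, 0.2] for _ in range(3)],
    [[0.3, 0.4] for _ in range(2)],
]
sequence = build_sequence(target, refs)
identities = [tok.identity for tok in sequence]
positions = [tok.position for tok in sequence]
```

In this layout, two tokens at the same frame index but from different segments remain distinguishable through the identity tag, which is the property the identity-aware encoding is meant to provide.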
In practice
- Integrate reference voices and natural language prompts for complex audio scenes.
- Employ high-noise-biased training so the model relies on the text prompt for speaker assignment.
- Consider foundation models over speech-only pipelines for ambient audio generation.
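As a rough illustration of high-noise-biased timestep sampling, the sketch below draws flow-matching timesteps that concentrate near the noisy end of the path. The summary does not state which distribution ScenA uses, so the power transform, the `power` parameter, and the convention that `t = 0` is pure noise and `t = 1` is clean data are all assumptions made for this example.

```python
import random


def sample_high_noise_timestep(rng, power=3.0):
    """Draw t in [0, 1) biased toward 0.

    ASSUMPTION: t = 0 is pure noise and t = 1 is clean data, so values
    near 0 are the high-noise end. Raising a uniform sample to a power
    greater than 1 shifts probability mass toward 0.
    """
    u = rng.random()
    return u ** power


def interpolate(noise, data, t):
    """Linear flow-matching path: x_t = (1 - t) * noise + t * data."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


rng = random.Random(0)
samples = [sample_high_noise_timestep(rng) for _ in range(10000)]
mean_t = sum(samples) / len(samples)
frac_high_noise = sum(t < 0.5 for t in samples) / len(samples)
```

With `power=3.0` the expected timestep is 1 / (power + 1) = 0.25, compared with 0.5 under uniform sampling, so most training steps see inputs dominated by noise. The intuition, consistent with the summary, is that at those steps the model has more reason to consult the text prompt to decide which speaker says what.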
Topics
- Multi-Speaker Audio Generation
- Audio Scene Generation
- Text-to-Audio Models
- Flow-Matching
- Reference-Driven AI
- Speaker Binding
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.