Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors
Summary
ScenA generates multi-speaker audio scenes by conditioning a text-to-audio flow-matching foundation model on multiple reference voices and a free-form natural language prompt. Because the foundation model is pretrained on extensive "in-the-wild" data, it produces realistic audio, including background noise, room acoustics, overlapping dialogue, and spontaneous paralinguistic events, without requiring per-turn structured supervision. The main obstacle is the "Reference Shortcut": the model assigns speakers by relying on the acoustic similarity between the references and the noisy targets instead of on the text prompt. ScenA overcomes this with a high-noise-biased timestep distribution during training. On the CoVoMix2-Dialogue benchmark, ScenA outperforms existing multi-speaker systems on speaker-binding metrics while also generating rich conversational audio with elements such as emotional vocalizations and ambient sounds.
Key takeaway
For AI engineers building multi-speaker audio generation systems, ScenA shows one way to produce natural, complex audio scenes without per-turn supervision. Consider starting from a foundation model pretrained on "in-the-wild" data to capture realistic ambient textures and overlapping dialogue. Use identity-aware positional encodings for speaker control, and bias training toward high-noise timesteps so that the model relies on the text prompt for speaker assignment rather than on acoustic shortcuts.
Key insights
ScenA generates realistic multi-speaker audio scenes from reference voices and a text prompt, overcoming the "Reference Shortcut" with high-noise-biased training.
Principles
- In-the-wild pretraining yields natural audio.
- Reference latents enable multi-speaker control.
- High-noise bias forces text prompt reliance.
Method
ScenA conditions a text-to-audio flow-matching model with reference latents concatenated into the token sequence, distinguished by identity-aware positional encodings, and uses a high-noise-biased timestep distribution.
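The sequence layout described above can be illustrated with a minimal sketch. It is not ScenA's implementation: the summary does not specify how the encodings are built, so the `TokenPosition` type, the `build_positions` and `encode` helpers, the convention that identity 0 marks the target audio, and the sinusoidal features are all assumptions chosen for illustration. The point is only that every token carries both a time index and an identity index, so two reference clips at the same time step remain distinguishable after concatenation.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class TokenPosition:
    """Hypothetical two-part position: who the token belongs to, and when."""
    identity: int  # assumed convention: 0 = target audio, k >= 1 = k-th reference voice
    time: int      # frame index, restarting at 0 for each segment


def build_positions(target_len: int, reference_lens: Sequence[int]) -> List[TokenPosition]:
    """Positions for the target tokens followed by each concatenated reference clip."""
    positions = [TokenPosition(0, t) for t in range(target_len)]
    for speaker_idx, ref_len in enumerate(reference_lens, start=1):
        positions.extend(TokenPosition(speaker_idx, t) for t in range(ref_len))
    return positions


def sinusoid(value: int, dim: int) -> List[float]:
    """Standard sinusoidal features for an integer index (dim must be even)."""
    features: List[float] = []
    for i in range(dim // 2):
        freq = 1.0 / (10000.0 ** (2 * i / dim))
        features.append(math.sin(value * freq))
        features.append(math.cos(value * freq))
    return features


def encode(pos: TokenPosition, dim_time: int = 4, dim_identity: int = 4) -> List[float]:
    """Concatenate time features and identity features into one encoding."""
    return sinusoid(pos.time, dim_time) + sinusoid(pos.identity, dim_identity)


# Example: a 3-frame target plus two 2-frame reference voices.
positions = build_positions(target_len=3, reference_lens=[2, 2])
encodings = [encode(p) for p in positions]
```

In this sketch the first frame of reference voice 1 and the first frame of reference voice 2 share a time index but differ in their identity features, which is the property a text prompt naming "speaker 1" or "speaker 2" would need in order to bind to the right voice.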
In practice
- Use flow-matching models for scene generation.
- Implement identity-aware positional encodings.
- Apply noise-biased training to prevent shortcuts.
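The noise-biased training step listed above can be sketched as a timestep sampler. This summary does not give ScenA's actual distribution, so the code below swaps in a logit-normal sampler with a negative mean as a stand-in. It also assumes the convention `x_t = (1 - t) * noise + t * data`, under which `t` near 0 means a heavily noised target. The names `sample_timestep` and `interpolate` and the `mean`/`std` values are hypothetical.

```python
import math
import random
from typing import List


def sample_timestep(rng: random.Random, mean: float = -1.0, std: float = 1.0) -> float:
    """Draw t in (0, 1) from a logit-normal distribution.

    Assumption: a negative mean shifts probability mass toward t near 0,
    which is the high-noise end under the interpolation used below.
    """
    z = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-z))


def interpolate(noise: List[float], data: List[float], t: float) -> List[float]:
    """Assumed flow-matching path: pure noise at t = 0, clean data at t = 1."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


rng = random.Random(0)
samples = [sample_timestep(rng) for _ in range(10_000)]
mean_t = sum(samples) / len(samples)
high_noise_fraction = sum(1 for t in samples if t < 0.5) / len(samples)
```

With a uniform sampler about half of the training steps would see a mostly clean target, where acoustic matching against the references is easy. Skewing the draws toward high noise means most steps present a target that carries little speaker information, which is one plausible reading of why the bias pushes the model toward the text prompt.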
Topics
- Multi-speaker Audio Generation
- Audio Scene Generation
- Text-to-Audio Models
- Flow-Matching Networks
- Reference Shortcut Mitigation
- CoVoMix2-Dialogue
Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.