DASB -- Discrete Audio and Speech Benchmark
Summary
The Discrete Audio and Speech Benchmark (DASB) is a benchmark and leaderboard for evaluating discrete audio tokens across a wide range of discriminative and generative speech processing tasks. Developed by researchers from Concordia University, Mila, Avignon Université, Université de Montréal, and Université Laval, DASB addresses the inconsistent evaluation settings used in existing work on audio tokens. It benchmarks semantic tokens (e.g., discrete HuBERT, WavLM, and Wav2Vec2), compression tokens (EnCodec, DAC), and hybrid tokens (SpeechTokenizer) on speech recognition, speaker identification, emotion recognition, keyword spotting, intent classification, speech enhancement, speech separation, and text-to-speech. The results show that semantic tokens outperform compression tokens on most tasks, with discrete WavLM emerging as the top-performing model. However, a significant performance gap persists between discrete tokens and continuous representations, indicating a need for further research.
Key takeaway
For AI engineers and research scientists developing audio processing systems, DASB is a reference for selecting a discrete audio tokenizer. Match the tokenizer to the task: semantic tokens such as discrete WavLM generally perform best on most discriminative and generative tasks, while compression tokens such as EnCodec better preserve speaker identity and are more efficient for streaming applications. Discrete tokens still lag behind continuous representations, and further research is needed to close this gap.
Key insights
DASB provides a standardized benchmark for discrete audio tokens, showing semantic tokens generally outperform compression tokens.
Principles
- Optimal tokenizer choice is task-dependent.
- Semantic tokens excel at capturing high-level information.
- Compression tokens better preserve speaker identity.
Method
DASB evaluates audio tokens in three stages. A frozen encoder converts audio into discrete tokens. A downstream neural model with attention then consumes those tokens. For generative tasks, a frozen decoder converts the predicted tokens back into audio.
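The pipeline described above can be sketched with a toy example. This is a minimal illustration under stated assumptions, not DASB's implementation: the frozen encoder and decoder are stood in for by a fixed random codebook with nearest-neighbor lookup, the downstream model is reduced to a single attention-pooling step, and every name here (make_codebook, encode, decode, attention_pool) is hypothetical.

```python
import math
import random


def make_codebook(num_codes, dim, seed=0):
    """Build a fixed ("frozen") codebook of random vectors. Assumption: a toy
    stand-in for a pretrained tokenizer; nothing here is trained."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_codes)]


def encode(frames, codebook):
    """Frozen-encoder stand-in: map each frame to the index of its nearest code."""
    tokens = []
    for frame in frames:
        dists = [sum((f - c) ** 2 for f, c in zip(frame, code)) for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def decode(tokens, codebook):
    """Frozen-decoder stand-in: look each token up in the same codebook."""
    return [list(codebook[t]) for t in tokens]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(embeddings, query):
    """Downstream-model stand-in: weight each token embedding by
    softmax(query . embedding) and return the weighted sum and the weights."""
    scores = [sum(q * e for q, e in zip(query, emb)) for emb in embeddings]
    weights = softmax(scores)
    pooled = [
        sum(w * emb[d] for w, emb in zip(weights, embeddings))
        for d in range(len(query))
    ]
    return pooled, weights


# Toy run: 10 random 4-dimensional "audio frames" and an 8-entry codebook.
codebook = make_codebook(num_codes=8, dim=4)
rng = random.Random(1)
frames = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(10)]

tokens = encode(frames, codebook)          # audio -> discrete tokens
embeddings = decode(tokens, codebook)      # token embeddings for the downstream model
pooled, weights = attention_pool(embeddings, query=[0.1, 0.1, 0.1, 0.1])
reconstructed = decode(tokens, codebook)   # generative path: tokens -> "audio"
```

In a discriminative task, the pooled vector would feed a classifier head; in a generative task, the downstream model would predict a new token sequence that the frozen decoder turns back into audio. Only the downstream model would be trained in either case, since the encoder and decoder stay frozen.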
In practice
- Use discrete WavLM for multimodal text+audio LLMs.
- Consider EnCodec for streaming tasks due to efficiency.
- Prioritize compression tokens for speaker identity preservation.
Topics
- Discrete Audio Tokens
- DASB Benchmark
- Multimodal LLMs
- Semantic Tokens
- Compression Tokens
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.