DASB -- Discrete Audio and Speech Benchmark

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Speech & Audio Processing · Depth: Expert, extended

Summary

The Discrete Audio and Speech Benchmark (DASB) is a comprehensive benchmark and leaderboard for evaluating discrete audio tokens across a wide range of discriminative and generative speech processing tasks. Developed by researchers from Concordia University, Mila, Avignon Université, Université de Montréal, and Université Laval, DASB addresses the inconsistent evaluation settings used in existing work on audio tokens. It benchmarks semantic tokens (discrete HuBERT, WavLM, and Wav2Vec2), compression tokens (EnCodec, DAC), and hybrid tokens (SpeechTokenizer) on speech recognition, speaker identification, emotion recognition, keyword spotting, intent classification, speech enhancement, speech separation, and text-to-speech. The results show that semantic tokens outperform compression tokens on most tasks, with discrete WavLM the top-performing model. A significant performance gap nevertheless persists between discrete tokens and continuous representations, indicating that further research is needed.

Key takeaway

For AI engineers and research scientists developing audio processing systems, DASB is a practical reference for selecting a discrete audio tokenizer. Match the tokenizer to the task: semantic tokens such as discrete WavLM perform better on most discriminative and generative tasks, while compression tokens such as EnCodec preserve speaker identity more effectively and are more efficient for streaming applications. Discrete tokens still lag behind continuous representations, and closing that gap remains an open research problem.

Key insights

DASB provides a standardized benchmark for discrete audio tokens, showing semantic tokens generally outperform compression tokens.

Method

DASB evaluates audio tokens by converting audio to discrete tokens via a frozen encoder, feeding them to a downstream neural model with attention, and for generative tasks, converting predicted tokens back to audio via a frozen decoder.
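
The three stages above can be illustrated with a toy sketch. This is not the DASB code: the scalar codebooks, the `encode`/`decode` helpers, and the `AttentionFusion` class are all hypothetical stand-ins, and the attention step (softmax weights over per-codebook embeddings) is only one plausible reading of "a downstream neural model with attention". The weights are random and untrained.

```python
import math
import random

# Hypothetical frozen tokenizer: two tiny scalar codebooks (values are made up).
CODEBOOKS = [
    [-0.5, 0.0, 0.5],   # coarse codebook
    [-0.1, 0.0, 0.1],   # residual codebook
]

def encode(frames):
    """Frozen 'encoder': residual quantization, one token id per codebook per frame."""
    tokens = []
    for x in frames:
        ids, residual = [], x
        for book in CODEBOOKS:
            idx = min(range(len(book)), key=lambda i: abs(book[i] - residual))
            ids.append(idx)
            residual -= book[idx]
        tokens.append(ids)
    return tokens

def decode(tokens):
    """Frozen 'decoder': sum the selected codebook entries to rebuild each frame."""
    return [sum(book[i] for book, i in zip(CODEBOOKS, ids)) for ids in tokens]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

class AttentionFusion:
    """Trainable front end of the downstream model (untrained here): embeds each
    codebook's token, then mixes the embeddings with softmax attention weights."""

    def __init__(self, dim=4, seed=0):
        rng = random.Random(seed)
        # One embedding table per codebook, one row per token id.
        self.tables = [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in book]
                       for book in CODEBOOKS]
        self.query = [rng.uniform(-1, 1) for _ in range(dim)]

    def __call__(self, ids):
        embs = [table[i] for table, i in zip(self.tables, ids)]
        scores = [sum(q * e for q, e in zip(self.query, emb)) for emb in embs]
        weights = softmax(scores)
        fused = [sum(w * emb[d] for w, emb in zip(weights, embs))
                 for d in range(len(self.query))]
        return fused, weights

frames = [0.42, -0.07, 0.58]                     # stand-in for audio frames
tokens = encode(frames)                          # step 1: audio -> discrete tokens
fusion = AttentionFusion()
features = [fusion(ids)[0] for ids in tokens]    # step 2: tokens -> fused features
reconstructed = decode(tokens)                   # step 3 (generative tasks): tokens -> audio
```

In a discriminative task the fused features would feed a classifier or recognizer. In a generative task the downstream model would predict token ids, and only those predictions would pass through the frozen decoder; here the encoder's own tokens are decoded simply to show the round trip.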

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.