HybridCodec: Fast Dual-Stream, Semantically Enhanced Neural Audio Codec
Summary
HybridCodec is a neural audio codec architecture that unifies the dual-stream and semantic distillation paradigms to improve speech tokenization for Multimodal Large Language Models. It uses separate semantic and acoustic branches and distills self-supervised learning (SSL) representations into the semantic stream, so no large SSL model is needed at inference. On the LibriSpeech test set it reaches the lowest RVQ-1 Word Error Rate (WER; lower is better), 15.36%, ahead of DualCodec (18.93%) and DAC (Distill) (21.54%), while keeping reconstruction quality competitive at RVQ-1:12 (4.46%). It is also 3x faster than existing dual-stream models such as DualCodec, with a Real-Time Factor (RTF) of 0.014 on an NVIDIA RTX A6000 GPU. The model generalizes to out-of-domain (SeedTTS-en) and zero-shot cross-lingual (Common Voice French) evaluations, and extending training to 300k updates further improves RVQ-1 WER to 12.96%.
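To make the RTF figure concrete: Real-Time Factor is the time spent processing divided by the duration of the audio processed, so values below 1 mean faster than real time. The sketch below is a minimal illustration of how such a number could be measured; the helper names `real_time_factor` and `measure_rtf`, and the `codec_fn` callable, are hypothetical and not taken from the paper.

```python
import time


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / duration of the audio that was processed."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(codec_fn, samples, sample_rate: int) -> float:
    """Time a single call of codec_fn on samples and return its RTF.

    codec_fn is a placeholder for any encode/decode callable (an assumption
    for illustration, not the paper's benchmarking harness).
    """
    start = time.perf_counter()
    codec_fn(samples)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)


# Reading the reported figure: at RTF 0.014, ten seconds of audio
# would need roughly 0.14 seconds of compute.
seconds_needed = 0.014 * 10.0
```

When timing a model on a GPU, asynchronous kernel launches generally have to be synchronized before the clock is read, and a few warm-up runs are usually discarded; the sketch omits both for brevity.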
Key takeaway
For NLP engineers and ML teams building Multimodal Large Language Models that need efficient speech tokenization, HybridCodec is worth evaluating. The architecture delivers stronger semantic disentanglement (RVQ-1 WER of 12.96% with extended training) and a 3x inference speedup over existing dual-stream models such as DualCodec, which supports real-time and cross-lingual applications without the overhead of a large SSL model at inference.
Key insights
HybridCodec combines dual-stream and distillation for fast, semantically specialized neural audio coding without SSL inference overhead.
Principles
- Semantic-acoustic disentanglement is crucial for speech tokenization.
- Combining dual-stream and distillation improves efficiency and specialization.
- Explicitly modeling residuals enhances semantic disentanglement.
Method
HybridCodec uses a shared CNN encoder/decoder with separate semantic and acoustic streams. SSL embeddings (w2v-BERT-2.0, layer 16) are distilled into the semantic stream via an L2 loss, and the acoustic stream models the residuals.
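The data flow described above can be sketched for a single latent frame. This is a toy illustration under loud assumptions: plain Python lists stand in for tensors, each stream is a single nearest-neighbor codebook lookup, the SSL target is assumed to have the same dimension as the latent (a real system would likely need a projection), and "residual" is taken to mean the encoder latent minus the quantized semantic vector. The function names are hypothetical and the paper's actual layers and loss weights are not reproduced here.

```python
def nearest(codebook, vec):
    """Return the codebook entry closest to vec (squared Euclidean distance)."""
    return min(codebook, key=lambda c: sum((a - b) ** 2 for a, b in zip(c, vec)))


def l2_loss(a, b):
    """Mean squared difference between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def hybrid_frame(latent, ssl_target, semantic_codebook, acoustic_codebook):
    """Process one encoder latent frame through both streams.

    ssl_target is only needed to compute the training-time distillation loss;
    at inference the semantic and acoustic lookups run without it.
    """
    # Semantic stream: quantize the shared encoder latent.
    semantic_q = nearest(semantic_codebook, latent)
    # Distillation: pull the semantic code toward the SSL embedding (training only).
    distill = l2_loss(semantic_q, ssl_target)
    # Acoustic stream: model what the semantic stream left unexplained.
    residual = [z - s for z, s in zip(latent, semantic_q)]
    acoustic_q = nearest(acoustic_codebook, residual)
    # The shared decoder would consume the sum of both streams.
    recon = [s + a for s, a in zip(semantic_q, acoustic_q)]
    return semantic_q, acoustic_q, recon, distill
```

Because the SSL target only enters through the loss term, dropping it at inference leaves the two codebook lookups untouched, which is the property the summary credits for removing the SSL model from the inference path.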
In practice
- Distill SSL features into the first RVQ layer for fast inference.
- Employ dual-stream architectures for explicit semantic-acoustic separation.
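The RVQ-1 and RVQ-1:12 figures quoted in the summary presumably refer to decoding from the first residual vector quantization (RVQ) layer alone versus the first twelve layers. The sketch below shows generic RVQ encoding and prefix decoding in plain Python; it is a textbook-style illustration with hypothetical function names and toy codebooks, not the codec's actual implementation.

```python
def nearest_index(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec layer by layer; each layer encodes the previous layer's residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        i = nearest_index(codebook, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices


def rvq_decode(indices, codebooks, num_layers):
    """Reconstruct from the first num_layers codes only (1 for an RVQ-1 style readout)."""
    out = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(indices[:num_layers], codebooks):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out


# Toy two-layer example: the first layer captures the coarse content,
# the second layer refines it.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, -0.25]],
]
codes = rvq_encode([1.25, 0.75], codebooks)
coarse = rvq_decode(codes, codebooks, num_layers=1)
full = rvq_decode(codes, codebooks, num_layers=2)
```

Distilling SSL features into the first layer means the coarse readout is the one pushed to carry linguistic content, which is why a low RVQ-1 WER is read as evidence of semantic specialization.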
Topics
- Neural Audio Codec
- Speech Tokenization
- Multimodal Large Language Models
- Semantic Disentanglement
- Self-Supervised Learning
- Real-Time Factor
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.