Neural Speaker Diarization via Multilingual Training: Evaluation on Low-Resource Nepali-Hindi Speech
Summary
This study investigates neural speaker diarization for low-resource Nepali-Hindi speech using a multilingual training approach. Speaker diarization, which identifies "who spoke when," typically performs poorly for underrepresented languages because annotated data is scarce. The researchers compared two end-to-end neural diarization (EEND) architectures, EEND with encoder-decoder attractors (EEND-EDA) and EEND with Perceiver-based attractors (DiaPer), both trained on a corpus combining English speech (LibriSpeech), diverse-speaker audio (VoxCeleb), and collected Nepali and Hindi audio. This setup aimed to reduce language bias and promote cross-lingual generalization. Evaluated in 2-speaker, 3-speaker, 4-speaker, and mixed-speaker scenarios on the LibriSpeech, VoxCeleb, and Nepali-Hindi (NeHi) test sets, DiaPer achieved stronger overall performance than EEND-EDA. On NeHi, DiaPer obtained diarization error rates (DERs) of 3.28%, 2.02%, 4.05%, and 4.76% in the 2-speaker, 3-speaker, 4-speaker, and mixed-speaker settings, respectively, which supports the viability of Perceiver-based end-to-end neural diarization for low-resource multilingual speech processing.
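For readers unfamiliar with the metric, DER is conventionally the sum of missed speech, false-alarm speech, and speaker confusion, divided by the total reference speech. The sketch below computes a frame-level version of it under the best speaker-label permutation. It is an illustration only: the function name `frame_der` and the 0/1 frame layout are assumptions, and the summary does not state the study's exact scoring protocol (for example, collar size or overlap handling).

```python
from itertools import permutations

def frame_der(ref, hyp):
    """Frame-level DER (hypothetical helper, not the study's scoring code).

    ref, hyp: lists of frames; each frame is a list of 0/1 activity flags,
    one per speaker. Both must have the same number of frames and speakers.
    """
    n_spk = len(ref[0])
    total_ref = sum(sum(r) for r in ref)  # total reference speaker-frames
    if total_ref == 0:
        raise ValueError("reference contains no speech")

    # Missed speech: reference has more active speakers than the hypothesis.
    miss = sum(max(sum(r) - sum(h), 0) for r, h in zip(ref, hyp))
    # False alarm: hypothesis has more active speakers than the reference.
    false_alarm = sum(max(sum(h) - sum(r), 0) for r, h in zip(ref, hyp))

    # Speaker labels are arbitrary, so keep the permutation of hypothesis
    # speakers that agrees with the reference on the most speaker-frames.
    best_correct = 0
    for perm in permutations(range(n_spk)):
        correct = sum(
            r[s] and h[perm[s]] for r, h in zip(ref, hyp) for s in range(n_spk)
        )
        best_correct = max(best_correct, correct)

    # Confusion: speaker-frames that could have matched but were assigned
    # to the wrong speaker.
    matchable = sum(min(sum(r), sum(h)) for r, h in zip(ref, hyp))
    confusion = matchable - best_correct

    return (miss + false_alarm + confusion) / total_ref

# Toy example: four frames, two speakers, one frame of missed speech.
reference = [[1, 0], [1, 0], [0, 1], [1, 1]]
hypothesis = [[1, 0], [0, 0], [0, 1], [1, 1]]
print(f"DER = {100 * frame_der(reference, hypothesis):.2f}%")
```

Trying every permutation is only practical for small speaker counts such as the 2 to 4 speakers evaluated here; larger counts would call for an assignment solver instead.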
Key takeaway
For machine learning engineers developing speech applications for underrepresented languages, this research indicates that multilingual training is a viable strategy for overcoming data scarcity. Consider Perceiver-based end-to-end neural diarization (DiaPer), which outperformed EEND-EDA overall across the 2-speaker, 3-speaker, 4-speaker, and mixed-speaker test scenarios. Combining high-resource and low-resource corpora in the training data may improve diarization accuracy for languages like Nepali and Hindi.
Key insights
Multilingual training combined with Perceiver-based attractors yields low diarization error on low-resource Nepali-Hindi speech, with DiaPer DERs between 2.02% and 4.76% on the NeHi test sets.
Principles
- Multilingual training reduces language bias.
- Cross-lingual generalization is achievable for diarization.
- Perceiver-based attractors (DiaPer) outperform encoder-decoder attractors (EEND-EDA) overall in multi-speaker evaluation.
Method
Train EEND models (EEND-EDA, DiaPer) on a multilingual corpus combining high-resource (English) and low-resource (Nepali-Hindi) speech to improve diarization for underrepresented languages.
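One way to picture combining corpora of very different sizes is balanced sampling, where each training example is drawn by first picking a corpus uniformly and then picking an utterance from it, so the smaller Nepali-Hindi set is not swamped by the larger English data. This is a minimal sketch of that idea, not the study's procedure: the summary does not say how the corpora were mixed, and the function name `sample_balanced`, the corpus keys, and the utterance IDs are all hypothetical.

```python
import random

def sample_balanced(corpora, n, seed=0):
    """Draw n (corpus_name, utterance_id) pairs with equal weight per corpus.

    corpora: dict mapping a corpus name to a non-empty list of utterance IDs.
    Assumption: uniform corpus weighting; the study's actual mixing ratio
    is not given in the summary.
    """
    rng = random.Random(seed)   # seeded for reproducible draws
    names = sorted(corpora)     # fixed order so results do not depend on dict order
    batch = []
    for _ in range(n):
        name = rng.choice(names)                 # pick a corpus uniformly
        batch.append((name, rng.choice(corpora[name])))  # then an utterance
    return batch

# Hypothetical manifests: a large English set and a small Nepali-Hindi set.
corpora = {
    "librispeech": [f"ls-{i}" for i in range(1000)],
    "voxceleb": [f"vc-{i}" for i in range(800)],
    "nehi": [f"nh-{i}" for i in range(50)],
}
batch = sample_balanced(corpora, 8)
print(batch)
```

Uniform weighting oversamples the small corpus relative to its size, which trades some risk of overfitting to it for better language balance; the weights could be tuned instead.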
In practice
- Apply DiaPer for low-resource speech diarization.
- Combine diverse language datasets for training.
- Evaluate on varied speaker count scenarios.
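Reporting results per speaker-count scenario, as the last point suggests, could look like the sketch below. It pools error time and reference speech time within each speaker count before dividing, rather than averaging per-recording percentages, so long recordings are weighted by their duration. The function name `der_by_speaker_count` and the tuple layout are assumptions for illustration, and the numbers in the example are made up.

```python
from collections import defaultdict

def der_by_speaker_count(recordings):
    """Pooled DER (in percent) per speaker count.

    recordings: iterable of (n_speakers, error_seconds, reference_speech_seconds),
    where error_seconds is missed + false-alarm + confusion time.
    """
    error = defaultdict(float)
    speech = defaultdict(float)
    for n_spk, err_s, ref_s in recordings:
        error[n_spk] += err_s
        speech[n_spk] += ref_s
    # Skip scenarios with no reference speech to avoid dividing by zero.
    return {n: 100.0 * error[n] / speech[n] for n in sorted(speech) if speech[n] > 0}

# Illustrative values only, not results from the study.
results = der_by_speaker_count([
    (2, 3.0, 100.0),
    (2, 1.0, 100.0),
    (3, 5.0, 50.0),
])
for n_spk, der in results.items():
    print(f"{n_spk}-speaker DER: {der:.2f}%")
```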
Topics
- Speaker Diarization
- Multilingual Training
- Low-Resource Languages
- Neural Networks
- Perceiver Models
- Speech Processing
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.