Neural Speaker Diarization via Multilingual Training: Evaluation on Low-Resource Nepali-Hindi Speech

2026-06-21 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Speech Processing · Depth: Expert, quick

Summary

A study investigates neural speaker diarization for low-resource Nepali-Hindi speech using a multilingual training approach. Speaker diarization, which identifies "who spoke when," typically performs poorly for underrepresented languages because annotated data is scarce. The researchers compared two end-to-end neural diarization (EEND) architectures, EEND with encoder-decoder attractors (EEND-EDA) and EEND with Perceiver-based attractors (DiaPer), both trained on a corpus combining LibriSpeech (English), VoxCeleb (diverse speakers), and collected Nepali and Hindi audio. This setup aimed to reduce language bias and promote cross-lingual generalization. Evaluated across 2-speaker, 3-speaker, 4-speaker, and mixed-speaker scenarios on the LibriSpeech, VoxCeleb, and Nepali-Hindi (NeHi) test sets, DiaPer achieved stronger overall performance than EEND-EDA. On NeHi, DiaPer obtained diarization error rates (DERs) of 3.28%, 2.02%, 4.05%, and 4.76% in the 2-speaker, 3-speaker, 4-speaker, and mixed-speaker settings, respectively, which the study presents as evidence that Perceiver-based end-to-end neural diarization is viable for low-resource multilingual speech processing.

Key takeaway

For machine learning engineers building speech applications for underrepresented languages, this research indicates that multilingual training is a viable way to work around data scarcity. Consider a Perceiver-based end-to-end neural diarization architecture (DiaPer): it outperformed EEND-EDA overall and kept DER below 5% on Nepali-Hindi speech in every tested setting, from 2 to 4 speakers as well as mixed speaker counts. Training on a mix of high-resource and low-resource corpora is a practical route to usable diarization accuracy for languages like Nepali and Hindi.

Key insights

Multilingual training combined with Perceiver-based attractors (DiaPer) reaches low diarization error on low-resource Nepali-Hindi speech, with DERs between 2.02% and 4.76% on the NeHi test sets.

Principles

Multilingual training is intended to reduce language bias.
Cross-lingual generalization is achievable for diarization.
Perceiver-based attractors (DiaPer) outperformed encoder-decoder attractors (EEND-EDA) overall.

Method

Train EEND models (EEND-EDA, DiaPer) on a multilingual corpus combining high-resource (English) and low-resource (Nepali-Hindi) speech to improve diarization for underrepresented languages.
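
One practical question when combining corpora of very different sizes is how often each one is drawn during training. The summary does not say how the study mixed its data, so the following is only a minimal sketch of one common option, temperature-scaled sampling, which draws the smaller corpus more often than its raw share. The corpus sizes in HOURS, the corpus keys, and the function names are illustrative placeholders, not values or code from the study.

```python
import random


def sampling_weights(hours_per_corpus, temperature=0.5):
    """Return per-corpus sampling probabilities.

    temperature=1.0 reproduces size-proportional sampling; values below 1.0
    flatten the distribution so small corpora are drawn more often.
    This is an assumed mixing strategy, not the one reported in the study.
    """
    scaled = {name: hours ** temperature for name, hours in hours_per_corpus.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


def sample_corpora(weights, n_draws, seed=0):
    """Pick which corpus each training example comes from."""
    rng = random.Random(seed)
    names = list(weights)
    probs = [weights[name] for name in names]
    return rng.choices(names, weights=probs, k=n_draws)


# Placeholder sizes in hours, chosen only to make the arithmetic easy to follow.
HOURS = {"librispeech": 100.0, "voxceleb": 400.0, "nehi": 25.0}

weights = sampling_weights(HOURS, temperature=0.5)
draws = sample_corpora(weights, n_draws=1000, seed=0)
```

With these placeholder sizes, the low-resource corpus makes up about 5% of the data by hours but receives a sampling probability of 5/35 (about 14%) at temperature 0.5. Whether such rebalancing helps in a given setup is an empirical question.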

In practice

Apply DiaPer for low-resource speech diarization.
Combine diverse language datasets for training.
Evaluate on varied speaker count scenarios.
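
To make the last point concrete, here is a simplified sketch of a frame-level diarization error rate, pooled by speaker count. It assumes that reference and hypothesis are lists of frames with one 0/1 activity flag per speaker, that both have the same number of speaker columns, that overlapped speech is scored, and that no forgiveness collar is applied. The study's actual scoring protocol is not described in this summary, and the function names are hypothetical.

```python
from itertools import permutations


def frame_errors(ref, hyp):
    """Return (error_frames, reference_speech_frames) for one recording.

    Per frame, missed speech + false alarm + speaker confusion equals
    max(n_ref, n_hyp) - n_correct. Speaker labels in the hypothesis are
    arbitrary, so every column permutation is tried and the best is kept
    (brute force, fine for a handful of speakers).
    """
    if not ref:
        return 0, 0
    n_speakers = len(ref[0])
    ref_frames = sum(sum(frame) for frame in ref)
    best = None
    for perm in permutations(range(n_speakers)):
        errors = 0
        for r, h in zip(ref, hyp):
            mapped = [h[p] for p in perm]
            n_correct = sum(1 for a, b in zip(r, mapped) if a and b)
            errors += max(sum(r), sum(mapped)) - n_correct
        if best is None or errors < best:
            best = errors
    return best, ref_frames


def der_by_speaker_count(recordings):
    """Pool errors over recordings that share a speaker count.

    recordings: iterable of (n_speakers, ref, hyp) tuples.
    Returns {n_speakers: DER}, where DER = error frames / reference speech frames.
    """
    pooled = {}
    for n_speakers, ref, hyp in recordings:
        errors, ref_frames = frame_errors(ref, hyp)
        prev_errors, prev_frames = pooled.get(n_speakers, (0, 0))
        pooled[n_speakers] = (prev_errors + errors, prev_frames + ref_frames)
    return {n: e / f for n, (e, f) in pooled.items() if f > 0}


# Toy example: the hypothesis swaps the two speaker labels and misses the last frame.
REF = [[1, 0], [1, 0], [0, 1], [0, 1]]
HYP = [[0, 1], [0, 1], [1, 0], [0, 0]]
```

In the toy example, the label swap costs nothing once the best permutation is found, so the only error is the one missed frame out of four reference speech frames, giving a DER of 0.25. Reporting the pooled figure separately for each speaker count mirrors the 2-speaker, 3-speaker, 4-speaker, and mixed breakdown used in the study.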

Topics

Speaker Diarization
Multilingual Training
Low-Resource Languages
Neural Networks
Perceiver Models
Speech Processing

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.