An Investigation Into Various Approaches For Bengali Long-Form Speech Transcription and Bengali Speaker Diarization

Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Speech Recognition) · Depth: Advanced, quick

Summary

This paper presents a multistage approach to Bengali long-form speech transcription and speaker diarization, developed for the "DL Sprint 4.0" Kaggle competitions and aimed at the challenges of a low-resource language. For transcription, the authors fine-tuned Whisper Medium on Bengali data and combined it with audio chunking and background-noise cleaning, reaching a Word Error Rate (WER) of 0.38 on the private leaderboard. For speaker diarization, they used `pyannote/speaker-diarization-community-1` with a custom segmentation model and a two-pass method, reaching a Diarization Error Rate (DER) of 0.27 on the private leaderboard and 0.19 on the public leaderboard. The results suggest that targeted fine-tuning and careful use of the available data can make speech tools more usable for low-resource South Asian languages. The authors have made their code publicly available.
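The chunking step mentioned above can be illustrated with a minimal sketch. The summary does not state the chunk length or overlap the authors used, so the 30-second window (the input length Whisper models process at a time) and the 5-second overlap below are assumptions, and `chunk_spans` is a hypothetical helper name rather than anything from the paper's code.

```python
def chunk_spans(total_seconds, chunk_seconds=30.0, overlap_seconds=5.0):
    """Split a recording of total_seconds into overlapping (start, end) windows.

    chunk_seconds and overlap_seconds are illustrative defaults, not values
    reported by the paper.
    """
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    if not 0 <= overlap_seconds < chunk_seconds:
        raise ValueError("overlap_seconds must be in [0, chunk_seconds)")

    step = chunk_seconds - overlap_seconds  # how far each window advances
    spans = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break  # the last window already reaches the end of the audio
        start += step
    return spans


if __name__ == "__main__":
    # A 70-second recording becomes three windows that overlap by 5 seconds.
    print(chunk_spans(70.0))  # [(0.0, 30.0), (25.0, 55.0), (50.0, 70.0)]
```

Each window would then be transcribed separately and the overlapping text merged. How that merge is done is a design choice this sketch leaves open.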

Key takeaway

For low-resource Bengali long-form speech, a multistage approach combining a fine-tuned Whisper Medium model with a two-pass pyannote diarization pipeline achieved a WER of 0.38 for transcription and a DER of 0.27 for speaker diarization on the private leaderboards. This suggests that targeted fine-tuning and careful use of the available data can produce practical speech-processing systems for South Asian languages.
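To make the two reported metrics concrete, here is a minimal sketch of their standard definitions. WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. DER is the sum of missed speech, false-alarm speech, and speaker-confusion time divided by total reference speech time. The function names are hypothetical, and the competition's exact scoring script (for example, its text normalization or collar settings) is not described in the summary, so this should not be read as the official scorer.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)


def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER = (missed + false alarm + confusion) / total reference speech time."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


if __name__ == "__main__":
    # One substitution and one deletion against four reference words.
    print(word_error_rate("a b c d", "a x c"))  # 0.5
    # Made-up durations in seconds, chosen only to illustrate the formula.
    print(diarization_error_rate(10.0, 5.0, 12.0, 100.0))  # 0.27
```

Under these definitions, a WER of 0.38 corresponds to roughly 38 word-level errors per 100 reference words, and a DER of 0.27 to about 27% of reference speech time being missed, falsely detected, or attributed to the wrong speaker.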

Best for: Machine Learning Engineer, AI Scientist, Research Scientist, AI Researcher, AI Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.