BaltiVoice: A Speech Corpus and Fine-tuned Whisper ASR System for the Balti Language

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, quick

Summary

BaltiVoice introduces a 16.8-hour read-speech corpus for Balti (ISO 639-3: bft), a Tibetic language that previously had no public Automatic Speech Recognition (ASR) resources. The corpus comprises 10,060 validated utterances in native Nastaliq script, sourced from Mozilla Common Voice recordings. The researchers fine-tuned OpenAI Whisper-small on the dataset and report a Word Error Rate (WER) of 30.07% on a 538-utterance validation set, down from a zero-shot baseline of 182.18% for the same model. A WER above 100% is possible because inserted words count as errors, so the zero-shot output contained more errors than the references had words. The dataset, the fine-tuned model, and a live transcription demo are publicly available on Hugging Face.
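
To make the WER figures concrete, here is a minimal sketch of the standard metric: word-level edit distance divided by the number of reference words. The function name `word_error_rate` and the English placeholder sentences are illustrative assumptions; the summary does not say which tool or text normalization the authors used for scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (all but the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Placeholder sentences (not Balti data), chosen only to show the arithmetic.
perfect = word_error_rate("one two three", "one two three")         # 0 errors / 3 words
one_deletion = word_error_rate("one two three four", "one three four")  # 1 error / 4 words
runaway = word_error_rate("one two", "a b c d e")                    # 5 errors / 2 words
print(perfect, one_deletion, runaway)
```

The last case scores 2.5, or 250%: two substitutions plus three insertions against a two-word reference. A zero-shot model that emits long, unrelated output for an unseen language can exceed 100% WER in the same way.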

Key takeaway

For NLP or machine learning engineers building ASR systems for low-resource languages, this work demonstrates a viable path: build a focused read-speech corpus, even a relatively small one (here, 16.8 hours), and fine-tune a pre-trained model such as Whisper-small on it. In this case, that turned an unusable zero-shot baseline (182.18% WER) into a workable starting point (30.07% WER) for the target language.

Key insights

Fine-tuning pre-trained ASR models with modest, newly created corpora can establish functional speech recognition for low-resource languages.

Method

The authors built a 16.8-hour read-speech corpus from Mozilla Common Voice recordings, then fine-tuned OpenAI Whisper-small on it, reaching 30.07% WER on the 538-utterance validation set.
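
As a quick sanity check on the reported figures, the sketch below derives two numbers implied by them: the mean utterance length and the relative WER reduction over the zero-shot baseline. The helper names are illustrative, and both derived values are back-of-the-envelope calculations rather than statistics reported by the authors.

```python
# Figures as stated in this summary.
CORPUS_HOURS = 16.8
UTTERANCES = 10_060
ZERO_SHOT_WER = 182.18   # percent
FINE_TUNED_WER = 30.07   # percent


def mean_utterance_seconds(hours: float, utterances: int) -> float:
    """Average clip length, assuming the total duration covers all utterances."""
    return hours * 3600 / utterances


def relative_wer_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error removed by fine-tuning."""
    return (baseline - improved) / baseline


avg_seconds = mean_utterance_seconds(CORPUS_HOURS, UTTERANCES)
reduction = relative_wer_reduction(ZERO_SHOT_WER, FINE_TUNED_WER)

print(f"mean utterance length: {avg_seconds:.1f} s")
print(f"relative WER reduction: {reduction:.1%}")
```

Clips averaging about six seconds are consistent with short read prompts. The relative reduction should be read with care: because the baseline WER exceeds 100%, it mostly reflects the zero-shot model failing outright rather than a conventional accuracy gap.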

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.