BaltiVoice: A Speech Corpus and Fine-tuned Whisper ASR System for the Balti Language

2026-06-02 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, quick

Summary

BaltiVoice introduces a 16.8-hour read-speech corpus for Balti (ISO 639-3: bft), a Tibetic language that previously had no public Automatic Speech Recognition (ASR) resources. The corpus comprises 10,060 validated utterances in native Nastaliq script, sourced from Mozilla Common Voice recordings. The researchers fine-tuned OpenAI Whisper-small on this dataset and report a Word Error Rate (WER) of 30.07% on a 538-utterance validation set, down from Whisper-small's zero-shot baseline of 182.18% on Balti. (WER can exceed 100% because inserted words count as errors, so a hypothesis much longer than its reference can accumulate more errors than the reference has words.) The dataset, the fine-tuned model, and a live transcription demo are publicly available on Hugging Face.
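A zero-shot WER above 100% can look like a typo, so a small worked example may help. The sketch below computes WER as word-level edit distance divided by reference length. It assumes plain whitespace tokenization and uses placeholder tokens rather than real Balti text; the source does not say which scoring tool or text normalization the authors used, so `word_error_rate` is a hypothetical helper, not their evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Placeholder tokens, not Balti text.
print(word_error_rate("a b c d", "a b c d"))  # 0.0
print(word_error_rate("a b c d", "a x c"))    # 0.5
print(word_error_rate("a b", "x y z w v"))    # 2.5
```

The last call shows the over-100% case: two substitutions plus three insertions against a two-word reference give a WER of 250%. A model that emits long output in the wrong language or script would be penalized in the same way.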

Key takeaway

For NLP or machine learning engineers building ASR systems for low-resource languages, this work demonstrates a viable path. You should consider creating a focused read-speech corpus, even a relatively small one (16.8 hours here), and fine-tuning a pre-trained model such as Whisper-small. In this case, that approach turned a non-functional zero-shot baseline (182.18% WER) into a working system at 30.07% WER.
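To put the headline figures in proportion, the arithmetic below derives two numbers the summary does not state directly: the mean utterance length and the relative WER reduction. The inputs are the figures reported above; the variable names are illustrative, and the mean is only an average, since the source gives no duration distribution.

```python
CORPUS_HOURS = 16.8
UTTERANCES = 10_060
ZERO_SHOT_WER = 182.18   # percent
FINE_TUNED_WER = 30.07   # percent

mean_utterance_seconds = CORPUS_HOURS * 3600 / UTTERANCES
absolute_wer_drop = ZERO_SHOT_WER - FINE_TUNED_WER
relative_wer_reduction = absolute_wer_drop / ZERO_SHOT_WER

print(f"mean utterance length: {mean_utterance_seconds:.1f} s")   # 6.0 s
print(f"absolute WER drop: {absolute_wer_drop:.2f} points")       # 152.11 points
print(f"relative WER reduction: {relative_wer_reduction:.1%}")    # 83.5%
```

Utterances of about six seconds are consistent with read speech of single prompted sentences. The remaining 30.07% WER still means roughly three word errors per ten reference words.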

Key insights

Fine-tuning pre-trained ASR models with modest, newly created corpora can establish functional speech recognition for low-resource languages.

Principles

Publicly available resources enable ASR development for underserved languages.
Fine-tuning can sharply reduce ASR error rates on languages the base model handles poorly (here, from 182.18% to 30.07% WER).

Method

A 16.8-hour read-speech corpus of 10,060 validated utterances was assembled from Mozilla Common Voice recordings and used to fine-tune OpenAI Whisper-small, which reached a 30.07% WER on a 538-utterance validation set.
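As a rough illustration of the corpus-building step, the sketch below reads a Common Voice style metadata file into (clip path, transcript) pairs. It assumes the tab-separated layout that Common Voice releases typically ship as `validated.tsv`, with `path` and `sentence` columns. The source does not describe the authors' actual pipeline, so `Utterance` and `read_manifest` are hypothetical names and the sample rows are placeholders, not Balti data.

```python
import csv
import io
from typing import Iterable, NamedTuple


class Utterance(NamedTuple):
    audio_path: str
    transcript: str


def read_manifest(tsv_lines: Iterable[str]) -> list[Utterance]:
    """Collect (clip path, transcript) pairs from a Common Voice style TSV."""
    # QUOTE_NONE: treat quote characters in transcripts as ordinary text.
    reader = csv.DictReader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    utterances = []
    for row in reader:
        path = (row.get("path") or "").strip()
        transcript = (row.get("sentence") or "").strip()
        if path and transcript:  # skip rows missing either field
            utterances.append(Utterance(path, transcript))
    return utterances


# Placeholder rows standing in for a real validated.tsv (assumed column names).
sample = io.StringIO(
    "client_id\tpath\tsentence\tup_votes\tdown_votes\n"
    "spk1\tclip_0001.mp3\tplaceholder sentence one\t2\t0\n"
    "spk2\tclip_0002.mp3\t\t2\t0\n"
    "spk1\tclip_0003.mp3\tplaceholder sentence two\t3\t1\n"
)
manifest = read_manifest(sample)
print(len(manifest))  # 2
```

For a real release, the same function could be given an open file handle (opened with `encoding="utf-8"` and `newline=""`) instead of the in-memory sample. The resulting pairs are the kind of input a Whisper fine-tuning script would then load and resample.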

In practice

Use Mozilla Common Voice recordings as the basis for a low-resource language corpus.
Fine-tune Whisper-small to establish a first ASR baseline.

Topics

Balti Language
Automatic Speech Recognition
Speech Corpus
Whisper Model
Low-Resource Languages
Fine-tuning
Nastaliq Script

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.