Voice of India: A Large-Scale Benchmark for Real-World Speech Recognition in India

Source: Computation and Language · Field: Technology & Digital, Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Voice of India is a new closed-source benchmark for real-world automatic speech recognition (ASR) in India. It addresses two limitations of existing benchmarks, which often rely on scripted, clean speech and on strict single-reference Word Error Rate (WER) evaluation. The dataset comprises 306,230 utterances totaling 536 hours of unscripted telephonic conversations from 36,691 speakers across 15 major Indian languages and 139 regional clusters. Its transcripts account for natural spelling variations, including non-standardized spellings of code-mixed English-origin words. The benchmark also reports performance geographically at the district level, which reveals regional disparities, and breaks results down by audio quality, speaking rate, gender, and device type to pinpoint where current ASR systems are weak.
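To illustrate why strict single-reference WER can over-penalize valid spellings, here is a minimal Python sketch of a variant-tolerant WER. It is not the benchmark's scoring method, which this summary does not describe. It assumes that each reference position carries a set of acceptable spellings. The function name `variant_wer` and the example words are hypothetical.

```python
from typing import List, Set


def variant_wer(reference: List[Set[str]], hypothesis: List[str]) -> float:
    """Word error rate where each reference position accepts any spelling in its set.

    Assumption for illustration: spelling variants are given per word position.
    """
    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j].
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            is_match = hypothesis[j - 1] in reference[i - 1]
            curr[j] = min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (0 if is_match else 1),  # match or substitution
            )
        prev = curr
    return prev[m] / n


# Hypothetical romanized, code-mixed example (made-up spellings).
hypothesis = ["mera", "mobail", "richarge", "karo"]
strict = [{"mera"}, {"mobile"}, {"recharge"}, {"karo"}]
flexible = [{"mera"}, {"mobile", "mobail"}, {"recharge", "richarge"}, {"karo"}]

print(variant_wer(strict, hypothesis))    # 0.5: two of four words count as errors
print(variant_wer(flexible, hypothesis))  # 0.0: both spellings are accepted
```

With a single reference spelling, the two English-origin words are scored as substitutions even though the recognizer heard them correctly. Accepting a set of spellings per position removes that penalty while leaving genuine errors untouched.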

Key takeaway

For AI engineers developing ASR systems for Indian languages, this benchmark highlights the need to move beyond clean, scripted data. Prioritize training and evaluating models on diverse, unscripted telephonic conversations that account for natural spelling variations and code-mixing. Tracking performance disparities at the district level and across factors like audio quality can help you build more robust and equitable real-world ASR solutions.

Key insights

Real-world Indic ASR requires benchmarks with unscripted speech, diverse languages, and flexible spelling.

Principles

Method

The Voice of India benchmark was built from unscripted telephonic conversations covering 15 Indian languages and 139 regional clusters. Its transcripts account for spelling variations, and results are analyzed geographically down to the district level.
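The kind of disaggregated analysis described here can be sketched as a grouped WER computation. This is a minimal illustration, not the benchmark's tooling. The `ScoredUtterance` record, its field names, and the sample values are all hypothetical, and each utterance is assumed to have already been scored for word-level errors.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class ScoredUtterance:
    # Hypothetical metadata fields; the real benchmark schema is not given here.
    district: str
    device: str
    errors: int     # word-level edit errors for this utterance
    ref_words: int  # number of words in the reference transcript


def wer_by(utterances: Iterable[ScoredUtterance], key: str) -> Dict[str, float]:
    """Corpus-level WER per group: total errors divided by total reference words."""
    errors: Dict[str, int] = defaultdict(int)
    words: Dict[str, int] = defaultdict(int)
    for utt in utterances:
        group = getattr(utt, key)
        errors[group] += utt.errors
        words[group] += utt.ref_words
    # Skip groups with no reference words to avoid dividing by zero.
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


# Made-up sample data, for illustration only.
sample = [
    ScoredUtterance("district_a", "feature_phone", errors=3, ref_words=10),
    ScoredUtterance("district_a", "smartphone", errors=1, ref_words=10),
    ScoredUtterance("district_b", "feature_phone", errors=5, ref_words=10),
]

print(wer_by(sample, "district"))
print(wer_by(sample, "device"))
```

Summing errors and reference words before dividing weights each group by its word count, so a handful of very short utterances cannot dominate a district's score. The same function works for any metadata field, such as gender or an audio-quality bucket.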

Best for: AI Engineer, Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.