Voice of India: A Large-Scale Benchmark for Real-World Speech Recognition in India

Source: cs.CL updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Advanced, long

Summary

Voice of India (VoI) is a new closed-source benchmark for evaluating Automatic Speech Recognition (ASR) systems on real-world speech in Indian languages. It addresses two limitations of existing benchmarks: reliance on scripted speech and scoring with strict Word Error Rate (WER). VoI comprises 306,230 utterances totaling 536 hours of unscripted telephonic conversations from 36,691 speakers across 15 major Indian languages and 139 regional clusters. It incorporates multiple valid transcripts to account for natural spelling variation, including code-mixed English-origin words, and uses Orthographically-Informed Word Error Rate (OIWER) for evaluation. The analysis reveals significant geographic disparities in ASR performance, with higher error rates in linguistically diverse regions such as parts of South India and North Bihar, and lower rates in the Hindi belt and metropolitan areas. The benchmark also reports performance by audio quality, speaking rate, gender, and device type, identifying specific conditions where current ASR systems struggle.
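To make the idea of scoring against multiple valid transcripts concrete, here is a minimal Python sketch of a multi-reference word error rate: the hypothesis is compared with every accepted spelling variant and the lowest error rate is kept. This is an illustration only, not the benchmark's OIWER definition, which this summary does not spell out. The function names and the romanized example strings are assumptions made for the sketch.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def multi_reference_wer(references: Sequence[str], hypothesis: str) -> float:
    """Lowest WER of the hypothesis against any accepted reference.

    Hypothetical helper: illustrates multi-reference scoring in general,
    not the paper's OIWER metric.
    """
    hyp = hypothesis.split()
    best = float("inf")
    for ref_text in references:
        ref = ref_text.split()
        if not ref:
            continue  # skip empty references to avoid division by zero
        best = min(best, edit_distance(ref, hyp) / len(ref))
    if best == float("inf"):
        raise ValueError("need at least one non-empty reference")
    return best


# Invented example: two accepted spellings of the same utterance.
variants = ["mera phone kharab hai", "mera fone kharab hai"]
print(multi_reference_wer(variants, "mera fone kharab hai"))
print(multi_reference_wer(variants[:1], "mera fone kharab hai"))
```

With both variants accepted, the hypothesis is not penalized for a spelling that annotators consider valid. Scored against the first variant alone, the same hypothesis incurs one substitution in four words.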

Key takeaway

For AI Engineers and Research Scientists developing ASR systems for Indian languages, this benchmark highlights that current models exhibit substantial robustness gaps, particularly in low-resource languages and specific geographic regions. You should prioritize targeted regional data collection, gender-stratified training, and explicit cross-regional evaluation metrics to improve real-world performance. Relying solely on public benchmarks or single-reference WER can lead to an overestimation of system capabilities and mask critical failures in diverse linguistic contexts.
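One way to make cross-regional evaluation explicit is to report error rates per group rather than a single corpus-wide number. The sketch below assumes each scored utterance is a dictionary with hypothetical fields `errors` (word-level edit count), `ref_words` (reference length), and a grouping field such as `region`. None of these names come from the benchmark itself.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def wer_by_group(records: Iterable[Mapping], key: str) -> Dict[str, float]:
    """Pooled WER per group: total errors divided by total reference words."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for rec in records:
        errors[rec[key]] += rec["errors"]
        words[rec[key]] += rec["ref_words"]
    # Groups with no reference words are dropped rather than divided by zero.
    return {group: errors[group] / words[group]
            for group in words if words[group] > 0}


# Invented records for illustration only.
scored = [
    {"region": "cluster_a", "errors": 2, "ref_words": 10},
    {"region": "cluster_a", "errors": 1, "ref_words": 10},
    {"region": "cluster_b", "errors": 5, "ref_words": 10},
]
for region, wer in sorted(wer_by_group(scored, "region").items()):
    print(region, round(wer, 3))
```

The same function can group by any other recorded attribute, such as gender or device type, by passing a different `key`.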

Key insights

Real-world ASR performance in India requires unscripted data, orthographically-informed metrics, and regional analysis to overcome current benchmark limitations.

Method

The Voice of India benchmark uses population-proportional cluster sampling for geographic balance, GPT-4.5 for prompt generation, and a machine-assisted multi-annotator pipeline with Gemini 3 Flash for lattice-based transcription and variation generation.
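As a rough illustration of population-proportional allocation, the sketch below splits a fixed utterance budget across clusters in proportion to their population, using the largest-remainder method so the quotas sum exactly to the budget. The benchmark's actual sampling procedure is not described in this summary; the function name, cluster names, and figures are all assumptions.

```python
from typing import Dict


def proportional_allocation(populations: Dict[str, int], total: int) -> Dict[str, int]:
    """Allocate `total` samples across clusters in proportion to population.

    Uses the largest-remainder method with integer arithmetic.
    Hypothetical helper, not the benchmark's published procedure.
    """
    pop_sum = sum(populations.values())
    if pop_sum <= 0:
        raise ValueError("total population must be positive")
    # (quotient, remainder) of each cluster's exact proportional share.
    shares = {name: divmod(total * pop, pop_sum)
              for name, pop in populations.items()}
    alloc = {name: q for name, (q, _) in shares.items()}
    leftover = total - sum(alloc.values())
    # Hand the remaining samples to the clusters with the largest remainders.
    ranked = sorted(shares, key=lambda name: shares[name][1], reverse=True)
    for name in ranked[:leftover]:
        alloc[name] += 1
    return alloc


# Invented cluster populations for illustration only.
quotas = proportional_allocation(
    {"cluster_a": 500, "cluster_b": 300, "cluster_c": 200}, total=10)
print(quotas)
```

Integer arithmetic avoids floating-point rounding drift, which matters when a budget is divided across many clusters.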

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.