Adding Benchmaxxer Repellant to the Open ASR Leaderboard
Summary
The Hugging Face Open ASR Leaderboard, launched in September 2023, has added new private datasets from Appen Inc. and DataoceanAI to combat "benchmaxxing" (tuning models to score well on public benchmarks rather than to generalize) and to make its evaluations more trustworthy. These high-quality English ASR datasets cover both scripted and conversational speech across multiple accents, including Australian, Canadian, Indian, American, and British English. The datasets are kept private to prevent test-set contamination, but users can optionally include them in the average Word Error Rate (WER) calculation via a toggle. By default, the average WER is still computed only on public datasets. The update aims to give a more holistic view of ASR performance by exposing model strengths and weaknesses across diverse conditions, such as scripted vs. conversational speech and American vs. non-American accents.
Key takeaway
AI engineers and research scientists evaluating ASR models can use the Open ASR Leaderboard's new private datasets to see how models perform beyond public benchmarks. Turning on the "Private data" toggle shows how models handle diverse accents and conversational speech, which can reveal gaps and biases that standard evaluations miss. This helps identify models that hold up under varied, real-world audio conditions rather than only on familiar test sets.
Key insights
Private datasets enhance ASR leaderboard trustworthiness by preventing benchmaxxing and revealing nuanced model performance.
Principles
- Standardization is crucial for meaningful benchmarking.
- Openness fosters community improvement and contributions.
- Private datasets increase evaluation trustworthiness.
Method
New private datasets are integrated into the Open ASR Leaderboard, with an optional toggle to include them in the average WER calculation. The evaluation scripts and UI code remain open source, and a text normalizer standardizes model outputs before scoring.
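The method above can be illustrated with a minimal sketch. It is not the leaderboard's actual code: the `normalize` function is a toy stand-in for the real normalizer, the dataset names and WER values are invented, and the unweighted mean over datasets is an assumption about how the average is formed. It only shows the general idea of normalizing text, computing WER, and averaging with or without private datasets.

```python
import re
from dataclasses import dataclass


def normalize(text: str) -> str:
    """Toy normalizer (assumption): lowercase, drop punctuation, collapse spaces."""
    text = re.sub(r"[^a-z0-9' ]+", " ", text.lower())
    return " ".join(text.split())


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1], len(ref)


def wer(pairs: list[tuple[str, str]]) -> float:
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = words = 0
    for reference, hypothesis in pairs:
        e, n = word_errors(reference, hypothesis)
        errors += e
        words += n
    if words == 0:
        raise ValueError("no reference words to score against")
    return errors / words


@dataclass
class DatasetResult:
    name: str        # hypothetical dataset label
    wer: float       # WER as a fraction, e.g. 0.10 for 10%
    private: bool    # True for held-out private test sets


def average_wer(results: list[DatasetResult], include_private: bool = False) -> float:
    """Unweighted mean WER; private datasets count only when the toggle is on."""
    selected = [r for r in results if include_private or not r.private]
    if not selected:
        raise ValueError("no datasets selected")
    return sum(r.wer for r in selected) / len(selected)


# Illustrative values only, not real leaderboard results.
results = [
    DatasetResult("public_a", 0.10, private=False),
    DatasetResult("public_b", 0.20, private=False),
    DatasetResult("private_accented", 0.30, private=True),
]
default_avg = average_wer(results)                        # public datasets only
toggled_avg = average_wer(results, include_private=True)  # public + private
```

In this made-up example the average rises once the private dataset is included, which is the kind of gap the toggle is meant to surface.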
In practice
- Use the leaderboard's toggle to include private datasets.
- Submit models via GitHub PR for private evaluation.
- Self-report results on the public datasets in the model card to be listed as unverified.
Topics
- Open ASR Leaderboard
- Automatic Speech Recognition
- Benchmarking
- Dataset Contamination
- Appen Inc.
Best for: AI Engineer, Research Scientist, Machine Learning Engineer, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.