Adding Benchmaxxer Repellant to the Open ASR Leaderboard

2026-05-04 · Source: Hugging Face - Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, medium

Summary

The Hugging Face Open ASR Leaderboard, launched in September 2023, has introduced new private datasets from Appen Inc. and DataoceanAI to combat "benchmaxxing" and improve the trustworthiness of its evaluations. These high-quality English ASR datasets cover both scripted and conversational speech across multiple accents, including Australian, Canadian, Indian, American, and British English. While the datasets are private to prevent test-set contamination, the leaderboard allows users to optionally include them in the average Word Error Rate (WER) calculation via a toggle. The default average WER continues to be computed only on public datasets. This update aims to provide a more holistic view of ASR performance by highlighting model strengths and weaknesses across diverse conditions, such as scripted vs. conversational styles and American vs. non-American accents.

Key takeaway

AI engineers and research scientists evaluating ASR models can use the Open ASR Leaderboard's new private datasets to check model performance beyond public benchmarks. Turning on the "Private data" toggle shows how models perform on diverse accents and conversational speech, which can reveal gaps and biases that standard public evaluations miss. A model whose ranking holds up with private data included is more likely to generalize to varied, real-world audio than one tuned to the public test sets.

Key insights

Private datasets make the ASR leaderboard more trustworthy: they are harder to benchmaxx against, and they expose performance differences across accents and speaking styles.

Principles

Standardization is crucial for meaningful benchmarking.
Openness fosters community improvement and contributions.
Private datasets increase evaluation trustworthiness.

Method

New private datasets are integrated into the Open ASR Leaderboard, with an optional toggle to include them in WER calculations. Evaluation scripts and UI code remain open-sourced, and a normalizer standardizes model outputs.
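
To make the role of the normalizer concrete, here is a minimal sketch of how normalization and WER scoring fit together. It is not the leaderboard's evaluation code: the `normalize` function is a deliberately simplified, hypothetical stand-in (lowercasing, punctuation removal, whitespace collapsing), and `word_error_rate` is a plain word-level edit distance written for illustration.

```python
import re


def normalize(text: str) -> str:
    """Toy normalizer (assumption, not the leaderboard's): lowercase,
    replace punctuation with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)  # keep word characters and apostrophes
    return " ".join(text.split())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words,
    computed after both strings are normalized."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    if not ref:
        raise ValueError("reference is empty after normalization")

    # Single-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For instance, `word_error_rate("Hello, world!", "hello world")` returns `0.0`, because casing and punctuation differences are normalized away before scoring. Without a shared normalizer, two models with identical transcription quality could receive different WERs purely because of formatting habits.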

In practice

Use the leaderboard's toggle to include private datasets.
Submit models via GitHub PR for private evaluation.
Self-report public metrics in model cards for unverified listing.
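
The effect of the toggle in the first item above can be sketched as follows. This assumes the leaderboard average is an unweighted mean of per-dataset WERs, which the summary does not confirm; the function name `average_wer`, the dataset keys, and all numbers are hypothetical and chosen only for illustration.

```python
from statistics import mean
from typing import Mapping


def average_wer(
    public_wer: Mapping[str, float],
    private_wer: Mapping[str, float],
    include_private: bool = False,
) -> float:
    """Unweighted mean of per-dataset WERs (assumed averaging scheme).

    By default only public datasets count, mirroring the leaderboard's
    default; include_private=True mimics switching the toggle on.
    """
    scores = list(public_wer.values())
    if include_private:
        scores.extend(private_wer.values())
    if not scores:
        raise ValueError("no WER scores to average")
    return mean(scores)


# Made-up per-dataset WERs (percent) for one hypothetical model.
public = {"public_set_a": 4.0, "public_set_b": 6.0}
private = {"private_conversational": 11.0}

default_avg = average_wer(public, private)                       # public only
toggled_avg = average_wer(public, private, include_private=True)  # public + private
```

A large jump between `default_avg` and `toggled_avg` for a given model is the kind of signal the private data is meant to surface: strong numbers on public test sets that do not carry over to unseen accents or conversational speech.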

Topics

Open ASR Leaderboard
Automatic Speech Recognition
Benchmarking
Dataset Contamination
Appen Inc.

Best for: AI Engineer, Research Scientist, Machine Learning Engineer, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.