Contextual Earnings-22: A Speech Recognition Benchmark with Custom Vocabulary in the Wild

Source: cs.CL updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Advanced, long

Summary

Argmax, Inc. and UCLA introduce Contextual Earnings-22, an open dataset for standardized benchmarking of contextual speech-to-text (STT) systems. Built on Earnings-22, it responds to the plateau in STT accuracy on academic benchmarks by focusing on custom vocabulary recognition, which is critical for real-world usability. The dataset contains 760 context-dense 15-second audio clips from earnings calls, each paired with a manually reviewed transcript and a realistic custom vocabulary context covering person, company, and product names. It supports two evaluation scenarios: local context (a precise keyword list with no distractors) and global context (a realistic list that includes distractors). The researchers establish six strong baselines using keyword prompting and keyword boosting. These baselines show significant improvements in contextual term recognition, while robustness to distractors remains a key differentiator between systems.
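
To make the idea of keyword boosting concrete, here is a toy sketch that rescores an n-best list by adding a fixed bonus for each context term a hypothesis contains. This is a simplified stand-in, not the method used by any of the six baselines: the function name `boost_rescore`, the `bonus` value, the scores, and the example terms are all made up for illustration, and only single-word keywords are handled.

```python
def boost_rescore(nbest, context, bonus=2.0):
    """Return the hypothesis text with the best boosted score.

    nbest   -- list of (text, log_score) pairs from a recognizer (assumed input)
    context -- custom vocabulary terms; single words only in this toy version
    bonus   -- score added per matched context term (illustrative value)
    """
    def boosted(item):
        text, score = item
        tokens = text.lower().split()
        matches = sum(1 for term in context if term.lower() in tokens)
        return score + bonus * matches

    return max(nbest, key=boosted)[0]


# Hypothetical n-best list: the acoustically preferred hypothesis misspells
# the made-up product name "Orbix".
nbest = [
    ("the orbits launch went well", -4.0),
    ("the orbix launch went well", -5.0),
]
print(boost_rescore(nbest, ["Orbix"]))  # context term supplied
print(boost_rescore(nbest, []))         # no context
```

The same mechanism explains why distractors matter: any term in the context list earns the bonus, so a long list of unspoken terms gives the decoder more chances to prefer a wrong hypothesis.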

Key takeaway

For AI engineers and research scientists developing or deploying speech-to-text systems, Contextual Earnings-22 provides a public benchmark for assessing real-world performance. Use it to evaluate how well your models handle custom vocabularies and how robust they are to distractors, two properties that aggregate WER tends to obscure. This helps identify systems that hold up in high-stakes, context-dependent applications such as earnings call transcription.
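
One way to measure these two properties is sketched below: recall over the context terms that were actually spoken, and the fraction of distractor terms that wrongly appear in the output. The summary does not state the benchmark's exact metric definitions, so the function `keyword_scores`, its normalization rules, and the example names are assumptions for illustration only.

```python
import re


def normalize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_term(tokens, term):
    """True if the term's tokens occur contiguously in the token list."""
    t = normalize(term)
    n = len(t)
    return n > 0 and any(tokens[i:i + n] == t for i in range(len(tokens) - n + 1))


def keyword_scores(reference, hypothesis, context):
    """Score one clip against one context list.

    Terms found in the reference count as spoken keywords; every other
    context term is treated as a distractor (an assumption of this sketch).
    """
    ref, hyp = normalize(reference), normalize(hypothesis)
    spoken = [k for k in context if contains_term(ref, k)]
    distractors = [k for k in context if not contains_term(ref, k)]
    hits = [k for k in spoken if contains_term(hyp, k)]
    false_hits = [k for k in distractors if contains_term(hyp, k)]
    return {
        "recall": len(hits) / len(spoken) if spoken else None,
        "distractor_rate": len(false_hits) / len(distractors) if distractors else None,
    }


# Made-up example: all names are hypothetical.
reference = "Thanks Priya, Zentrix revenue grew on Orbix demand"
hypothesis = "Thanks Priya, Quorvia revenue grew on Orbix demand"
local_context = ["Priya", "Zentrix", "Orbix"]            # no distractors
global_context = local_context + ["Quorvia", "Halden Labs"]  # with distractors

print(keyword_scores(reference, hypothesis, local_context))
print(keyword_scores(reference, hypothesis, global_context))
```

Scoring the same hypothesis against both lists separates the two failure modes: missed keywords lower recall in either setting, while wrongly inserted distractors only show up under the global context.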

Key insights

Contextual Earnings-22 offers a standardized benchmark for evaluating speech-to-text systems' custom vocabulary recognition in realistic settings.

Method

The Contextual Earnings-22 pipeline extracts contextual keywords with GPT-5, segments the transcripts, and performs forced alignment with wav2vec. The output is then manually reviewed and corrected to produce 15-second audio clips, each with a local and a global context list.
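
The segmentation and context-list step can be sketched as follows, assuming forced alignment has already produced word-level timestamps and keyword extraction has already produced a candidate list. The `Word` and `Clip` types, the fixed-window rule, and the choice to build the global list as the union of keywords across a recording are assumptions of this sketch, not details confirmed by the source. Only single-word keywords are matched.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    text: str
    start: float  # seconds, from forced alignment (assumed input)
    end: float


@dataclass
class Clip:
    start: float
    end: float
    words: list = field(default_factory=list)
    local_context: list = field(default_factory=list)


def make_clips(words, keywords, clip_len=15.0):
    """Group aligned words into fixed-length windows and attach local context.

    A word is assigned to the window containing its start time.
    """
    clips = {}
    for w in words:
        idx = int(w.start // clip_len)
        clip = clips.setdefault(idx, Clip(idx * clip_len, (idx + 1) * clip_len))
        clip.words.append(w)

    by_lower = {k.lower(): k for k in keywords}
    for clip in clips.values():
        for w in clip.words:
            k = by_lower.get(w.text.lower().strip(".,"))
            if k and k not in clip.local_context:
                clip.local_context.append(k)
    return [clips[i] for i in sorted(clips)]


def global_context(clips):
    """Union of all local contexts, in first-seen order (assumed definition)."""
    out = []
    for clip in clips:
        for k in clip.local_context:
            if k not in out:
                out.append(k)
    return out


# Made-up aligned words and keywords.
words = [
    Word("Zentrix", 1.0, 1.5), Word("grew", 2.0, 2.3),
    Word("Orbix", 16.0, 16.4), Word("sales", 17.0, 17.3),
]
clips = make_clips(words, ["Zentrix", "Orbix", "Quorvia"])
for c in clips:
    print(c.start, c.end, c.local_context)
print(global_context(clips))
```

Under this construction, every term in a clip's local list is spoken in that clip, while the global list also carries terms from other clips, which act as distractors for the clip being scored.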

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.