Contextual Earnings-22: A Speech Recognition Benchmark with Custom Vocabulary in the Wild
Summary
Argmax, Inc. and UCLA introduce Contextual Earnings-22, an open dataset for standardized benchmarking of contextual speech-to-text (STT) systems. STT accuracy has plateaued on academic benchmarks, so the dataset, built on Earnings-22, targets custom vocabulary recognition, which is critical for real-world usability. It contains 760 context-dense 15-second audio clips from earnings calls, each paired with a manually reviewed transcript and a realistic custom vocabulary context of person, company, and product names. Evaluation covers two scenarios: local context (a precise list with no distractors) and global context (a realistic list that includes distractors). The researchers established six strong baselines using both keyword prompting and keyword boosting. All show significant improvements in contextual term recognition, though robustness to distractors remains a key differentiator.
Key takeaway
For AI Engineers and Research Scientists developing or deploying speech-to-text systems, Contextual Earnings-22 provides a public benchmark for assessing real-world performance. Use it to evaluate how well your models handle custom vocabularies and how robust they are to distractors, two properties that aggregate WER alone often fails to capture. This will help you identify systems that hold up in high-stakes, context-dependent applications like earnings call transcription.
Key insights
Contextual Earnings-22 offers a standardized benchmark for evaluating speech-to-text systems' custom vocabulary recognition in realistic settings.
Principles
- Custom vocabulary accuracy is critical for real-world STT utility.
- Contextual conditioning significantly improves custom term recognition.
- Robustness to distractors differentiates STT system performance.
Method
The Contextual Earnings-22 pipeline extracts contextual keywords using GPT-5, segments the transcripts, performs forced alignment with wav2vec, and applies manual review and correction. The result is a set of 15-second audio clips, each with a local and a global context list.
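To make the clip and context-list structure concrete, here is a minimal Python sketch of the segmentation step. It is not the authors' code. It assumes word-level timestamps are already available from forced alignment, that a local context is the set of keywords actually spoken in a clip, and that a global context is the full keyword list for the recording. The `Word` class, `make_clips` function, and the sample names are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds, assumed to come from forced alignment
    end: float


def normalize(text):
    """Lowercase and replace punctuation with spaces so phrases compare cleanly."""
    return " ".join(re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split())


def contains_phrase(text, phrase):
    """True if `phrase` occurs in `text` on whole-word boundaries."""
    return f" {normalize(phrase)} " in f" {normalize(text)} "


def make_clips(words, keywords, clip_len=15.0):
    """Group aligned words into fixed-length windows and attach context lists."""
    # Assumption: the global context is every keyword extracted for the recording.
    global_context = sorted(set(keywords))
    buckets = {}
    for w in words:
        # A word belongs to the window in which it starts.
        buckets.setdefault(int(w.start // clip_len), []).append(w)

    clips = []
    for idx in sorted(buckets):
        transcript = " ".join(w.text for w in buckets[idx])
        clips.append({
            "start": idx * clip_len,
            "end": (idx + 1) * clip_len,
            "transcript": transcript,
            # Assumption: the local context keeps only keywords spoken in this clip.
            "local_context": [k for k in global_context if contains_phrase(transcript, k)],
            "global_context": global_context,
        })
    return clips


words = [
    Word("Welcome", 0.0, 0.4), Word("to", 0.4, 0.5),
    Word("Acme", 0.5, 0.9), Word("Corp.", 0.9, 1.3),
    Word("Our", 15.2, 15.4), Word("CEO", 15.4, 15.8),
    Word("Jane", 15.8, 16.1), Word("Doe", 16.1, 16.5),
]
clips = make_clips(words, ["Acme Corp", "Jane Doe", "Zentrix"])
for clip in clips:
    print(clip["start"], clip["local_context"], clip["global_context"])
```

In this sketch a keyword that straddles a window boundary is missed by both neighboring clips, which is one reason a manual review pass like the one described above is valuable.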
In practice
- Use keyword-centric metrics alongside WER for STT evaluation.
- Test STT systems with both precise and noisy context lists.
- Consider distractor robustness when selecting STT solutions.
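As an illustration of the first two points, the sketch below computes WER next to keyword precision and recall for a single clip. It is a simplified stand-in, not the benchmark's official scoring. The metric definitions (occurrence counts matched per keyword), the function names, and the sample sentences are all assumptions made for the example.

```python
import re


def normalize(text):
    """Lowercase and replace punctuation with spaces before comparison."""
    return " ".join(re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split())


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = normalize(reference).split(), normalize(hypothesis).split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / max(len(ref), 1)


def count_phrase(text, phrase):
    """Count whole-word occurrences of `phrase` in already-normalized `text`."""
    return len(re.findall(rf"(?<!\S){re.escape(phrase)}(?!\S)", text))


def keyword_scores(reference, hypothesis, context):
    """Keyword precision and recall over the terms in `context` (assumed definition)."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    tp = fp = fn = 0
    for keyword in context:
        k = normalize(keyword)
        ref_n, hyp_n = count_phrase(ref, k), count_phrase(hyp, k)
        tp += min(ref_n, hyp_n)
        fn += max(ref_n - hyp_n, 0)  # keyword spoken but not transcribed
        fp += max(hyp_n - ref_n, 0)  # keyword transcribed but never spoken
    return {
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        "recall": tp / (tp + fn) if tp + fn else float("nan"),
    }


reference = "Acme Corp reported strong demand for Zentrix"

# Precise (local) list: the system misses one keyword.
missed = keyword_scores(reference,
                        "Acme Corp reported strong demand for Sentrix",
                        ["Acme Corp", "Zentrix"])

# Noisy (global) list: the distractor "Jane Doe" leaks into the output.
leaked = keyword_scores(reference,
                        "Acme Corp reported strong demand for Zentrix Jane Doe",
                        ["Acme Corp", "Zentrix", "Jane Doe"])

print(wer(reference, "Acme Corp reported strong demand for Sentrix"), missed, leaked)
```

Scoring the same system under both the precise list and the noisy list separates two failure modes: missed keywords lower recall, while distractors that appear in the transcript lower precision.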
Topics
- Speech Recognition Benchmarks
- Contextual Speech-to-Text
- Custom Vocabulary
- Earnings-22 Dataset
- Keyword Prompting
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.