Your Ground Truth Is Wrong: Evaluating STT with truth files & semantic WER | AssemblyAI Workshop
Summary
AssemblyAI's Applied AI team, led by Zach and David, presented a workshop on improving speech-to-text (STT) benchmarking and on the flaws of the traditional Word Error Rate (WER) metric. They introduced Universal 3 Pro, their newest model, which was accurate enough to expose errors in human-labeled ground truth files: where the model is right and the truth file is wrong, WER counts the correct output as an error, so the better model scores misleadingly worse. To address this, AssemblyAI launched the "Truth File Corrector" tool, which lets users manually review and update ground truth files by comparing them against Universal 3 Pro's transcriptions. The workshop also covered the Speech-to-Text Benchmarking SDK, which includes metrics such as Semantic WER and Missed Entity Rate (MER), and demonstrated using Large Language Models (LLMs) as judges for more nuanced evaluation. The presenters emphasized A/B testing STT models in production and discussed latency metrics specific to streaming STT, such as emission latency and time to complete transcript, particularly for AI voice agent applications.
Key takeaway
If you are an MLOps engineer evaluating speech-to-text models, relying solely on Word Error Rate (WER) and potentially flawed ground truth files can lead you to select an inferior model. Use a tool like AssemblyAI's Truth File Corrector to validate and improve your ground truth data. Also adopt metrics such as Semantic WER and Missed Entity Rate, and consider A/B testing models directly in production to measure their impact on user outcomes and business metrics rather than raw error rates alone.
Key insights
Traditional WER treats the ground truth as infallible, so a sufficiently accurate STT model can expose errors in human-labeled truth files and be penalized for them.
Principles
- Not all transcription errors are equal.
- Ground truth quality directly impacts benchmark validity.
- Production A/B testing measures impact on real user outcomes, which offline error rates alone cannot.
Method
Correct ground truth files using a comparison tool, then evaluate STT models with a Python SDK incorporating Semantic WER, Missed Entity Rate, and LLM-based judging for nuanced performance insights.
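To make these metrics concrete, here is a minimal Python sketch. It does not use the AssemblyAI benchmarking SDK, and the workshop summary does not give exact definitions, so the following are assumptions for illustration: "semantic WER" is approximated as ordinary WER computed after mapping equivalent words (a hypothetical `equivalents` dictionary) to one canonical form, and "missed entity rate" as the fraction of listed reference entities that never appear in the transcript. All function names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis, equivalents=None):
    """WER = word edits / reference length.

    If `equivalents` is given, each word is first mapped to a canonical
    form. This is an ASSUMED stand-in for semantic WER, not the SDK's
    actual definition.
    """
    equivalents = equivalents or {}
    ref = [equivalents.get(w, w) for w in tokenize(reference)]
    hyp = [equivalents.get(w, w) for w in tokenize(hypothesis)]
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def missed_entity_rate(entities, hypothesis):
    """Fraction of reference entities absent from the hypothesis (ASSUMED definition)."""
    if not entities:
        raise ValueError("entities must not be empty")
    hyp = " " + " ".join(tokenize(hypothesis)) + " "
    missed = [e for e in entities
              if " " + " ".join(tokenize(e)) + " " not in hyp]
    return len(missed) / len(entities)


reference = "Okay, call Dr. Patel at 3 pm"
hypothesis = "ok call doctor Patel at 3 pm"
equivalents = {"ok": "okay", "dr": "doctor"}  # hypothetical semantic word list

plain_wer = wer(reference, hypothesis)                   # 2 substitutions / 7 words
semantic_wer = wer(reference, hypothesis, equivalents)   # 0.0 after normalization
mer = missed_entity_rate(["Patel", "Lisinopril"],
                         "call doctor patel about listen april")  # 0.5
```

In this example the two "errors" that plain WER reports are meaning-preserving spelling variants, so the normalized score drops to zero, while the garbled drug name still shows up as a missed entity.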
In practice
- Use the Truth File Corrector to refine existing ground truths.
- Implement Semantic Word Lists for context-preserving substitutions.
- A/B test STT models in production to gauge real-world impact.
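For the A/B testing step, one common approach is to assign each session to a model deterministically by hashing a stable identifier, so a given session always hits the same model and outcomes can be compared per arm. The sketch below is an illustration under assumptions, not a method described in the workshop: the function name, the model labels, and the `treatment_share` parameter are all hypothetical.

```python
import hashlib


def assign_model(session_id, models=("model_a", "model_b"), treatment_share=0.5):
    """Deterministically route a session to one of two STT models.

    The session id is hashed to a number in [0, 1); sessions below
    `treatment_share` go to the second model, the rest to the first.
    """
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return models[1] if bucket < treatment_share else models[0]


# The same session id always maps to the same model.
first = assign_model("session-123")
second = assign_model("session-123")
```

Downstream, you would log the assigned model alongside whatever user outcome or business metric you care about and compare the two arms.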
Topics
- STT Benchmarking
- Word Error Rate
- Truth File Correction
- Missed Entity Rate
- Universal 3 Pro
Best for: Machine Learning Engineer, MLOps Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AssemblyAI.