Your Ground Truth Is Wrong: Evaluating STT with truth files & semantic WER | AssemblyAI Workshop

2026-03-31 · Source: AssemblyAI · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, extended

Summary

AssemblyAI's Applied AI team, led by Zach and David, presented a workshop on improving speech-to-text (STT) benchmarking and highlighted critical flaws in the traditional Word Error Rate (WER) metric. They introduced Universal 3 Pro, their newest model, which was accurate enough to expose errors in human-labeled ground truth files. Because WER counts every disagreement with the reference as an error, those labeling mistakes gave the more accurate model misleadingly worse WER scores. To address this, AssemblyAI launched the "Truth File Corrector" tool, which lets users manually review and update ground truth files by comparing them against Universal 3 Pro's transcriptions. The workshop also covered the Speech-to-Text Benchmarking SDK, which includes advanced metrics such as Semantic WER and Missed Entity Rate (MER), and demonstrated using Large Language Models (LLMs) as judges for more nuanced evaluation. The presenters emphasized the importance of A/B testing STT models in production and discussed latency metrics specific to streaming STT, such as emission latency and time to complete transcript, particularly for AI voice agent applications.
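
To make the ground truth problem concrete, here is a minimal sketch of standard WER (word-level edit distance divided by reference length) in plain Python. It is not AssemblyAI's SDK; the function name and the example sentences are invented for illustration. It shows how a labeling mistake in the truth file can flip the ranking of two models.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: the labeler misheard a drug name.
flawed_truth = "please refill my met forman prescription today"
corrected_truth = "please refill my metformin prescription today"

accurate_model = "please refill my metformin prescription today"
weaker_model = "please refill my met forman prescription today"

# Against the flawed truth file the accurate model looks worse.
flawed_scores = (
    word_error_rate(flawed_truth, accurate_model),
    word_error_rate(flawed_truth, weaker_model),
)
# After correcting the truth file the ranking flips.
corrected_scores = (
    word_error_rate(corrected_truth, accurate_model),
    word_error_rate(corrected_truth, weaker_model),
)
print(flawed_scores, corrected_scores)
```

Scored against the flawed file, the accurate model is charged two errors for a word it got right, while the weaker model scores a perfect 0. Correcting the reference reverses the result.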

Key takeaway

For MLOps engineers evaluating speech-to-text models, relying solely on Word Error Rate (WER) computed against potentially flawed ground truth files can lead to selecting an inferior model. Use a tool such as AssemblyAI's Truth File Corrector to validate and improve your ground truth data. Also adopt metrics such as Semantic WER and Missed Entity Rate, and consider A/B testing models directly in production to measure their impact on user outcomes and business metrics rather than raw error rates alone.

Key insights

Traditional WER is flawed; superior STT models can expose errors in human-labeled ground truths.

Principles

Not all transcription errors are equal.
Ground truth quality directly impacts benchmark validity.
Production A/B testing shows real-world impact that offline error rates cannot.

Method

Correct ground truth files using a comparison tool, then evaluate STT models with a Python SDK incorporating Semantic WER, Missed Entity Rate, and LLM-based judging for nuanced performance insights.
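
The two metrics named in the method can be approximated in a few lines. The sketch below is an assumption-laden illustration, not the Benchmarking SDK's API: `SEMANTIC_WORD_LIST`, `normalize`, `wer`, and `missed_entity_rate` are hypothetical names, the semantic step simply maps listed variants to a canonical form before scoring, and an entity counts as missed when its normalized form does not appear as a whole phrase in the hypothesis. The SDK's actual definitions may differ.

```python
import re

# Hypothetical semantic word list: each variant maps to one canonical form.
SEMANTIC_WORD_LIST = {
    "okay": "ok",
    "gonna": "going to",
    "dr": "doctor",
    "2": "two",
}


def normalize(text, word_list=None):
    """Lowercase, strip punctuation, and apply optional canonical mappings."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    mapping = word_list or {}
    normalized = []
    for token in tokens:
        # A mapping may expand to several words, e.g. "gonna" -> "going to".
        normalized.extend(mapping.get(token, token).split())
    return normalized


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j - 1] + cost, prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis, word_list=None):
    """Plain WER when word_list is None, semantic-style WER otherwise."""
    ref = normalize(reference, word_list)
    hyp = normalize(hypothesis, word_list)
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def missed_entity_rate(entities, hypothesis):
    """Fraction of reference entities absent from the hypothesis."""
    if not entities:
        raise ValueError("entities must not be empty")
    hyp_text = " " + " ".join(normalize(hypothesis)) + " "
    missed = [
        entity for entity in entities
        if " " + " ".join(normalize(entity)) + " " not in hyp_text
    ]
    return len(missed) / len(entities)


# Invented example sentences.
reference = "Okay, Dr. Patel is gonna see you at 2"
hypothesis = "ok doctor patel is going to see you at two"

plain = wer(reference, hypothesis)
semantic = wer(reference, hypothesis, SEMANTIC_WORD_LIST)
mer = missed_entity_rate(
    ["Patel", "Lisinopril"], "doctor patel prescribed listen april"
)
print(plain, semantic, mer)
```

In this example, plain WER charges five errors for differences that preserve meaning, while the semantic variant scores the transcript as error-free. The entity check flags the garbled drug name even though the rest of the sentence is intact.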

In practice

Use the Truth File Corrector to refine existing ground truths.
Implement Semantic Word Lists for context-preserving substitutions.
A/B test STT models in production to gauge real-world impact.
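
For the A/B testing item, one possible setup is sketched below using only the standard library. The workshop summary does not specify an implementation, so everything here is an assumption: the function names, the `baseline` and `candidate` variant labels, the choice of a two-proportion z-test, and the outcome counts, which are made up to show the calculation.

```python
import hashlib
import math


def assign_variant(session_id: str, candidate_share: float = 0.5) -> str:
    """Deterministically route a session to the baseline or candidate model."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "candidate" if bucket < candidate_share else "baseline"


def two_proportion_z_test(success_a, total_a, success_b, total_b):
    """Return (z, two-sided p-value) for the difference in success rates."""
    rate_a = success_a / total_a
    rate_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        raise ValueError("cannot test when every outcome is identical")
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# The same session always lands in the same arm.
variant = assign_variant("session-1234")

# Made-up outcome counts: calls where the user completed their task.
z, p_value = two_proportion_z_test(
    success_a=400, total_a=1000,  # baseline model
    success_b=460, total_b=1000,  # candidate model
)
print(variant, round(z, 2), round(p_value, 4))
```

Hashing the session ID keeps assignment stable across retries without storing state, and the test compares a business outcome (task completion) rather than an error rate.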

Topics

STT Benchmarking
Word Error Rate
Truth File Correction
Missed Entity Rate
Universal 3 Pro

Best for: Machine Learning Engineer, MLOps Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AssemblyAI.