Evaluation of Automatic Speech Recognition Using Generative Large Language Models

2026-04-23 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

A new study examines whether generative Large Language Models (LLMs) can evaluate Automatic Speech Recognition (ASR) output, aiming to overcome the main limitation of the traditional Word Error Rate (WER) metric: it counts word-level edits and is insensitive to meaning. The research explores three approaches: selecting the better of two hypotheses, computing semantic distance from generative embeddings, and qualitatively classifying errors. On the HATS dataset, the top-performing LLMs agreed with human annotators on hypothesis selection 92-94% of the time, compared with 63% for WER, and also outperformed other semantic metrics. The study also found that embeddings from decoder-based LLMs performed comparably to those from encoder models, which points to a promising path toward more interpretable, semantically aware ASR evaluation.
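
To make WER's insensitivity to meaning concrete, here is a minimal sketch of WER as word-level edit distance divided by reference length. The function name `wer` and the example sentences are illustrative assumptions, not taken from the study or from the HATS dataset.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example sentences (not from HATS).
reference = "i did not take the medication"
hyp_a = "i did take the medication"           # drops "not": meaning reversed
hyp_b = "i did not take the medications"      # plural slip: meaning preserved

print(wer(reference, hyp_a), wer(reference, hyp_b))
```

Both hypotheses contain exactly one word-level error against a six-word reference, so WER scores them identically even though only the first one reverses the meaning.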

Key takeaway

For AI Engineers and Research Scientists evaluating ASR systems, adding generative LLMs to the evaluation pipeline can give a more human-aligned, meaning-sensitive assessment than WER alone. Consider implementing LLM-based hypothesis selection or semantic distance calculations to see which errors actually change meaning, and use that to prioritize model improvements.
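
As a starting point for LLM-based hypothesis selection, the sketch below builds a pairwise comparison prompt and parses a one-letter reply. The prompt wording and the helper names `build_prompt` and `parse_choice` are hypothetical, and the summary does not describe the prompts used in the study. The call to an actual model is left out on purpose, since it depends on the provider.

```python
from typing import Optional


def build_prompt(reference: str, hyp_a: str, hyp_b: str) -> str:
    """Build a pairwise comparison prompt (wording is an assumption)."""
    return (
        "You are judging two automatic transcriptions of the same utterance.\n"
        f"Reference transcript: {reference}\n"
        f"A: {hyp_a}\n"
        f"B: {hyp_b}\n"
        "Which transcription better preserves the meaning of the reference? "
        "Answer with a single letter, A or B."
    )


def parse_choice(reply: str) -> Optional[str]:
    """Return 'A' or 'B' for a clean one-letter reply, otherwise None."""
    token = reply.strip().strip(".").upper()
    return token if token in {"A", "B"} else None


prompt = build_prompt(
    "i did not take the medication",
    "i did take the medication",
    "i did not take the medications",
)
print(prompt)
```

Parsing strictly and returning `None` for anything other than a single letter keeps malformed replies out of the agreement statistics; how to handle them (retry, skip, count as error) is a design choice left open here.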

Key insights

Generative LLMs align more closely with human judgments of ASR output than traditional WER: 92-94% agreement on hypothesis selection versus 63% on the HATS dataset.

Principles

Semantic metrics track human perception more closely than WER.
Decoder-based LLM embeddings perform comparably to encoder models.

Method

Evaluated generative LLMs as ASR evaluators on the HATS dataset through three tasks: hypothesis selection, semantic distance computation using generative embeddings, and qualitative error classification.
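
The agreement figures for hypothesis selection can be understood as the share of hypothesis pairs where a metric prefers the same hypothesis as the human annotators. The sketch below assumes a lower-is-better score per hypothesis and counts ties as disagreements; both are assumptions, since the summary does not state how the study handles ties. The toy scores are invented for illustration.

```python
from typing import Optional, Sequence, Tuple


def metric_choice(score_a: float, score_b: float) -> Optional[str]:
    """Pick the hypothesis with the lower (better) score; None on a tie."""
    if score_a < score_b:
        return "A"
    if score_b < score_a:
        return "B"
    return None


def agreement_rate(
    human_choices: Sequence[str],
    score_pairs: Sequence[Tuple[float, float]],
) -> float:
    """Fraction of pairs where the metric picks the human-preferred hypothesis."""
    if len(human_choices) != len(score_pairs):
        raise ValueError("inputs must have the same length")
    if not human_choices:
        raise ValueError("at least one pair is required")
    agreed = sum(
        metric_choice(a, b) == human
        for human, (a, b) in zip(human_choices, score_pairs)
    )
    return agreed / len(human_choices)


# Invented toy data: human preference per pair, and (score_a, score_b) per pair.
humans = ["A", "B", "A", "B"]
scores = [(0.1, 0.3), (0.2, 0.2), (0.5, 0.4), (0.6, 0.1)]
print(agreement_rate(humans, scores))
```

In this toy data the metric agrees on the first and last pairs, ties on the second, and disagrees on the third, so the agreement rate is 0.5.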

In practice

Use LLMs for ASR hypothesis selection.
Explore generative embeddings for semantic distance.
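
For the semantic distance route, one common choice is cosine distance between the embedding of the reference and the embedding of each hypothesis. The summary does not say which distance the study uses, so cosine is an assumption here, and the short vectors below are hand-made stand-ins for real LLM embeddings.

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("zero vector has no direction")
    return 1.0 - dot / (norm_u * norm_v)


def closer_hypothesis(
    ref_vec: Sequence[float],
    hyp_a_vec: Sequence[float],
    hyp_b_vec: Sequence[float],
) -> str:
    """Return 'A' or 'B' for the hypothesis nearer the reference (ties go to 'A')."""
    dist_a = cosine_distance(ref_vec, hyp_a_vec)
    dist_b = cosine_distance(ref_vec, hyp_b_vec)
    return "A" if dist_a <= dist_b else "B"


# Stand-in vectors; in practice these would come from an embedding model.
ref_vec = [1.0, 0.0]
hyp_a_vec = [1.0, 0.1]   # nearly the same direction as the reference
hyp_b_vec = [0.0, 1.0]   # orthogonal to the reference
print(closer_hypothesis(ref_vec, hyp_a_vec, hyp_b_vec))
```

The choice returned by `closer_hypothesis` can then be compared against human preferences in the same way as any other metric.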

Topics

Automatic Speech Recognition
Large Language Models
ASR Evaluation
Word Error Rate
Semantic Metrics

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.