Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations

2026-06-18 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Speech Processing · Depth: Expert, quick

Summary

Mean opinion score (MOS) prediction models, widely used in text-to-speech (TTS) research as proxy metrics, diverge significantly from human listeners when assessing speech quality beyond acoustic fidelity. A study investigated this by applying controlled perturbations: acoustic degradation, prosodic errors, and manipulations of speaker-specific characteristics such as pitch (F0) and speaking rate. While most models accurately track acoustic degradation, they are universally insensitive to prosodic errors, which human listeners penalize with large drops in rated quality. Furthermore, the models show strong biases tied to mean F0 that are absent from human ratings, yet they fail to detect changes in speaking rate and F0 variability that humans readily notice. These findings underscore the limitations of current scalar MOS prediction models as comprehensive measures of speech quality.

Key takeaway

Machine learning engineers developing text-to-speech (TTS) systems should recognize that current MOS prediction models cannot comprehensively evaluate speech quality beyond basic acoustic fidelity. Do not rely solely on these models to assess prosodic accuracy or speaker-specific characteristics such as pitch variability and speaking rate, because they are insensitive to errors that humans readily detect. Instead, integrate human evaluation or develop specialized metrics to capture these aspects of natural speech.
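One way to act on this takeaway is to check, per perturbation type, whether a MOS predictor reacts when listeners do. The sketch below is illustrative only: the function names, the thresholds (`min_human_drop`, `max_ratio`), and every score in the example are hypothetical placeholders, not values or procedures from the study.

```python
from statistics import mean


def mean_score_drop(clean_scores, perturbed_scores):
    """Average paired drop (clean minus perturbed) across utterances."""
    if len(clean_scores) != len(perturbed_scores):
        raise ValueError("clean and perturbed scores must be paired")
    return mean(c - p for c, p in zip(clean_scores, perturbed_scores))


def flag_insensitive(human_drop, model_drop, min_human_drop=0.5, max_ratio=0.25):
    """Return True when listeners penalize a perturbation but the model barely moves.

    Both thresholds are hypothetical defaults chosen for illustration.
    """
    return human_drop >= min_human_drop and model_drop < max_ratio * human_drop


# Placeholder scores for one perturbation type (NOT data from the study).
human_clean = [4.0, 4.0, 4.0]
human_perturbed = [3.0, 3.0, 3.0]
model_clean = [4.0, 4.0, 4.0]
model_perturbed = [4.0, 3.9, 3.9]

human_drop = mean_score_drop(human_clean, human_perturbed)
model_drop = mean_score_drop(model_clean, model_perturbed)
insensitive = flag_insensitive(human_drop, model_drop)
```

A perturbation type that gets flagged this way is a candidate for human evaluation or a specialized metric rather than a scalar MOS predictor.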

Key insights

Current MOS prediction models fail to capture human perception of prosodic errors and speaker characteristics.

Principles

MOS models prioritize acoustic fidelity.
Human perception integrates prosody and speaker traits.
Model biases exist for mean fundamental frequency (F0).

Method

The study applies controlled perturbations to speech samples (acoustic degradation, prosodic errors, and changes to speaker-specific characteristics such as F0 and speaking rate) and compares human ratings with model predictions.
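As a rough illustration of what a controlled acoustic perturbation can look like, the sketch below adds white Gaussian noise at a chosen signal-to-noise ratio. This is an assumption about one plausible form of acoustic degradation, not the study's actual procedure; the function names and the sine-tone stand-in for speech are hypothetical.

```python
import numpy as np


def add_noise_at_snr(signal, snr_db, rng=None):
    """Return the signal plus white Gaussian noise scaled to the target SNR in dB."""
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=np.float64)
    signal_power = np.mean(signal ** 2)
    noise = rng.standard_normal(signal.shape)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that 10 * log10(signal_power / noise_power) equals snr_db.
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = noise * np.sqrt(target_noise_power / noise_power)
    return signal + noise


def measured_snr_db(clean, degraded):
    """Recover the SNR in dB from a clean signal and its degraded version."""
    residual = degraded - clean
    return 10.0 * np.log10(np.mean(clean ** 2) / np.mean(residual ** 2))


# Stand-in for a speech sample: a 1-second, 220 Hz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
clean = 0.5 * np.sin(2.0 * np.pi * 220.0 * t)

degraded = add_noise_at_snr(clean, snr_db=10.0, rng=np.random.default_rng(0))
```

Sweeping `snr_db` over several levels yields a graded series of degraded samples that can be scored by both listeners and a MOS predictor. Prosodic, F0, and speaking-rate manipulations need dedicated signal-processing tools and are not covered by this sketch.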

Topics

Mean Opinion Score
Text-to-Speech
Speech Quality Assessment
Prosodic Errors
Speaker Characteristics
Acoustic Degradation

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.