Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations

2026-06-19 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Speech Processing · Depth: Expert, long

Summary

The study examines where Mean Opinion Score (MOS) prediction models diverge from human listeners in speech quality assessment. The researchers applied three kinds of controlled perturbation to speech samples: acoustic degradation, prosodic errors, and manipulations of speaker-specific characteristics (pitch, speaking rate). Human listeners and six MOS prediction models, including SSL-MOS variants like SHEET-MB and UTMOS, rated the samples. The models tracked acoustic degradation accurately, with system-level Spearman's rank correlation coefficients up to 0.964 for SHEET-BV. However, all models were insensitive to prosodic errors: their scores changed by less than 0.1 points where human MOS dropped by 1.84 points. The models also showed strong mean F₀ biases absent from human ratings, while remaining insensitive to the speaking rate and F₀ variability that humans perceive. These findings underscore the limitations of current scalar MOS prediction models beyond basic acoustic fidelity.

Key takeaway

For AI Scientists and Machine Learning Engineers evaluating text-to-speech (TTS) systems, relying solely on current MOS prediction models for overall quality assessment is insufficient. These models consistently fail to capture critical prosodic errors and exhibit misaligned sensitivity to speaker characteristics, despite tracking acoustic degradation well. You should integrate human listening tests or develop specialized metrics for prosodic naturalness and speaker-specific attributes to ensure comprehensive and perceptually accurate TTS evaluation.

Key insights

Current MOS prediction models reliably track acoustic degradation but are insensitive to prosodic errors and misaligned on speaker characteristics.

Principles

MOS models prioritize signal-level acoustic quality.
Training data composition dictates acoustic degradation sensitivity.
Scalar MOS inherently discards multidimensional quality.

Method

Systematically compared human and model MOS ratings on speech samples perturbed with controlled acoustic degradation, pitch-accent errors, and manipulations of speaker F₀ and speaking rate.
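
The system-level Spearman correlation quoted in the summary can be sketched as follows. This is a minimal illustration and not the study's own evaluation code: it assumes each utterance arrives as a (system_id, human_score, model_score) tuple, averages the scores per system, and then rank-correlates the two sets of system means. The function names and the input layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of positions i..j, made 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def system_level_srcc(rows):
    """rows: iterable of (system_id, human_score, model_score), one per utterance."""
    human, model = defaultdict(list), defaultdict(list)
    for system_id, human_score, model_score in rows:
        human[system_id].append(human_score)
        model[system_id].append(model_score)
    systems = sorted(human)
    return spearman(
        [mean(human[s]) for s in systems],
        [mean(model[s]) for s in systems],
    )
```

Averaging before ranking is what makes the figure system-level: a model can order systems correctly while still misjudging individual utterances, so utterance-level correlations are typically lower.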

In practice

Validate MOS model performance on specific quality dimensions.
Integrate explicit prosodic evaluation for TTS systems.
Select MOS models based on their training data domain.
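
One way to act on the first recommendation is a per-dimension sensitivity check: rate the same utterances before and after a controlled perturbation, then compare how far human and model scores move. The sketch below assumes paired clean and perturbed scores for each dimension. The `max_gap` threshold of 0.5 MOS points is an arbitrary illustrative default, and the function names and input layout are hypothetical rather than taken from the study.

```python
from statistics import mean


def mean_shift(clean_scores, perturbed_scores):
    """Mean paired score change (perturbed minus clean)."""
    if len(clean_scores) != len(perturbed_scores):
        raise ValueError("clean and perturbed scores must be paired")
    return mean(p - c for c, p in zip(clean_scores, perturbed_scores))


def sensitivity_report(ratings, max_gap=0.5):
    """Compare human and model score shifts for each perturbation dimension.

    ratings maps a dimension name to
    {"human": (clean, perturbed), "model": (clean, perturbed)},
    where clean and perturbed are paired lists of scores.
    A dimension is flagged when the model shift differs from the human
    shift by more than max_gap MOS points, in either direction.
    """
    report = {}
    for dimension, scores in ratings.items():
        human_shift = mean_shift(*scores["human"])
        model_shift = mean_shift(*scores["model"])
        gap = model_shift - human_shift
        report[dimension] = {
            "human_shift": human_shift,
            "model_shift": model_shift,
            "flagged": abs(gap) > max_gap,
        }
    return report
```

Because the flag is based on the signed gap between the two shifts, it catches both failure modes described in the summary: a model that barely reacts where humans do (prosodic errors) and a model that reacts where humans do not (mean F₀).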

Topics

Speech Quality Assessment
MOS Prediction Models
Text-to-Speech
Prosodic Errors
Acoustic Degradation
Speaker Characteristics

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.