Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations
Summary
This study examines where Mean Opinion Score (MOS) prediction models diverge from human listeners in speech quality assessment. The researchers applied three kinds of controlled perturbation to speech samples: acoustic degradation, prosodic errors, and manipulations of speaker characteristics (pitch and speaking rate). Human listeners and six MOS prediction models, including SSL-MOS variants such as SHEET-MB and UTMOS, rated the samples. The models tracked acoustic degradation accurately, with system-level Spearman's rank correlation coefficients as high as 0.964 (SHEET-BV). All of the models, however, were insensitive to prosodic errors: their predictions changed by less than 0.1 points where human MOS dropped by 1.84 points. The models also showed strong biases tied to mean F₀ that human ratings did not share, while failing to respond to the changes in speaking rate and F₀ variability that humans do perceive. These findings expose the limits of current scalar MOS prediction models once quality depends on more than basic acoustic fidelity.
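To make the "system-level Spearman's rank correlation" figure concrete, here is a minimal sketch of how such a score is typically computed: utterance ratings are averaged per system, and the human and model system means are then rank-correlated. The record layout and the function names (`average_ranks`, `pearson`, `system_level_srcc`) are assumptions for illustration, not the study's actual evaluation code.

```python
from collections import defaultdict
from statistics import mean


def average_ranks(values):
    """Rank values in ascending order (1-based); ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def system_level_srcc(records):
    """records: iterable of (system_id, human_score, model_score), one per utterance.

    Assumed layout for this sketch. Scores are averaged per system first,
    then Spearman's rho is taken as the Pearson correlation of the ranks.
    """
    human, model = defaultdict(list), defaultdict(list)
    for system_id, h, m in records:
        human[system_id].append(h)
        model[system_id].append(m)
    systems = sorted(human)
    h_means = [mean(human[s]) for s in systems]
    m_means = [mean(model[s]) for s in systems]
    return pearson(average_ranks(h_means), average_ranks(m_means))
```

Because only the ordering of system means matters, a model can reach a high system-level correlation while still being offset from human scores in absolute terms, so this number says nothing about whether individual perturbations are scored correctly.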
Key takeaway
For AI scientists and machine learning engineers evaluating text-to-speech (TTS) systems, current MOS prediction models are not sufficient on their own as a measure of overall quality. They track acoustic degradation well, but they consistently miss critical prosodic errors and respond to speaker characteristics differently from human listeners. Pair them with human listening tests, or develop dedicated metrics for prosodic naturalness and speaker-specific attributes, to get a comprehensive and perceptually accurate TTS evaluation.
Key insights
Current MOS prediction models reliably track acoustic degradation but are insensitive to prosodic errors and misaligned on speaker characteristics.
Principles
- MOS models prioritize signal-level acoustic quality.
- Training data composition dictates acoustic degradation sensitivity.
- Scalar MOS inherently discards multidimensional quality.
Method
The study systematically compared human and model MOS ratings on speech samples perturbed with controlled acoustic degradation, pitch-accent errors, and manipulations of speaker F₀ and speaking rate.
In practice
- Validate MOS model performance on specific quality dimensions.
- Integrate explicit prosodic evaluation for TTS systems.
- Select MOS models based on their training data domain.
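One way to act on the first two points is to measure, per quality dimension, how far human MOS drops under a perturbation and check whether the model's prediction drops with it. The sketch below is a hypothetical helper, not the study's procedure: the function names, the dimension labels, and both thresholds (`min_human_drop`, `max_ratio`) are assumptions chosen for illustration.

```python
from statistics import mean


def mos_drop(clean_scores, perturbed_scores):
    """Mean MOS on clean samples minus mean MOS on perturbed samples."""
    return mean(clean_scores) - mean(perturbed_scores)


def flag_insensitive(human_drops, model_drops, min_human_drop=0.5, max_ratio=0.25):
    """Return the dimensions humans penalize clearly but the model barely registers.

    human_drops / model_drops: dicts mapping a dimension name to its MOS drop.
    A dimension is flagged when the human drop is at least `min_human_drop`
    and the model drop is below `max_ratio` times the human drop.
    Both thresholds are arbitrary illustrative defaults.
    """
    flagged = []
    for dimension, human_drop in human_drops.items():
        model_drop = model_drops[dimension]
        if human_drop >= min_human_drop and model_drop < max_ratio * human_drop:
            flagged.append(dimension)
    return flagged


if __name__ == "__main__":
    # The prosodic figures echo the summary above (1.84-point human drop,
    # model change under 0.1, used here as an upper bound). The acoustic
    # figures are made-up placeholders, not results from the study.
    human = {"prosodic_error": 1.84, "acoustic_degradation": 1.5}
    model = {"prosodic_error": 0.1, "acoustic_degradation": 1.4}
    print(flag_insensitive(human, model))
```

Running a check like this separately for each perturbation type keeps a strong aggregate correlation from hiding a dimension the model ignores.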
Topics
- Speech Quality Assessment
- MOS Prediction Models
- Text-to-Speech
- Prosodic Errors
- Acoustic Degradation
- Speaker Characteristics
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.