NVMOS: Non-Verbal Vocalization Quality Assessment in Speech

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Audio and Speech Processing · Depth: Expert, quick

Summary

NVMOS is presented as the first model to reliably predict the perceptual quality of non-verbal vocalizations (NVs) in speech, such as laughter, sighs, and coughs. Existing speech quality assessment methods typically score overall naturalness, or check whether an NV has the correct type and position, and so overlook the intrinsic quality of the NV event itself. To address this gap, the researchers built an NV-MOS dataset that combines outputs from multiple NV-TTS systems with natural NV samples, rated by three acoustic experts. Their analysis shows that general-purpose multimodal large language models such as Gemini are clearly inconsistent with the expert ratings, which makes them unsuitable for reliable NV quality assessment. NVMOS, which uses a local NV-event focusing module, reaches expert-level or stronger agreement with human Mean Opinion Scores (MOS).

Key takeaway

If you are a machine learning engineer developing or evaluating Text-to-Speech (TTS) systems that produce non-verbal vocalizations, do not rely on general-purpose multimodal large language models to judge NV quality. Prefer specialized models such as NVMOS, which combine a local NV-event focusing module with an expert-rated dataset to predict perceptual quality at the level of human raters. This should make NV integration in your speech synthesis outputs more robust and better aligned with human perception.

Key insights

NVMOS reliably assesses non-verbal vocalization quality, outperforming general multimodal LLMs for this specific task.
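
Agreement between a quality predictor and expert ratings is commonly measured with correlation coefficients. The summary does not say which metrics the authors report, so the following is only a minimal sketch under stated assumptions: the per-clip MOS is taken as the mean of the three expert ratings, and agreement is measured with Pearson and Spearman correlation. All ratings and predictions are made-up toy values, and the function names are hypothetical.

```python
from statistics import mean


def mos(ratings_per_clip):
    """Mean Opinion Score per clip: the average of that clip's expert ratings."""
    return [mean(r) for r in ratings_per_clip]


def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (assumed): four clips, three expert ratings each on a 1-5 scale.
expert_ratings = [[4, 5, 4], [2, 2, 3], [3, 3, 3], [1, 2, 1]]
predicted = [4.2, 2.5, 3.1, 1.4]  # hypothetical model outputs

mos_scores = mos(expert_ratings)
print("Pearson:", round(pearson(predicted, mos_scores), 3))
print("Spearman:", round(spearman(predicted, mos_scores), 3))
```

Scoring each expert against the mean of the other two in the same way gives a human reference point, which is one way to interpret a claim of "expert-level" agreement.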

Principles

Method

NVMOS predicts the perceptual quality of NV events using a local NV-event focusing module, and is trained on the NV-MOS dataset of expert ratings.
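
The summary does not describe how the local NV-event focusing module works internally. As a rough illustration of the general idea only, and not the NVMOS architecture, the sketch below contrasts pooling frame-level features over a whole utterance with pooling only the frames inside an NV event. The feature values, the frame indices of the event, and the function names are all assumptions made for the example.

```python
def mean_pool(frames):
    """Average a list of equal-length feature vectors into one vector."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / n for d in range(dim)]


def event_focused_pool(frames, start, end):
    """Pool only the frames in [start, end), i.e. the assumed NV event span."""
    if not 0 <= start < end <= len(frames):
        raise ValueError("event span must lie inside the utterance")
    return mean_pool(frames[start:end])


# Toy utterance (assumed): six frames of 2-dim features; frames 2-3 hold the NV.
frames = [
    [0.0, 0.0],
    [0.0, 0.0],
    [1.0, 2.0],
    [3.0, 4.0],
    [0.0, 0.0],
    [0.0, 0.0],
]

print("global pooling:", mean_pool(frames))
print("event-focused pooling:", event_focused_pool(frames, 2, 4))
```

In this toy case, global pooling dilutes the event frames with the surrounding speech, while the event-focused representation keeps only the NV segment. That is the intuition behind scoring the NV event locally, not the utterance as a whole.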

In practice

Topics

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.