NVMOS: Non-Verbal Vocalization Quality Assessment in Speech
Summary
NVMOS is presented as the first model that reliably predicts the perceptual quality of non-verbal vocalizations (NVs) in speech, such as laughter, sighs, and coughs. Existing speech quality assessment methods score overall naturalness, or check that an NV has the correct type and position, but overlook the intrinsic quality of the NV event itself. To address this gap, the researchers built an NV-MOS dataset that combines outputs from multiple NV-TTS systems with natural NV samples, with ratings from three acoustic experts. Their analysis found that general-purpose multimodal large language models such as Gemini show clear inconsistencies with expert ratings, which makes them unsuitable for reliable NV quality assessment. NVMOS, built around a local NV-event focusing module, reaches expert-level or stronger agreement with human Mean Opinion Scores (MOS).
Key takeaway
If you develop or evaluate Text-to-Speech (TTS) systems that include non-verbal vocalizations, do not rely on general multimodal large language models to judge NV quality, because their scores diverge from expert ratings. Prefer specialized models such as NVMOS, which pair a local NV-event focusing module with an expert-rated dataset to predict perceptual quality at a human level. This gives you a more robust, perceptually aligned check on the NVs in your synthesized speech.
Key insights
NVMOS reliably assesses non-verbal vocalization quality, outperforming general multimodal LLMs for this specific task.
Principles
- Non-verbal vocalizations are critical acoustic cues for emotion and intent.
- General multimodal LLMs cannot reliably replace human judgment for NV quality.
- Specialized models with event-focusing modules can achieve expert-level agreement.
Method
NVMOS predicts perceptual quality of NV events using a local NV-event focusing module, trained on an NV-MOS dataset of expert ratings.
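The summary does not describe the internals of the focusing module, so the following is only a minimal sketch of one plausible reading of "local event focusing": pool frame-level features over the NV event's time span and keep them alongside a whole-utterance average. The function names, the frame hop, and the concatenation scheme are all assumptions made for illustration, not the authors' design.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_pool(frames: Sequence[Sequence[float]]) -> Vector:
    """Average a non-empty sequence of equal-length feature vectors."""
    if not frames:
        raise ValueError("cannot pool an empty sequence of frames")
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]


def event_frame_range(
    n_frames: int, hop_s: float, start_s: float, end_s: float
) -> Tuple[int, int]:
    """Map an event span in seconds to a half-open frame index range.

    The range is clamped to the utterance and always covers at least one frame.
    """
    first = min(max(0, int(start_s / hop_s)), n_frames - 1)
    last = min(n_frames, max(first + 1, math.ceil(end_s / hop_s)))
    return first, last


def event_focused_embedding(
    frames: Sequence[Sequence[float]],
    hop_s: float,
    event_start_s: float,
    event_end_s: float,
) -> Vector:
    """Concatenate a global mean with a mean over the NV event's frames.

    Hypothetical scheme: a quality head placed on top of this vector would see
    both the utterance context and the local event, instead of letting a short
    laugh or cough be averaged away by the surrounding speech.
    """
    first, last = event_frame_range(len(frames), hop_s, event_start_s, event_end_s)
    global_part = mean_pool(frames)
    local_part = mean_pool(frames[first:last])
    return global_part + local_part


# Toy example: six 1-dimensional frames, 0.5 s apart, with an NV event at 1.0-2.0 s.
toy_frames = [[0.0], [0.0], [4.0], [2.0], [0.0], [0.0]]
embedding = event_focused_embedding(toy_frames, 0.5, 1.0, 2.0)
```

In the toy example the event frames carry most of the signal, so the local half of the embedding differs clearly from the global half. That contrast is the intuition behind focusing on the event rather than the whole utterance.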
In practice
- Develop NV-specific datasets with expert ratings.
- Incorporate local event-focusing modules in audio models.
- Avoid general LLMs for precise non-verbal quality evaluation.
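To make "agreement with expert ratings" concrete, here is a small sketch of how any candidate scorer, whether a specialized model or a general LLM, could be checked against expert MOS. It assumes each clip has ratings from all three experts and uses Spearman rank correlation as the agreement measure; the summary does not say which metric the authors report, and the ratings and predictions below are made-up illustration values.

```python
from statistics import mean
from typing import List, Sequence


def mos(ratings_per_clip: Sequence[Sequence[float]]) -> List[float]:
    """Mean Opinion Score per clip: the average of that clip's expert ratings."""
    return [mean(ratings) for ratings in ratings_per_clip]


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: four clips, three expert ratings each on a 1-5 scale.
expert_ratings = [[4, 5, 4], [2, 2, 3], [3, 3, 3], [1, 2, 1]]
expert_mos = mos(expert_ratings)

# Hypothetical predictions from two scorers for the same four clips.
scorer_a = [4.1, 2.0, 3.2, 1.5]  # ranks the clips the same way the experts do
scorer_b = [1.5, 3.2, 2.0, 4.1]  # ranks the clips in the opposite order

agreement_a = spearman(scorer_a, expert_mos)
agreement_b = spearman(scorer_b, expert_mos)
```

A scorer that orders clips the way the experts do gets a correlation near 1, while one that disagrees on ordering scores much lower. Running the same check on a held-out expert's ratings gives a reference point for what "expert-level" agreement means on a given dataset.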
Topics
- Non-Verbal Vocalizations
- Speech Quality Assessment
- NVMOS Model
- Text-to-Speech
- Multimodal LLMs
- Perceptual Quality
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.