PASQA: Pitch-Accent-Focused Speech Quality Assessment Model Trained on Synthetic Speech with Accent Errors
Summary
PASQA, a novel Pitch-Accent-focused Speech Quality Assessment model, addresses the insensitivity of traditional mean opinion score (MOS) prediction models to localized pitch-accent errors in synthetic speech. Developed to explicitly target pitch-accent correctness, PASQA was trained using a controlled Japanese accent-error dataset. This dataset was generated by modifying accent patterns via an accent-controllable text-to-speech system, with a pseudo accent-quality score derived from the accent-error rate. The model integrates self-supervised representations, mora-conditioned fusion, ranking loss, an auxiliary accent-error localization task, and speaker-invariant training. Experimental results demonstrate PASQA's superior performance, achieving high ordering accuracy on both seen and unseen speakers, a task where conventional models fail to preserve accent-error severity ordering. Furthermore, PASQA exhibits stronger agreement with human accent-correctness judgments. The model's code is publicly available.
Key takeaway
If you are developing text-to-speech systems for pitch-accent languages such as Japanese, consider adding PASQA to your quality assessment pipeline. It evaluates pitch-accent correctness at a finer grain than traditional MOS predictors, which tend to overlook localized accent errors. Catching these errors during evaluation helps you find and fix accent problems that affect perceived speech quality.
Key insights
PASQA explicitly targets pitch-accent correctness: it preserves the ordering of accent-error severity where conventional MOS models fail, and it agrees more closely with human accent-correctness judgments.
Principles
- Utterance-level MOS models often miss localized pitch-accent errors.
- Training on controlled synthetic accent errors improves accent quality assessment.
- Speaker-invariant training enhances model generalization.
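The first principle can be checked with a pairwise ordering test: if a scorer misses localized accent errors, it will not rank a lightly corrupted utterance above a heavily corrupted one. The sketch below assumes ordering accuracy means the fraction of utterance pairs with different accent-error rates that the scorer ranks correctly. The summary does not define the metric, so this definition and the function name `ordering_accuracy` are assumptions, not the authors' implementation.

```python
from itertools import combinations


def ordering_accuracy(error_rates, predicted_scores):
    """Fraction of pairs with differing accent-error rates in which the
    utterance with fewer errors gets the strictly higher predicted score.

    Hypothetical metric definition; the source does not specify one.
    """
    if len(error_rates) != len(predicted_scores):
        raise ValueError("inputs must have the same length")
    correct = 0
    total = 0
    for i, j in combinations(range(len(error_rates)), 2):
        if error_rates[i] == error_rates[j]:
            continue  # tied severity: there is no ordering to preserve
        total += 1
        # 'better' is the index of the utterance with the lower error rate
        better, worse = (i, j) if error_rates[i] < error_rates[j] else (j, i)
        if predicted_scores[better] > predicted_scores[worse]:
            correct += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    rates = [0.0, 0.25, 0.5]
    print(ordering_accuracy(rates, [4.5, 3.9, 3.0]))  # scores track severity
    print(ordering_accuracy(rates, [4.0, 4.0, 4.0]))  # scorer blind to accent errors
```

A scorer that returns nearly the same value regardless of accent errors scores close to zero under this definition, which is the failure mode the principle describes.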
Method
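PASQA is trained on a Japanese accent-error dataset generated by modifying accent patterns with an accent-controllable TTS system. Each utterance is labeled with a pseudo accent-quality score derived from its accent-error rate. The model combines self-supervised representations, mora-conditioned fusion, a ranking loss, an auxiliary accent-error localization task, and speaker-invariant training.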
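To illustrate how a ranking loss can push a model to preserve accent-error severity ordering, here is a minimal sketch of a pairwise margin (hinge) ranking loss in plain Python. The summary does not state which ranking loss PASQA uses, so the hinge form, the `margin` value, and the function name `pairwise_ranking_loss` are all assumptions chosen for illustration.

```python
def pairwise_ranking_loss(pred_scores, target_scores, margin=0.1):
    """Mean hinge loss over all pairs whose target scores differ.

    A pair contributes zero loss when the predictions are ordered the same
    way as the targets and separated by at least `margin`.
    Assumed formulation; not taken from the PASQA paper.
    """
    if len(pred_scores) != len(target_scores):
        raise ValueError("inputs must have the same length")
    losses = []
    n = len(pred_scores)
    for i in range(n):
        for j in range(i + 1, n):
            diff = target_scores[i] - target_scores[j]
            if diff == 0:
                continue  # equal targets impose no ordering constraint
            sign = 1.0 if diff > 0 else -1.0
            pred_gap = pred_scores[i] - pred_scores[j]
            losses.append(max(0.0, margin - sign * pred_gap))
    return sum(losses) / len(losses) if losses else 0.0


if __name__ == "__main__":
    targets = [1.0, 0.5, 0.0]  # pseudo accent-quality scores
    print(pairwise_ranking_loss([0.9, 0.5, 0.1], targets))  # correctly ordered
    print(pairwise_ranking_loss([0.1, 0.5, 0.9], targets))  # reversed ordering
```

Unlike a regression loss on the score itself, this objective only depends on relative order within each pair, which matches the ordering-based evaluation described in the summary.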
In practice
- Use PASQA for fine-grained pitch-accent quality assessment in TTS.
- Leverage accent-controllable TTS for synthetic error dataset generation.
- Integrate speaker-invariant training for robust speech quality models.
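The dataset-generation idea can be sketched on symbolic accent labels before any audio is synthesized. The example below assumes each mora carries a high ("H") or low ("L") pitch label, corrupts a chosen number of moras by flipping their labels, and derives a pseudo score as one minus the accent-error rate. The label scheme, the flipping strategy, the linear score mapping, and every function name here are hypothetical; the summary only says the score is derived from the accent-error rate, and a real pipeline would pass the corrupted labels to an accent-controllable TTS system.

```python
import random


def corrupt_accent(pattern, num_errors, seed=0):
    """Flip the H/L label of `num_errors` randomly chosen moras.

    Returns the corrupted pattern and the sorted list of corrupted positions.
    Simplification: real Japanese accent patterns are constrained, and
    independent flips may produce patterns that a real system would not accept.
    """
    if not 0 <= num_errors <= len(pattern):
        raise ValueError("num_errors must be between 0 and len(pattern)")
    rng = random.Random(seed)  # seeded for reproducible corruption
    positions = sorted(rng.sample(range(len(pattern)), num_errors))
    corrupted = list(pattern)
    for p in positions:
        corrupted[p] = "L" if corrupted[p] == "H" else "H"
    return corrupted, positions


def accent_error_rate(reference, corrupted):
    """Share of moras whose accent label differs from the reference."""
    if not reference or len(reference) != len(corrupted):
        raise ValueError("patterns must be non-empty and equally long")
    mismatches = sum(r != c for r, c in zip(reference, corrupted))
    return mismatches / len(reference)


def pseudo_accent_score(error_rate):
    """Assumed mapping: 1.0 means no accent errors, 0.0 means all moras wrong."""
    return 1.0 - error_rate


if __name__ == "__main__":
    reference = list("LHHLL")  # hypothetical five-mora accent pattern
    corrupted, positions = corrupt_accent(reference, num_errors=2)
    rate = accent_error_rate(reference, corrupted)
    print("".join(corrupted), positions, rate, pseudo_accent_score(rate))
```

The returned positions could also serve as per-mora targets for an error-localization task of the kind the Method section mentions, though how PASQA builds those targets is not described in this summary.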
Topics
- Speech Quality Assessment
- Pitch Accent
- Text-to-Speech
- Accent Errors
- Self-supervised Learning
- Japanese Language
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.