PASQA: Pitch-Accent-Focused Speech Quality Assessment Model Trained on Synthetic Speech with Accent Errors
Summary
Pitch-Accent-focused Speech Quality Assessment (PASQA) is a model that targets a weakness of existing Mean Opinion Score (MOS) prediction models: their insensitivity to localized pitch-accent errors in synthesized speech. Developed for Japanese, PASQA is trained on a scalable synthetic accent-error dataset. A controllable text-to-speech (TTS) system generates the data by modifying accent patterns, and each utterance is assigned a pseudo accent-quality score. The model builds on self-supervised representations and combines four strategies: mora-conditioned fusion, a ranking loss, an auxiliary accent-error localization task, and speaker-invariant training via a gradient reversal layer. In the reported experiments, PASQA is better at preserving the ordering of accent-error severity and agrees more strongly with human accent-correctness judgments, reaching a Spearman's rank correlation coefficient (SRCC) of 0.828 and a Kendall's τ (KTAU) of 0.614. It also remains robust on out-of-domain TTS models such as GPT-4o-mini-TTS.
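For readers less familiar with the two reported metrics, the following is a minimal sketch of how SRCC and Kendall's τ compare a model's scores with human judgments. The function names and the toy scores are illustrative assumptions, not values or code from the paper, and the τ shown is the simple tau-a variant without tie correction.

```python
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def srcc(pred, human):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(pred), average_ranks(human))


def kendall_tau_a(pred, human):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(pred)), 2):
        s = (pred[i] - pred[j]) * (human[i] - human[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(pred) * (len(pred) - 1) // 2
    return (concordant - discordant) / n_pairs


# Toy data (made up): human accent-correctness ratings vs. model scores.
human = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [0.1, 0.4, 0.3, 0.8, 0.9]  # one adjacent pair is swapped

print(round(srcc(pred, human), 3))           # 0.9
print(round(kendall_tau_a(pred, human), 3))  # 0.8
```

Both metrics depend only on ordering, which is why they suit a model whose goal is to rank utterances by accent-error severity rather than to match absolute scores.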
Key takeaway
Machine learning engineers evaluating Japanese text-to-speech systems should not rely on conventional utterance-level MOS models alone, because these models do not accurately assess localized pitch-accent correctness. A specialized model such as PASQA, trained on synthetic accent-error data and extended with accent-aware architectural components, gives a more precise view of prosodic quality. It agrees more strongly with human judgments and remains robust on out-of-domain TTS, which makes it easier to target quality improvements in your systems.
Key insights
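PASQA assesses pitch-accent correctness in synthetic speech by training on a synthetic accent-error dataset with pseudo accent-quality scores and by combining mora-conditioned fusion, a ranking loss, accent-error localization, and speaker-invariant training.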
Principles
- Localized prosodic errors require targeted assessment models.
- Synthetic data with controlled errors can train specialized quality models.
- Self-supervised representations capture rich prosodic cues.
Method
PASQA uses wav2vec 2.0 features, mora-conditioned fusion, pairwise logistic ranking loss, an auxiliary frame-level error detection head, and a gradient reversal layer for speaker-invariant training.
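As an illustration of the pairwise logistic ranking loss named above, here is a minimal pure-Python sketch. It uses the common softplus-of-negative-margin form; the paper's exact formulation, pair sampling, and weighting are not given in this summary, so the function name and the choice to skip tied pairs are assumptions.

```python
import math
from itertools import combinations


def pairwise_logistic_ranking_loss(pred_scores, target_scores):
    """Mean of log(1 + exp(-margin)) over all pairs with different targets.

    margin is the predicted score difference, signed so that it is positive
    when the prediction orders the pair the same way as the targets.
    Assumption: pairs with equal targets carry no ordering information
    and are skipped.
    """
    losses = []
    for i, j in combinations(range(len(pred_scores)), 2):
        if target_scores[i] == target_scores[j]:
            continue
        sign = 1.0 if target_scores[i] > target_scores[j] else -1.0
        margin = sign * (pred_scores[i] - pred_scores[j])
        # Numerically stable softplus(-margin).
        losses.append(max(-margin, 0.0) + math.log1p(math.exp(-abs(margin))))
    return sum(losses) / len(losses) if losses else 0.0


# Toy pseudo accent-quality targets (higher = fewer accent errors).
targets = [3.0, 2.0, 1.0]
well_ordered = [0.9, 0.5, 0.1]
badly_ordered = [0.1, 0.5, 0.9]

print(pairwise_logistic_ranking_loss(well_ordered, targets))
print(pairwise_logistic_ranking_loss(badly_ordered, targets))
```

The loss penalizes only mis-ordered or weakly separated pairs, so it rewards preserving the severity ordering rather than regressing to exact score values.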
In practice
- Generate controlled accent-error datasets with controllable TTS.
- Incorporate linguistic features like mora sequences for accent modeling.
- Use ranking loss for ordinal quality assessment.
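The first practice above, generating controlled accent errors, can be sketched as follows. This is a hypothetical illustration: it encodes a word's accent as a high/low pitch pattern per mora using the standard Tokyo-style accent-nucleus convention, enumerates wrong nucleus positions, and assigns a pseudo score equal to the fraction of moras whose pitch level still matches the correct pattern. The scoring rule and all function names are assumptions; the summary does not state how the paper computes its pseudo accent-quality scores, and no TTS system is called here.

```python
def accent_pattern(n_moras, nucleus):
    """High/low pitch level per mora for a Tokyo-style accent type.

    nucleus = 0 means no pitch drop inside the word (heiban);
    nucleus = k means the pitch falls after the k-th mora.
    """
    if n_moras < 1 or not 0 <= nucleus <= n_moras:
        raise ValueError("nucleus must lie between 0 and n_moras")
    if nucleus == 1:
        return ["H"] + ["L"] * (n_moras - 1)
    pattern = ["L"] + ["H"] * (n_moras - 1)
    for idx in range(nucleus, n_moras):
        if nucleus >= 2:
            pattern[idx] = "L"  # moras after the nucleus are low
    return pattern


def make_error_variants(n_moras, correct_nucleus):
    """Enumerate wrong accent types with a hypothetical pseudo score.

    Assumed scoring rule: 1 - (mismatched moras / total moras).
    Note: word-internal patterns cannot distinguish nucleus 0 from
    nucleus n_moras, since that drop falls on a following particle.
    """
    correct = accent_pattern(n_moras, correct_nucleus)
    variants = []
    for nucleus in range(n_moras + 1):
        if nucleus == correct_nucleus:
            continue
        wrong = accent_pattern(n_moras, nucleus)
        mismatches = sum(a != b for a, b in zip(correct, wrong))
        variants.append((nucleus, "".join(wrong), 1 - mismatches / n_moras))
    return variants


# A 4-mora word whose correct accent nucleus is on the 2nd mora (LHLL).
for nucleus, pattern, score in make_error_variants(4, 2):
    print(nucleus, pattern, score)
```

In a real pipeline, each modified accent pattern would be passed to a controllable TTS system to synthesize the erroneous utterance, and the resulting pairs of scores could feed a ranking loss.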
Topics
- Speech Quality Assessment
- Pitch Accent
- Text-to-Speech
- Self-Supervised Learning
- Japanese Language Processing
- Accent Error Detection
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.