PASQA: Pitch-Accent-Focused Speech Quality Assessment Model Trained on Synthetic Speech with Accent Errors

Source: cs.CL updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Speech Technology) · Depth: Expert, long

Summary

Pitch-Accent-focused Speech Quality Assessment (PASQA) is a model built to address a specific weakness of existing Mean Opinion Score (MOS) prediction models: their insensitivity to localized pitch-accent errors in synthesized speech. Developed for Japanese, PASQA is trained on a scalable synthetic accent-error dataset generated with a controllable text-to-speech (TTS) system that modifies accent patterns and assigns pseudo accent-quality scores. The model builds on self-supervised representations and combines four strategies: mora-conditioned fusion, a ranking loss, an auxiliary accent-error localization task, and speaker-invariant training via a gradient reversal layer. In experiments, PASQA better preserves the ordering of accent-error severity and agrees more closely with human accent-correctness judgments, reaching a Spearman's rank correlation coefficient (SRCC) of 0.828 and a Kendall's τ (KTAU) of 0.614. It also performs robustly on out-of-domain TTS models such as GPT-4o-mini-TTS.
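
The two rank-correlation metrics quoted above can be illustrated with a minimal pure-Python sketch. It is not the paper's evaluation code: the function names, the toy scores, and the choice of the tie-aware tau-b variant of Kendall's τ are all assumptions made for illustration.

```python
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """SRCC: Pearson correlation computed on the ranks of x and y."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b: concordant minus discordant pairs, tie-corrected."""
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both variables: ignored by tau-b
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = ((concordant + discordant + ties_x)
             * (concordant + discordant + ties_y)) ** 0.5
    return (concordant - discordant) / denom


# Toy data (hypothetical): human accent-correctness ratings vs. model scores.
human = [1, 2, 3, 4, 5]
predicted = [0.1, 0.4, 0.3, 0.8, 0.9]
srcc = spearman(human, predicted)
ktau = kendall_tau_b(human, predicted)
```

On this toy data one pair of items is ranked in the wrong order, which gives an SRCC of 0.9 and a τ of 0.8. Both metrics depend only on ordering, so they reward a model for ranking utterances by accent quality even when its absolute scores are on a different scale from the human ratings.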

Key takeaway

For machine learning engineers evaluating Japanese text-to-speech systems, conventional utterance-level MOS models are not sufficient for assessing localized pitch-accent correctness. Consider adding a specialized model such as PASQA, which combines synthetic accent-error data with architectural enhancements, to measure prosodic quality more precisely. This approach agrees more closely with human judgments and holds up on out-of-domain TTS, which makes targeted quality improvements easier.

Key insights

PASQA assesses pitch-accent correctness in synthetic speech by training on a purpose-built synthetic accent-error dataset and adding accent-focused architectural and training enhancements.

Method

PASQA uses wav2vec 2.0 features, mora-conditioned fusion, pairwise logistic ranking loss, an auxiliary frame-level error detection head, and a gradient reversal layer for speaker-invariant training.
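
As an illustration of the pairwise logistic ranking loss named above, here is a minimal pure-Python sketch. It assumes the common formulation, log(1 + exp(-(s_i - s_j))) averaged over every pair in which item i has the higher target score. The exact pairing scheme and weighting used in PASQA are not given in this summary, and the function name and toy values are hypothetical.

```python
import math


def pairwise_logistic_ranking_loss(preds, targets):
    """Mean of log(1 + exp(-(s_i - s_j))) over pairs with target_i > target_j.

    preds   -- predicted accent-quality scores, one per utterance
    targets -- reference (for example, pseudo) accent-quality scores
    """
    total, n_pairs = 0.0, 0
    for i in range(len(preds)):
        for j in range(len(preds)):
            if targets[i] > targets[j]:
                margin = preds[i] - preds[j]
                # Numerically stable softplus(-margin).
                total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
                n_pairs += 1
    # With no strictly ordered pairs there is nothing to rank.
    return total / n_pairs if n_pairs else 0.0


# Toy batch (hypothetical): three utterances with decreasing reference quality.
targets = [3.0, 2.0, 1.0]
loss_good = pairwise_logistic_ranking_loss([0.9, 0.5, 0.1], targets)
loss_bad = pairwise_logistic_ranking_loss([0.1, 0.5, 0.9], targets)
```

Predictions that respect the reference ordering produce a lower loss than predictions that reverse it, so the objective pushes the model toward correct severity ordering rather than toward matching absolute score values.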

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.