PASQA: Pitch-Accent-Focused Speech Quality Assessment Model Trained on Synthetic Speech with Accent Errors

Source: Computation and Language · Field: Technology & Digital: Artificial Intelligence & Machine Learning, Speech Processing · Depth: Expert, quick

Summary

PASQA is a Pitch-Accent-focused Speech Quality Assessment model. It addresses a known weakness of conventional mean opinion score (MOS) prediction models: they are insensitive to localized pitch-accent errors in synthetic speech. PASQA is trained on a controlled Japanese accent-error dataset, generated by modifying accent patterns with an accent-controllable text-to-speech (TTS) system. Each utterance is labeled with a pseudo accent-quality score derived from its accent-error rate. The model combines self-supervised speech representations, mora-conditioned fusion, a ranking loss, an auxiliary accent-error localization task, and speaker-invariant training. In the reported experiments, PASQA achieves high ordering accuracy on both seen and unseen speakers, whereas conventional models fail to preserve the ordering of accent-error severity. PASQA also agrees more strongly with human accent-correctness judgments. The model's code is publicly available.
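The summary does not define "ordering accuracy", so the following is only one plausible reading, sketched under stated assumptions: for every pair of utterances with different accent-error rates, check whether the predicted score is higher for the utterance with fewer errors. The function name `ordering_accuracy` and the skipping of tied pairs are illustrative choices, not taken from the paper.

```python
from itertools import combinations


def ordering_accuracy(error_rates, predicted_scores):
    """Fraction of utterance pairs whose predicted order matches error severity.

    Assumption (not from the paper): a pair counts as correct when the
    utterance with the lower accent-error rate gets the strictly higher
    predicted score. Pairs with equal error rates are skipped.
    """
    if len(error_rates) != len(predicted_scores):
        raise ValueError("inputs must have the same length")

    correct = 0
    total = 0
    for i, j in combinations(range(len(error_rates)), 2):
        if error_rates[i] == error_rates[j]:
            continue  # no severity ordering to preserve for this pair
        total += 1
        # Identify which utterance of the pair has fewer accent errors.
        better, worse = (i, j) if error_rates[i] < error_rates[j] else (j, i)
        if predicted_scores[better] > predicted_scores[worse]:
            correct += 1

    # With no comparable pairs the metric is undefined.
    return correct / total if total else float("nan")


if __name__ == "__main__":
    rates = [0.0, 0.25, 0.5]
    print(ordering_accuracy(rates, [4.5, 3.9, 2.1]))
    print(ordering_accuracy(rates, [3.0, 3.1, 2.0]))
```

Under this reading, a predictor that rates all utterances about equally regardless of accent errors would score poorly, which matches the failure mode the summary attributes to conventional MOS models.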

Key takeaway

If you are an NLP engineer or AI scientist building text-to-speech systems for a pitch-accent language such as Japanese, consider adding PASQA to your quality assessment pipeline. It evaluates pitch-accent correctness at a finer grain than conventional MOS predictors, which tend to overlook localized accent errors. Because accent errors directly affect perceived speech quality, an accent-sensitive metric can help you catch regressions that a general MOS score would miss.

Key insights

PASQA explicitly targets pitch-accent correctness. Unlike general MOS prediction models, it preserves the ordering of accent-error severity and agrees more closely with human accent-correctness judgments.

Method

PASQA is trained on a Japanese accent-error dataset created by modifying accent patterns with an accent-controllable TTS system. Each utterance receives a pseudo accent-quality score derived from its accent-error rate. The model combines self-supervised representations, mora-conditioned fusion, a ranking loss, an auxiliary accent-error localization task, and speaker-invariant training.
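To make the labeling and ranking ideas concrete, here is a minimal sketch. It is not PASQA's implementation: the per-mora high/low accent labels, the linear mapping from error rate to a 1 to 5 score, the margin value, and all function names are assumptions introduced for illustration only.

```python
def accent_error_rate(reference, modified):
    """Fraction of positions whose accent label differs from the reference.

    Assumption: both inputs are equal-length sequences of per-mora
    labels such as "H" (high) and "L" (low).
    """
    if len(reference) != len(modified):
        raise ValueError("label sequences must have the same length")
    if not reference:
        return 0.0
    errors = sum(r != m for r, m in zip(reference, modified))
    return errors / len(reference)


def pseudo_accent_score(error_rate, min_score=1.0, max_score=5.0):
    """Map an error rate in [0, 1] to a pseudo quality score.

    Assumption: a linear mapping onto a MOS-like 1 to 5 range. The
    source only says the score is derived from the accent-error rate.
    """
    return max_score - (max_score - min_score) * error_rate


def pairwise_ranking_loss(pred_better, pred_worse, margin=0.1):
    """Hinge-style ranking loss for one pair of predictions.

    The loss is zero once the utterance with fewer accent errors is
    scored at least `margin` above the one with more errors.
    """
    return max(0.0, margin - (pred_better - pred_worse))


if __name__ == "__main__":
    reference = ["L", "H", "H", "H"]
    modified = ["L", "H", "L", "L"]  # two of four labels altered
    rate = accent_error_rate(reference, modified)
    print(rate, pseudo_accent_score(rate))
    print(pairwise_ranking_loss(3.0, 2.0))              # correctly ordered
    print(pairwise_ranking_loss(2.0, 3.0, margin=0.5))  # wrongly ordered
```

A ranking objective of this kind penalizes wrong orderings rather than absolute score errors, which may be why it pairs naturally with pseudo labels that are only meaningful relative to one another.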

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.