Light-weight Pronunciation Assessment via Discrete Speech Token Surprisal
Summary
A new lightweight framework for automated pronunciation assessment avoids the high cost of collecting labeled learner errors or non-native speech corpora. The system is trained exclusively on native speech resources and operates either unsupervised or with minimal calibration on a small set of scored utterances. During inference, learner speech is discretized via a self-supervised learning (SSL) encoder and a K-means codebook. A token language model, trained on native sequences, then computes the surprisal of the learner's token sequence, where elevated surprisal signals phonotactic deviation. The framework also incorporates a transcript-guided Text2DUnit-DTW module that predicts native token sequences from the reference text and aligns them to the acoustic tokens, yielding error-sensitive features. The surprisal and alignment features are combined through simple regression. On SpeechOcean762, transcript guidance raised the Pearson correlation coefficient (PCC) from 0.60 to 0.66, approaching supervised baseline results, and cross-dataset evaluation on L2-ARCTIC showed consistent gains.
Key takeaway
NLP engineers building automated pronunciation assessment systems, especially where labeled non-native data is expensive to obtain, should consider a native-speech-only training paradigm. Combining discrete speech token surprisal with transcript-guided alignment raises the Pearson correlation coefficient (PCC) on SpeechOcean762 from 0.60 to 0.66, approaching supervised baselines. This reduces the data collection burden for language learning applications while keeping accuracy close to that of supervised systems.
Key insights
Lightweight pronunciation assessment can be achieved by analyzing discrete speech token surprisal from native speech models, reducing reliance on costly labeled non-native data.
Principles
- Training on native speech reduces data costs.
- Higher token surprisal indicates phonotactic deviation.
- Transcript guidance improves assessment accuracy.
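The second principle can be illustrated with a deliberately tiny model. The sketch below is an assumption-laden stand-in, not the paper's token language model: it uses an add-k smoothed bigram model over integer token IDs, and the names `train_bigram` and `mean_surprisal` are hypothetical. It only shows how a sequence that breaks native token ordering receives a higher average surprisal.

```python
import math
from collections import Counter

def train_bigram(sequences, vocab_size, k=1.0):
    """Count bigrams over native token sequences; return a smoothed log-prob function.

    Assumption: a bigram model with add-k smoothing stands in for the
    token language model described in the summary.
    """
    pair_counts = Counter()
    context_counts = Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1

    def log_prob(prev, cur):
        num = pair_counts[(prev, cur)] + k
        den = context_counts[prev] + k * vocab_size
        return math.log(num / den)

    return log_prob

def mean_surprisal(seq, log_prob):
    """Average negative log-probability (in nats) per token transition."""
    if len(seq) < 2:
        return 0.0
    total = -sum(log_prob(p, c) for p, c in zip(seq, seq[1:]))
    return total / (len(seq) - 1)

# Toy "native" token sequences (made-up data for illustration only).
native = [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 3, 2]]
lp = train_bigram(native, vocab_size=4)

typical = mean_surprisal([0, 1, 2, 3], lp)   # follows native ordering
atypical = mean_surprisal([3, 0, 2, 1], lp)  # transitions never seen in training
print(typical < atypical)
```

In this toy setup the unseen transitions fall back to the smoothing mass, so the scrambled sequence scores a higher mean surprisal than the native-like one.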
Method
Learner speech is discretized using an SSL encoder and a K-means codebook. A token language model trained on native speech computes surprisal over the resulting tokens. A Text2DUnit-DTW module predicts a native token sequence from the reference text and aligns it to the acoustic tokens. Surprisal and alignment features are fused via simple regression.
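The discretization step can be sketched as nearest-centroid lookup against a fixed codebook. This is a minimal illustration under stated assumptions: the frame vectors stand in for SSL encoder outputs, the codebook is hand-written rather than fitted by K-means, and the helper names `quantize` and `collapse_repeats` are hypothetical. Collapsing consecutive duplicate tokens is a common convention for discrete speech units, but the summary does not say whether the framework does it.

```python
def quantize(frames, codebook):
    """Map each frame vector to the index of its nearest codebook centroid."""
    tokens = []
    for frame in frames:
        # Squared Euclidean distance from this frame to every centroid.
        dists = [
            sum((f - c) ** 2 for f, c in zip(frame, centroid))
            for centroid in codebook
        ]
        tokens.append(dists.index(min(dists)))
    return tokens

def collapse_repeats(tokens):
    """Merge runs of identical consecutive tokens (an assumed, optional step)."""
    out = []
    for t in tokens:
        if not out or out[-1] != t:
            out.append(t)
    return out

# Toy 2-D "features" and a 3-entry codebook (made-up values for illustration).
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 2.0)]
frames = [(0.1, -0.1), (0.2, 0.1), (0.9, 1.2), (0.1, 1.8), (0.0, 2.1)]

tokens = quantize(frames, codebook)
units = collapse_repeats(tokens)
print(tokens, units)
```

The resulting integer sequence is what a token language model or an alignment step would consume in place of raw audio features.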
In practice
- Build pronunciation tools without collecting labeled non-native data.
- Integrate SSL encoders for speech discretization.
- Use DTW to align text-predicted token sequences with acoustic tokens.
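The DTW step mentioned above can be sketched with the classic dynamic-programming recurrence. The version below is an assumption for illustration: it compares integer token IDs with a 0/1 mismatch cost, whereas the summary does not specify the cost function or step pattern used by the Text2DUnit-DTW module. The function name `dtw_cost` is hypothetical.

```python
def dtw_cost(reference, observed):
    """Total cost of the best DTW alignment between two token sequences.

    Assumption: local cost is 0 for matching tokens and 1 otherwise.
    """
    n, m = len(reference), len(observed)
    inf = float("inf")
    # D[i][j] = best cost aligning reference[:i] with observed[:j].
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if reference[i - 1] == observed[j - 1] else 1.0
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# Toy token sequences (made-up IDs for illustration).
predicted = [5, 7, 9]         # tokens predicted from the reference text
stretched = [5, 7, 7, 9]      # same tokens, one held longer
substituted = [5, 3, 9]       # middle token replaced

print(dtw_cost(predicted, stretched))    # → 0.0
print(dtw_cost(predicted, substituted))  # → 1.0
```

A duration difference costs nothing because DTW can map one reference token to several observed tokens, while a substituted token raises the cost. A length-normalized version of this cost is one plausible alignment feature to pass, alongside surprisal, to a regression model.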
Topics
- Pronunciation Assessment
- Discrete Speech Tokens
- Speech Token Surprisal
- SSL Encoders
- Dynamic Time Warping
- L2-ARCTIC
Best for: Research Scientist, NLP Engineer, Machine Learning Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.