Light-weight Pronunciation Assessment via Discrete Speech Token Surprisal
Summary
This work introduces a lightweight pronunciation assessment framework that trains exclusively on native speech and needs little or no labeled learner data. Learner speech is discretized with a self-supervised learning (SSL) encoder and a K-means codebook, and a token language model (TLM) trained on native tokens then scores the resulting sequence: high surprisal indicates a phonotactic deviation from native norms. An optional transcript-guided Text2DUnit–DTW module predicts the canonical native token sequence from the reference text and aligns it with the learner's acoustic tokens to produce error-sensitive features. The surprisal and alignment features are combined with a simple regression model. On SpeechOcean762, transcript guidance raised the Pearson correlation coefficient (PCC) from 0.60 to 0.66, approaching supervised baselines. Cross-dataset evaluation on L2-ARCTIC showed consistent gains, and performance held up under an order-of-magnitude reduction in native training data.
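For readers unfamiliar with the metric, PCC measures how linearly the predicted scores track human ratings. The following is a minimal sketch of the standard formula, not the paper's evaluation code; the function name `pearson_cc` is chosen here for illustration.

```python
import math
from typing import Sequence


def pearson_cc(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Pearson correlation between predicted scores and reference ratings."""
    if len(pred) != len(gold) or len(pred) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_p = sum(pred) / len(pred)
    mean_g = sum(gold) / len(gold)
    # Covariance and variances (unnormalized; the 1/N factors cancel).
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_g = sum((g - mean_g) ** 2 for g in gold)
    if var_p == 0 or var_g == 0:
        raise ValueError("PCC is undefined when one sequence is constant")
    return cov / math.sqrt(var_p * var_g)
```

A value of 1 means the predictions rank and scale utterances exactly as the raters do, while 0 means no linear relationship.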
Key takeaway
For machine learning and NLP engineers building pronunciation assessment systems, particularly for low-resource languages or specialized speaking styles, this framework shows that training on native speech alone can reach competitive performance. That substantially reduces the need for costly labeled learner data, phoneme inventories, and forced alignment. When reference transcripts are available, add transcript guidance: it improved accuracy on SpeechOcean762, and the gains carried over to L2-ARCTIC.
Key insights
Lightweight pronunciation assessment uses native speech token surprisal and text-guided alignment, avoiding costly labeled learner data.
Principles
- Native speech defines phonotactic norms.
- Discrete tokens abstract acoustic space.
- Surprisal measures phonotactic deviation.
Method
Discretize speech via SSL encoder and K-means. Train n-gram TLM on native tokens. Predict native token sequences from text using Text2DUnit. Align learner acoustic tokens with text-derived tokens via DTW. Fuse surprisal and alignment features with regression.
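The alignment step can be illustrated with classic dynamic time warping (DTW) over two token-ID sequences. This is a sketch under stated assumptions, not the authors' implementation: the summary does not give the local cost function, so a 0/1 token-mismatch cost is assumed, and the function name `dtw_align` and the toy token IDs are hypothetical.

```python
from typing import List, Sequence, Tuple


def dtw_align(
    learner: Sequence[int], canonical: Sequence[int]
) -> Tuple[float, List[Tuple[int, int]]]:
    """Align learner acoustic tokens with text-derived tokens using DTW.

    Local cost is 0 for equal token IDs and 1 otherwise (an assumption).
    Returns the total cost and the warping path as
    (learner_index, canonical_index) pairs.
    """
    n, m = len(learner), len(canonical)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # acc[i][j]: minimal cost of aligning learner[:i] with canonical[:j].
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if learner[i - 1] == canonical[j - 1] else 1.0
            acc[i][j] = cost + min(
                acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1]
            )
    # Backtrack from the end, always stepping to the cheapest predecessor.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
    path.reverse()
    return acc[n][m], path


# Toy token IDs (made up): repeated frames align for free, substitutions cost 1.
clean_cost, clean_path = dtw_align([5, 5, 9, 9, 3], [5, 9, 3])
error_cost, _ = dtw_align([5, 5, 7, 7, 3], [5, 9, 3])
```

Quantities such as the total cost or the mismatch rate along the path could then serve as alignment features for the regression step, alongside the surprisal features.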
In practice
- Use HuBERT base Layer 9 for SSL features.
- Fit K-means codebook with K=512.
- Train a 3-gram token language model.
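The last two settings can be made concrete with a small count-based 3-gram model over K=512 token IDs. This is a minimal sketch, not the paper's toolkit: add-one smoothing, the `BOS` padding ID, and the class and function names are assumptions made for illustration, and the token sequences are assumed to come from the HuBERT and K-means steps listed above.

```python
import math
from collections import Counter
from typing import Iterable, List, Sequence

K = 512    # codebook size, as listed above
ORDER = 3  # n-gram order, as listed above
BOS = -1   # hypothetical padding ID marking the start of a sequence


class TrigramTLM:
    """Count-based 3-gram token LM with add-one smoothing (an assumption)."""

    def __init__(self, vocab_size: int = K) -> None:
        self.vocab_size = vocab_size
        self.ngram_counts: Counter = Counter()
        self.context_counts: Counter = Counter()

    def fit(self, sequences: Iterable[Sequence[int]]) -> "TrigramTLM":
        """Accumulate 3-gram and 2-token context counts from native sequences."""
        for seq in sequences:
            padded = [BOS] * (ORDER - 1) + list(seq)
            for t in range(ORDER - 1, len(padded)):
                context = tuple(padded[t - ORDER + 1 : t])
                self.ngram_counts[context + (padded[t],)] += 1
                self.context_counts[context] += 1
        return self

    def surprisal(self, seq: Sequence[int]) -> List[float]:
        """Per-token surprisal in bits: -log2 P(token | previous two tokens)."""
        padded = [BOS] * (ORDER - 1) + list(seq)
        values = []
        for t in range(ORDER - 1, len(padded)):
            context = tuple(padded[t - ORDER + 1 : t])
            num = self.ngram_counts[context + (padded[t],)] + 1
            den = self.context_counts[context] + self.vocab_size
            values.append(-math.log2(num / den))
        return values


def mean_surprisal(model: TrigramTLM, seq: Sequence[int]) -> float:
    """Utterance-level feature: average surprisal over all tokens."""
    if len(seq) == 0:
        raise ValueError("sequence must be non-empty")
    values = model.surprisal(seq)
    return sum(values) / len(values)


# Toy "native" data (made up): the same token pattern repeated 50 times.
tlm = TrigramTLM().fit([[1, 2, 3, 4]] * 50)
native_like = mean_surprisal(tlm, [1, 2, 3, 4])
deviant = mean_surprisal(tlm, [4, 3, 2, 1])
```

In this toy setup the familiar ordering scores lower mean surprisal than the reversed one, which is the behavior the framework relies on to flag non-native token patterns.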
Topics
- Pronunciation Assessment
- Discrete Speech Tokens
- Self-supervised Learning
- Token Surprisal
- Dynamic Time Warping
- Computer-Assisted Language Learning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.