Light-weight Pronunciation Assessment via Discrete Speech Token Surprisal

2026-06-19 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Speech Processing · Depth: Expert, long

Summary

The paper introduces a lightweight pronunciation assessment framework that needs little or no labeled learner data because it is trained only on native speech. The system discretizes learner speech with a self-supervised learning (SSL) encoder and a K-means codebook, then scores the resulting token sequence with a token language model (TLM); high surprisal indicates a phonotactic deviation from native norms. An optional transcript-guided Text2DUnit–DTW module predicts the canonical native token sequence from the reference text and aligns it with the learner's acoustic tokens using dynamic time warping (DTW) to produce error-sensitive features. The surprisal and alignment features are combined with a simple regression. On SpeechOcean762, transcript guidance raised the Pearson correlation coefficient (PCC) from 0.60 to 0.66, approaching supervised baselines. Cross-dataset evaluation on L2-ARCTIC also showed consistent gains, and performance held up when the native training data was reduced by an order of magnitude.
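
To make the fusion and evaluation steps concrete, here is a minimal sketch of combining two features with a linear regression and scoring the predictions with PCC. The source says only "simple regression", so ordinary least squares is an assumption, and the function names and toy numbers are hypothetical rather than taken from the paper.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def fit_linear(features, targets):
    """Ordinary least squares with an intercept (assumed regressor).

    Solves the normal equations by Gauss-Jordan elimination and returns
    [intercept, w1, w2, ...].
    """
    rows = [[1.0] + list(f) for f in features]
    p = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    aug = [
        [sum(row[i] * row[j] for row in rows) for j in range(p)]
        + [sum(row[i] * t for row, t in zip(rows, targets))]
        for i in range(p)
    ]
    for col in range(p):
        pivot = max(range(col, p), key=lambda idx: abs(aug[idx][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; cannot fit")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for other in range(p):
            if other != col:
                factor = aug[other][col] / aug[col][col]
                aug[other] = [v - factor * w for v, w in zip(aug[other], aug[col])]
    return [aug[i][p] / aug[i][i] for i in range(p)]

def predict(weights, features):
    """Apply the fitted weights to rows of features."""
    return [weights[0] + sum(w * v for w, v in zip(weights[1:], f)) for f in features]

# Hypothetical toy data: [mean surprisal, alignment cost] per utterance,
# with made-up human scores that fall as both features rise.
toy_features = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 3.0], [3.0, 2.0]]
toy_scores = [4.0, 4.5, 2.5, 2.5, 1.0]
toy_weights = fit_linear(toy_features, toy_scores)
toy_pcc = pearson(predict(toy_weights, toy_features), toy_scores)
```

In a real setup the regression would be fitted on one split and the PCC reported on a held-out split; the toy data here is fitted and scored on the same points only to keep the sketch short.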

Key takeaway

For machine learning and NLP engineers building pronunciation assessment systems, especially for low-resource languages or specialized speaking styles, this framework reports competitive performance from training on native speech alone. That reduces the need for costly labeled learner data, phoneme inventories, or forced alignment when building computer-assisted language learning tools. When a reference transcript is available, transcript guidance improved accuracy on SpeechOcean762, and the gains carried over to L2-ARCTIC.

Key insights

The framework scores pronunciation from token surprisal under a native-speech model plus text-guided alignment, so it avoids costly labeled learner data.

Principles

Native speech defines phonotactic norms.
Discrete tokens abstract acoustic space.
Surprisal measures phonotactic deviation.
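
The surprisal principle above can be sketched with a small trigram model over token ids. This is an illustration under assumptions: the source specifies an n-gram TLM but not its smoothing, so add-k smoothing, the `BOS` padding id, and the class name `TrigramTLM` are all hypothetical choices.

```python
import math
from collections import Counter

BOS = -1  # hypothetical padding id, outside the real token range

class TrigramTLM:
    """Add-k smoothed trigram model over discrete speech token ids."""

    def __init__(self, vocab_size, k=0.1):
        self.vocab_size = vocab_size
        self.k = k  # smoothing constant (assumed, not from the source)
        self.trigram_counts = Counter()
        self.context_counts = Counter()

    def fit(self, sequences):
        """Count trigrams and their two-token contexts in native sequences."""
        for seq in sequences:
            padded = [BOS, BOS] + list(seq)
            for i in range(2, len(padded)):
                context = (padded[i - 2], padded[i - 1])
                self.trigram_counts[context + (padded[i],)] += 1
                self.context_counts[context] += 1
        return self

    def surprisal(self, seq):
        """Return per-token surprisal in bits: -log2 p(token | two previous)."""
        padded = [BOS, BOS] + list(seq)
        values = []
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            numerator = self.trigram_counts[context + (padded[i],)] + self.k
            denominator = self.context_counts[context] + self.k * self.vocab_size
            values.append(-math.log2(numerator / denominator))
        return values

def mean_surprisal(model, seq):
    """Utterance-level feature: average surprisal over all tokens."""
    values = model.surprisal(seq)
    return sum(values) / len(values) if values else 0.0

# Toy native data with a strict 1-2-3 pattern (made-up token ids).
native_sequences = [[1, 2, 3, 1, 2, 3, 1, 2, 3], [1, 2, 3, 1, 2, 3]]
tlm = TrigramTLM(vocab_size=4).fit(native_sequences)
native_like = mean_surprisal(tlm, [1, 2, 3, 1, 2, 3])  # follows the pattern
deviant = mean_surprisal(tlm, [3, 1, 0, 2, 2, 1])      # breaks the pattern
```

The sequence that breaks the native pattern receives the higher mean surprisal, which is the signal the framework treats as evidence of phonotactic deviation.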

Method

Discretize speech via an SSL encoder and K-means.
Train an n-gram TLM on native tokens.
Predict native token sequences from text using Text2DUnit.
Align learner acoustic tokens with text-derived tokens via DTW.
Fuse surprisal and alignment features with regression.
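
The DTW alignment step can be illustrated with a short sketch over two token sequences. The 0/1 mismatch cost and the two derived features are assumptions made for illustration; the source does not state the local cost or the exact alignment features, and `dtw_align` and `alignment_features` are hypothetical names.

```python
def dtw_align(learner, canonical):
    """Align two token sequences with DTW using a 0/1 mismatch cost.

    Returns (total_cost, path), where path is a list of
    (learner_index, canonical_index) pairs.
    """
    n, m = len(learner), len(canonical)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = 0.0 if learner[i - 1] == canonical[j - 1] else 1.0
            cost[i][j] = local + min(
                cost[i - 1][j - 1],  # advance both sequences
                cost[i - 1][j],      # advance learner only
                cost[i][j - 1],      # advance canonical only
            )
    # Backtrack from the end to recover one optimal path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates, key=lambda c: c[0])
    path.reverse()
    return cost[n][m], path

def alignment_features(learner, canonical):
    """Illustrative features: mismatch rate along the path and length ratio."""
    total, path = dtw_align(learner, canonical)
    return {
        "mismatch_rate": total / len(path),
        "length_ratio": len(learner) / len(canonical),
    }

# Made-up token ids: the learner lengthens token 5 but substitutes nothing.
clean_cost, clean_path = dtw_align([5, 5, 7, 9], [5, 7, 9])
# Here the learner produces token 8 where the canonical sequence has 7.
sub_cost, _ = dtw_align([5, 8, 9], [5, 7, 9])
```

Because DTW lets one canonical token absorb repeated learner tokens, a lengthened sound costs nothing under this scheme, while a substituted token adds to the cost.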

In practice

Use HuBERT base Layer 9 for SSL features.
Fit K-means codebook with K=512.
Train a 3-gram token language model.
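
The codebook step in the list above can be sketched as a plain Lloyd's K-means that maps each feature frame to the id of its nearest centroid. This is a toy illustration with 2-dimensional made-up frames and K=2 in place of real HuBERT features and K=512; the evenly spaced initialization and the function names are assumptions, and a real pipeline would more likely use a library implementation such as scikit-learn's `KMeans`.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_tokens(frames, centroids):
    """Map each frame to the index of its nearest centroid (its token id)."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(frame, centroids[c]))
        for frame in frames
    ]

def fit_kmeans(frames, k, iterations=20):
    """Lloyd's algorithm with deterministic, evenly spaced initialization."""
    n = len(frames)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of frames")
    centroids = [list(frames[(i * n) // k]) for i in range(k)]
    for _ in range(iterations):
        labels = assign_tokens(frames, centroids)
        for c in range(k):
            members = [f for f, label in zip(frames, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                dim = len(members[0])
                centroids[c] = [
                    sum(member[d] for member in members) / len(members)
                    for d in range(dim)
                ]
    return centroids

# Made-up frames forming two well-separated groups.
toy_frames = [
    [0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
    [5.0, 5.1], [5.1, 5.0], [5.0, 5.0],
]
codebook = fit_kmeans(toy_frames, k=2)
tokens = assign_tokens(toy_frames, codebook)
```

Once the codebook is fitted on native speech, the same `assign_tokens` call would turn learner frames into the token sequence that the language model scores.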

Topics

Pronunciation Assessment
Discrete Speech Tokens
Self-supervised Learning
Token Surprisal
Dynamic Time Warping
Computer-Assisted Language Learning

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.