Prior over Evidence: Stereotype-Driven Diagnosis in LLM-Based L2 Pronunciation Feedback
Summary
A study of Large Language Models (LLMs) for second-language (L2) English pronunciation feedback finds that their diagnoses are often driven by pretraining priors rather than by the supplied speech evidence. Researchers tested three audio-capable LLMs on 1,800 L2-Arctic utterances from six L1 backgrounds, evaluating four pronunciation dimensions under five evidence conditions. The key finding is that rating accuracy and grounded reasoning are decoupled: 39.6% of judgments showed coherent but incorrect reasoning, compared with 15.8% that showed correct reasoning. Phoneme-level feedback consistently named a fixed set of L2-English difficulty phones, irrespective of L1 background or evidence type. Acoustic evidence improved ratings only when it directly probed the target dimension; for instance, textualized F0 range raised pitch-variation grounding from 0.18-0.19 to 0.45-0.62, while raw audio alone did not. This suggests LLMs are more effective as verbalizers of externally computed evidence than as independent diagnostic tools.
Key takeaway
NLP engineers developing L2 pronunciation feedback systems should recognize that current LLMs prioritize pretraining stereotypes over actual speech evidence. Integrate external acoustic feature extractors that supply explicit, targeted evidence, rather than relying on raw audio or on general LLM capabilities. This approach improves diagnostic grounding, as seen with F0 range, and reduces the risk of coherent but incorrect feedback. Validate LLM outputs rigorously against gold labels.
Key insights
LLMs often prioritize pretraining stereotypes over actual acoustic evidence in L2 pronunciation diagnosis.
Principles
- LLM reasoning can be coherent but incorrect.
- LLM feedback may reflect pretraining priors.
- Direct feature input improves LLM grounding.
Method
The study evaluated LLM pronunciation feedback on 1,800 L2-Arctic utterances from six L1 backgrounds, using three audio-capable LLMs, four pronunciation dimensions, and five evidence conditions. Outputs were scored on Rating Accuracy, Evidence Coherence, and Grounded Correctness.
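This summary does not define the three scores, so the following is only an illustrative sketch of how such a tally could be computed. It assumes each judgment has already been annotated with two hypothetical boolean flags (`rating_correct` and `evidence_coherent`) and treats a judgment as grounded and correct when both hold. The study's actual scoring procedure may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    # Both fields are hypothetical annotations, not the study's data format.
    rating_correct: bool      # model rating matches the gold label
    evidence_coherent: bool   # explanation is consistent with the evidence it cites


def summarize(judgments):
    """Return the share of judgments falling into each scoring category."""
    n = len(judgments)
    if n == 0:
        raise ValueError("need at least one judgment")

    def share(predicate):
        return sum(1 for j in judgments if predicate(j)) / n

    return {
        "rating_accuracy": share(lambda j: j.rating_correct),
        "evidence_coherence": share(lambda j: j.evidence_coherent),
        # Assumed definition: correct rating AND coherent evidence use.
        "grounded_correctness": share(
            lambda j: j.rating_correct and j.evidence_coherent
        ),
        # The failure mode highlighted in the summary: plausible but wrong.
        "coherent_but_incorrect": share(
            lambda j: j.evidence_coherent and not j.rating_correct
        ),
    }
```

Reporting the coherent-but-incorrect share separately makes the decoupling visible: a model can have a fluent, consistent explanation and still assign the wrong rating.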
In practice
- Pre-process acoustic features for LLM input.
- Validate LLM diagnoses against ground truth.
- Avoid LLMs as standalone diagnostic engines.
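The first two practices above can be sketched as follows. This is a minimal illustration, not the study's pipeline: it assumes an F0 contour has already been extracted by some external pitch tracker (with unvoiced frames given as 0 or None), and the function names, the semitone unit, and the evidence-string wording are all hypothetical choices.

```python
import math


def _voiced(f0_hz):
    """Keep only voiced frames (positive F0 values)."""
    return [f for f in f0_hz if f is not None and f > 0]


def f0_range_semitones(f0_hz):
    """F0 range over voiced frames in semitones, or None if too few frames."""
    voiced = _voiced(f0_hz)
    if len(voiced) < 2:
        return None
    return 12.0 * math.log2(max(voiced) / min(voiced))


def textualize_f0_evidence(f0_hz):
    """Render the F0 range as a text line that can be placed in an LLM prompt."""
    span = f0_range_semitones(f0_hz)
    if span is None:
        return "F0 evidence: unavailable (too few voiced frames)."
    voiced = _voiced(f0_hz)
    return (
        f"F0 evidence: min {min(voiced):.0f} Hz, max {max(voiced):.0f} Hz, "
        f"range {span:.1f} semitones."
    )


def rating_accuracy(predicted, gold):
    """Fraction of LLM ratings that exactly match the gold labels."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```

For example, `textualize_f0_evidence([0, 100.0, 200.0, None, 150.0])` yields a one-line description of a 12-semitone range, which can be prepended to the prompt for the pitch-variation dimension. Computing the feature outside the model keeps the evidence explicit and checkable, and `rating_accuracy` gives a simple check of the resulting ratings against gold labels.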
Topics
- Large Language Models
- L2 Pronunciation Feedback
- Acoustic Features
- Stereotype Bias
- Speech Diagnostics
Best for: Research Scientist, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.