Prior over Evidence: Stereotype-Driven Diagnosis in LLM-Based L2 Pronunciation Feedback

Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A study of large language models (LLMs) used for second-language (L2) English pronunciation feedback finds that their diagnoses are often driven by pretraining priors rather than by the speech evidence supplied. The researchers tested three audio-capable LLMs on 1,800 L2-Arctic utterances from speakers of six first-language (L1) backgrounds, evaluating four pronunciation dimensions under five evidence conditions. The key finding is that rating accuracy and grounded reasoning are decoupled: 39.6% of judgments showed coherent but incorrect reasoning, versus 15.8% for correct reasoning. Phoneme-level feedback consistently named the same fixed set of L2-English difficulty phones, regardless of L1 background or evidence type. Acoustic evidence improved ratings only when it directly probed the target dimension: textualized F0 range raised pitch-variation grounding from 0.18-0.19 to 0.45-0.62, while raw audio alone did not. This suggests that LLMs are more effective as verbalizers of externally computed evidence than as independent diagnostic tools.

Key takeaway

NLP engineers building L2 pronunciation feedback systems should recognize that current LLMs tend to prioritize pretraining stereotypes over the actual speech evidence. Integrate external acoustic feature extractors that supply explicit, targeted evidence, rather than relying on raw audio or general LLM capabilities. In the study this approach improved diagnostic grounding, as seen with F0 range, and it reduces the risk of coherent but incorrect feedback. Validate LLM outputs rigorously against gold labels.
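As a rough illustration of the takeaway above, the sketch below turns an already extracted F0 contour into a short text line that could be placed in an LLM prompt. It is not the study's pipeline: the function names, the convention that unvoiced frames are 0 or NaN, and the wording of the evidence line are all assumptions made for this example. The semitone conversion, 12 · log2(f_max / f_min), is the standard one.

```python
import math
from typing import List, Sequence


def voiced_frames(f0_hz: Sequence[float]) -> List[float]:
    """Keep voiced frames only (assumption: unvoiced frames are 0 or NaN)."""
    # NaN fails the `> 0` comparison, so it is dropped together with zeros.
    return [f for f in f0_hz if f > 0]


def f0_range_semitones(f0_hz: Sequence[float]) -> float:
    """Span between the lowest and highest voiced F0, in semitones."""
    voiced = voiced_frames(f0_hz)
    if len(voiced) < 2:
        return 0.0
    return 12.0 * math.log2(max(voiced) / min(voiced))


def textualize_f0_evidence(f0_hz: Sequence[float]) -> str:
    """Render the F0 range as one line of text (hypothetical wording)."""
    voiced = voiced_frames(f0_hz)
    if not voiced:
        return "F0 evidence: no voiced frames detected."
    return (
        f"F0 evidence: min {min(voiced):.0f} Hz, max {max(voiced):.0f} Hz, "
        f"range {f0_range_semitones(f0_hz):.1f} semitones."
    )


# Toy contour: one unvoiced frame, then a rise of exactly one octave.
print(textualize_f0_evidence([0.0, 100.0, 150.0, 200.0]))
# prints: F0 evidence: min 100 Hz, max 200 Hz, range 12.0 semitones.
```

The contour itself would come from whatever pitch tracker the system already uses. The point of the sketch is only that the LLM receives a computed, dimension-specific measurement instead of being asked to infer pitch variation from raw audio.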

Key insights

LLMs often prioritize pretraining stereotypes over actual acoustic evidence in L2 pronunciation diagnosis.

Principles

Method

The study evaluated LLM pronunciation feedback on 1,800 L2-Arctic utterances spanning six L1 backgrounds, using three LLMs, four pronunciation dimensions, and five evidence conditions. Outputs were scored on Rating Accuracy, Evidence Coherence, and Grounded Correctness.
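The summary does not define the three scores, so the sketch below uses one plausible reading, stated here as an assumption: each judgment carries two boolean labels (whether the rating matches the gold label, and whether the explanation is coherent with the supplied evidence), and Grounded Correctness is the share of judgments where both hold. The `Judgment` class, the `score` function, and the extra "coherent but incorrect" share are hypothetical names for illustration, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class Judgment:
    rating_correct: bool     # assumed label: rating matches the gold label
    evidence_coherent: bool  # assumed label: explanation fits the evidence


def score(judgments: Sequence[Judgment]) -> Dict[str, float]:
    """Aggregate per-judgment labels into shares (assumed definitions)."""
    n = len(judgments)
    if n == 0:
        raise ValueError("need at least one judgment")

    def share(pred: Callable[[Judgment], bool]) -> float:
        return sum(1 for j in judgments if pred(j)) / n

    return {
        "rating_accuracy": share(lambda j: j.rating_correct),
        "evidence_coherence": share(lambda j: j.evidence_coherent),
        "grounded_correctness": share(
            lambda j: j.rating_correct and j.evidence_coherent
        ),
        "coherent_but_incorrect": share(
            lambda j: j.evidence_coherent and not j.rating_correct
        ),
    }


# Toy data covering all four label combinations once.
toy = [
    Judgment(True, True),
    Judgment(True, False),
    Judgment(False, True),
    Judgment(False, False),
]
print(score(toy))
```

Keeping the two labels separate is what makes a decoupling visible: a model can score well on rating accuracy while most of its coherent explanations are attached to wrong ratings, and a single combined score would hide that.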

In practice

Topics

Best for: Research Scientist, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.