Prior over Evidence: Stereotype-Driven Diagnosis in LLM-Based L2 Pronunciation Feedback

2026-06-13 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A study of large language models (LLMs) used for second-language (L2) English pronunciation feedback finds that their diagnoses are often driven by pretraining priors rather than by the speech evidence supplied. Researchers tested three audio-capable LLMs on 1,800 L2-Arctic utterances from six L1 backgrounds, evaluating four pronunciation dimensions under five evidence conditions. Rating accuracy and grounded reasoning turned out to be decoupled: 39.6% of judgments showed coherent but incorrect reasoning, versus 15.8% for correct reasoning. Phoneme-level feedback consistently named the same fixed set of L2-English difficulty phones, irrespective of L1 background or evidence type. Acoustic evidence improved ratings only when it directly probed the target dimension. For instance, a textualized F0 range raised pitch-variation grounding from 0.18-0.19 to 0.45-0.62, while raw audio alone did not. This suggests LLMs are more effective as verbalizers of externally computed evidence than as independent diagnostic tools.

Key takeaway

NLP engineers building L2 pronunciation feedback systems should expect current LLMs to favor pretraining stereotypes over the actual speech evidence. Use external acoustic feature extractors to supply explicit, targeted evidence rather than relying on raw audio or general LLM capabilities. This improves diagnostic grounding, as the F0 range result shows, and reduces the risk of coherent but incorrect feedback. Validate LLM outputs against gold labels.

Key insights

LLMs often prioritize pretraining stereotypes over actual acoustic evidence in L2 pronunciation diagnosis.

Principles

LLM reasoning can be coherent but incorrect.
LLM feedback may reflect pretraining priors.
Feature input improves grounding only when it directly probes the target dimension.

Method

The study evaluated LLM pronunciation feedback on 1,800 L2-Arctic utterances from six L1 backgrounds, using three audio-capable LLMs, four pronunciation dimensions, and five evidence conditions. Outputs were scored on Rating Accuracy, Evidence Coherence, and Grounded Correctness.
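
The summary names the three scores but does not define them. As a rough illustration only, the sketch below assumes each judgment carries two boolean labels (whether the rating matches the gold label, and whether the stated reasoning is consistent with the supplied evidence) and treats Grounded Correctness as the share of judgments where both hold. The `Judgment` class, the `score` function, and these definitions are hypothetical and may differ from the paper's.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    # Assumed labels; the paper's actual annotation scheme may differ.
    rating_correct: bool     # model rating matches the gold label
    evidence_coherent: bool  # reasoning is consistent with the supplied evidence


def score(judgments):
    """Return assumed versions of the three scores plus the
    coherent-but-incorrect share, each as a fraction of all judgments."""
    n = len(judgments)
    if n == 0:
        raise ValueError("no judgments to score")
    correct = sum(j.rating_correct for j in judgments)
    coherent = sum(j.evidence_coherent for j in judgments)
    grounded = sum(j.rating_correct and j.evidence_coherent for j in judgments)
    return {
        "rating_accuracy": correct / n,
        "evidence_coherence": coherent / n,
        "grounded_correctness": grounded / n,
        # Coherent reasoning attached to a wrong rating.
        "coherent_but_incorrect": (coherent - grounded) / n,
    }
```

Reporting the coherent-but-incorrect share separately makes the decoupling visible: a system can look fluent and well-reasoned on many items while still rating them wrongly.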

In practice

Pre-process acoustic features for LLM input.
Validate LLM diagnoses against ground truth.
Avoid LLMs as standalone diagnostic engines.
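
One way to act on the first point is to compute a targeted feature outside the model and pass it in as text, in the spirit of the textualized F0 range mentioned in the summary. The sketch below is an assumption-laden example, not the study's pipeline: it takes an already extracted per-frame F0 track in Hz (with `0` or `None` for unvoiced frames), and the function names, the 5th-95th percentile choice, and the output wording are all hypothetical.

```python
import math


def f0_range_semitones(f0_hz, low_pct=5.0, high_pct=95.0):
    """Return (low_hz, high_hz, range_in_semitones) over voiced frames.

    Percentiles (nearest-rank) are used instead of min/max so that a few
    pitch-tracking outliers do not inflate the range. This is an assumption.
    """
    # Drop unvoiced frames (None, 0) and invalid values (NaN, negatives).
    voiced = sorted(f for f in f0_hz if f and f > 0)
    if len(voiced) < 2:
        raise ValueError("need at least two voiced frames")

    def pct(p):
        k = math.ceil(p / 100 * len(voiced)) - 1
        return voiced[max(0, min(len(voiced) - 1, k))]

    lo, hi = pct(low_pct), pct(high_pct)
    return lo, hi, 12 * math.log2(hi / lo)


def textualize_f0(f0_hz):
    """Render the F0 range as one sentence to include in the LLM prompt."""
    lo, hi, st = f0_range_semitones(f0_hz)
    return (
        "F0 range (5th-95th percentile of voiced frames): "
        f"{lo:.0f}-{hi:.0f} Hz, {st:.1f} semitones."
    )
```

The resulting sentence would be placed in the prompt next to the question about pitch variation, so the model verbalizes a measured value instead of inferring one from raw audio. The F0 track itself would come from a separate pitch tracker, which is outside this sketch.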

Topics

Large Language Models
L2 Pronunciation Feedback
Acoustic Features
Stereotype Bias
Speech Diagnostics

Best for: Research Scientist, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.