LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment
Summary
This study evaluates 42 large language models (LLMs) on estimating item discrimination in reading comprehension assessments. Item discrimination measures how well an item distinguishes students by proficiency. The researchers used the Cambridge Multiple-Choice Questions Reading Dataset and tested two zero-shot approaches: direct discrimination prediction and response-based Classical Test Theory (CTT) calibration. Both approaches included proficiency-conditioned student personas. Direct prediction aligned weakly with human data, with the best model achieving a Spearman correlation of 0.152. Response-based CTT calibration gave a stronger but still limited signal, reaching 0.241. The LLMs also produced compressed discrimination distributions. These findings indicate that current LLMs do not reliably capture human item discrimination, which poses a challenge for LLM-based psychometric evaluation.
Key takeaway
For NLP engineers developing educational assessment tools: current LLMs are not sufficient for reliably estimating item discrimination. While LLMs can predict item difficulty, their ability to differentiate students by proficiency remains limited, even with persona prompting. Prioritize collecting human response data for accurate psychometric analysis. For high-stakes assessments especially, consider advanced student error simulation techniques. Do not rely on direct LLM judgments or basic response generation for discrimination insights.
Key insights
Current LLMs struggle to reliably estimate item discrimination, a key psychometric property, from reading comprehension content.
Principles
- General reasoning ability in an LLM does not imply alignment with human assessment behavior.
- Item discrimination requires modeling ability-conditioned response patterns, not just overall difficulty.
- Direct LLM judgments of psychometric properties align only weakly with human data.
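Alignment here is reported as a Spearman rank correlation between LLM-derived and human-derived discrimination values. The sketch below shows one way such a comparison could be computed with the standard library only. The function names, the `spread_ratio` helper (a rough check for a compressed distribution), and the sample numbers are all hypothetical illustrations, not the study's code or data.

```python
from statistics import mean, pstdev


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over every value tied with the one at `pos`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            ranks[k] = shared
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (var_x * var_y) ** 0.5


def spread_ratio(predicted, human):
    """Ratio of standard deviations; values well below 1 suggest compression."""
    return pstdev(predicted) / pstdev(human)


# Made-up values for illustration only.
human_disc = [0.10, 0.25, 0.40, 0.55]
llm_disc = [0.30, 0.28, 0.33, 0.35]
rho = spearman(llm_disc, human_disc)
ratio = spread_ratio(llm_disc, human_disc)
```

A rank-based measure is used because it only asks whether the LLM orders items the same way humans do, independent of the scale of the predicted values.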
Method
Item discrimination can be estimated in two ways: direct prediction, where the LLM judges the discrimination value itself, or response-based CTT calibration, where the LLM answers the items and discrimination is then computed from the synthetic responses.
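To make the response-based route concrete, here is a minimal sketch of computing a CTT discrimination index from a matrix of simulated responses. It assumes the corrected item-total (point-biserial) correlation, which is one common CTT index; this summary does not state which statistic the study used. The function names and the tiny response matrix are hypothetical.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation; returns NaN when either input has no variance."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return float("nan")
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (sx * sy)


def item_discrimination(responses):
    """Corrected item-total correlation for each item.

    responses[s][i] is 1 if simulated student s answered item i correctly,
    else 0. Each item is correlated with the total score on the remaining
    items so that the item does not inflate its own index.
    """
    n_items = len(responses[0])
    result = []
    for i in range(n_items):
        item_scores = [row[i] for row in responses]
        rest_scores = [sum(row) - row[i] for row in responses]
        result.append(pearson(item_scores, rest_scores))
    return result


# Hypothetical synthetic responses: 6 simulated students x 3 items.
synthetic = [
    [0, 0, 1],
    [0, 0, 0],
    [0, 1, 1],
    [1, 1, 0],
    [1, 1, 1],
    [1, 1, 0],
]
disc = item_discrimination(synthetic)
```

In this toy matrix the first two items are answered correctly mostly by students who score well on the other items, so their indices are positive, while the third item behaves independently of overall performance and receives a negative index.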
In practice
- Use human response data for reliable item discrimination analysis.
- Explore LLM-based student error simulation for more faithful psychometric proxies.
- Do not rely on LLMs for direct, high-fidelity item discrimination prediction.
Topics
- Large Language Models
- Educational Assessment
- Item Discrimination
- Psychometrics
- Reading Comprehension
- Classical Test Theory
- Student Simulation
Best for: AI Scientist, Research Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.