LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment

· Source: cs.CL updates on arXiv.org · Field: Education & Learning — Educational Technology (EdTech), Educational Psychology & Learning Sciences, Academic Research & Higher Education · Depth: Expert, extended

Summary

This study evaluates 42 large language models (LLMs) on estimating item discrimination, the degree to which an assessment item distinguishes students of different proficiency, in reading comprehension assessments. The researchers used the Cambridge Multiple-Choice Questions Reading Dataset and tested two zero-shot approaches: direct discrimination prediction and response-based Classical Test Theory (CTT) calibration. Both approaches included proficiency-conditioned student personas. Direct prediction aligned only weakly with human-derived discrimination, with the best model reaching a Spearman correlation of 0.152. Response-based CTT calibration gave a stronger but still limited signal, at 0.241. The LLMs also produced compressed discrimination distributions. These findings indicate that current LLMs do not reliably capture human item discrimination, which poses a challenge for LLM-based psychometric evaluation.
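The two measurements mentioned in the summary, rank alignment and distribution compression, can be illustrated with a minimal sketch. The numbers below are invented toy values, not results from the paper, and the helper names `average_ranks` and `spearman` are hypothetical. The ratio of standard deviations is only one assumed way to quantify compression; the paper's own measure is not given here.

```python
from statistics import mean, pstdev


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over all entries tied with the one at `pos`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        tied_rank = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            ranks[k] = tied_rank
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (var_x * var_y) ** 0.5


# Toy values only (assumed for illustration, not taken from the study).
human_disc = [0.10, 0.25, 0.40, 0.55, 0.30]
llm_disc = [0.30, 0.28, 0.33, 0.31, 0.35]

rho = spearman(human_disc, llm_disc)
# A ratio below 1 means the LLM values are less spread out than the human ones.
spread_ratio = pstdev(llm_disc) / pstdev(human_disc)
print(f"Spearman rho = {rho:.3f}, spread ratio = {spread_ratio:.3f}")
```

In this toy case the LLM values preserve part of the human ordering while occupying a much narrower range, which is the pattern the summary describes as compression.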

Key takeaway

For NLP engineers building educational assessment tools: current LLMs cannot reliably estimate item discrimination. Although LLMs can predict item difficulty, their ability to differentiate students by proficiency remains limited, even with persona prompting. Prioritize collecting human response data for accurate psychometric analysis. For high-stakes assessments in particular, consider more advanced student error simulation techniques. Do not rely on direct LLM judgments or basic response generation for discrimination estimates.

Key insights

Current LLMs struggle to reliably estimate item discrimination, a key psychometric property, from reading comprehension content.

Method

Item discrimination can be estimated in two ways: direct prediction, where the LLM judges the discrimination value itself, or response-based CTT calibration, where the LLM answers the items and discrimination is then computed from its synthetic responses.
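The second step of response-based calibration, computing discrimination from a response matrix, can be sketched as follows. This is a minimal illustration that assumes the corrected item-total (item-rest) point-biserial correlation as the CTT discrimination index. The summary does not state which index the study used, and the response matrix below is invented toy data standing in for persona-conditioned LLM answers.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation; returns NaN when either input has no variance."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return float("nan")
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (sx * sy)


def item_discrimination(responses):
    """Corrected item-total correlation for each item.

    responses[s][i] is 1 if simulated student s answered item i correctly,
    else 0. Each item is correlated with the score on the remaining items,
    so the item does not inflate its own discrimination.
    """
    n_items = len(responses[0])
    result = []
    for i in range(n_items):
        item_scores = [row[i] for row in responses]
        rest_scores = [sum(row) - row[i] for row in responses]
        result.append(pearson(item_scores, rest_scores))
    return result


# Toy matrix (assumed): 6 simulated students x 4 items, strongest student first.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

disc = item_discrimination(responses)
for i, d in enumerate(disc):
    print(f"item {i}: discrimination = {d:.3f}")
```

In this toy matrix the first item is answered correctly mostly by the stronger simulated students and so receives a higher index than the last item, whose correct answers are scattered across proficiency levels.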

Best for: AI Scientist, Research Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.