LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment
Summary
A recent study evaluated 42 proprietary and open-weight large language models (LLMs) on their ability to measure item discrimination, a key psychometric property in educational assessment that gauges how well an item differentiates students of varying proficiency. Conducted in zero-shot settings, the research employed two methods: direct discrimination prediction, where LLMs estimated an item's discrimination value, and response-based Classical Test Theory (CTT) calibration, which used LLM-generated answers as synthetic student responses. Direct prediction aligned only weakly with human-calibrated discrimination: the top-performing model achieved a Spearman correlation of just 0.152. Response-based CTT calibration provided a stronger signal, but its effectiveness remained limited, reaching a Spearman correlation of 0.241 for the all-persona synthetic respondent pool. These results mark item discrimination as a significant challenge for LLM-based psychometric evaluation.
Key takeaway
Psychometricians and NLP engineers building automated educational assessment tools should be cautious about relying on LLMs to evaluate item discrimination. Current models do not reliably capture how well assessment items differentiate students by proficiency. Of the two approaches studied, response-based Classical Test Theory (CTT) calibration with synthetic responses gave a stronger signal than direct LLM prediction, but neither replaces human-calibrated estimates where the validity and fairness of assessment items are at stake.
Key insights
LLMs struggle to reliably measure item discrimination, a fundamental psychometric property in educational assessment.
Principles
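- Item discrimination measures how well an item separates students of different proficiency levels.
- LLMs carry some discrimination-relevant signal, but it is weak: Spearman correlations with human-calibrated values were 0.152 for the best direct-prediction model and 0.241 for response-based calibration with the all-persona pool.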
- Response-based CTT calibration outperforms direct prediction.
Method
The study used two zero-shot approaches: direct discrimination prediction from item content and response-based Classical Test Theory (CTT) calibration using LLM-generated answers as synthetic student responses.
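As a rough illustration of the response-based route, the sketch below computes a CTT discrimination index from a matrix of scored synthetic responses. It assumes the corrected item-total (point-biserial) correlation as the index, which is one common CTT choice; the summary does not say which index the study used. The function names and the tiny response matrix are hypothetical and exist only for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # An item everyone got right (or wrong) has no defined discrimination.
        return float("nan")
    return cov / sqrt(var_x * var_y)


def item_discrimination(responses):
    """Corrected item-total correlation for each item.

    responses[s][i] is 1 if (synthetic) respondent s answered item i
    correctly and 0 otherwise. Each item is correlated with the total
    score on the *other* items, so the item does not inflate its own value.
    """
    n_items = len(responses[0])
    values = []
    for i in range(n_items):
        item_scores = [row[i] for row in responses]
        rest_scores = [sum(row) - row[i] for row in responses]
        values.append(pearson(item_scores, rest_scores))
    return values


# Hypothetical scored responses: 4 synthetic respondents x 4 items.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]
disc = item_discrimination(responses)
for i, d in enumerate(disc):
    print(f"item {i}: discrimination = {d:.3f}")
```

In this toy matrix the last item is answered correctly mostly by low scorers, so its index comes out negative, the pattern a discrimination statistic is meant to flag.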
In practice
- Do not rely solely on LLM-based discrimination scores.
- Incorporate human calibration for psychometric validation.
- When using LLMs, prefer response-based CTT calibration to direct prediction.
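One way to act on these points is to check LLM-derived discrimination values against human-calibrated ones with a Spearman rank correlation, the agreement measure reported in the study. The sketch below is a minimal standard-library version that assigns average ranks to ties; the helper names and the two value lists are hypothetical and are not data from the study.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over every value tied with the one at `pos`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# Hypothetical per-item discrimination values (illustrative only).
human_calibrated = [0.42, 0.10, 0.35, 0.27, 0.55]
llm_estimated = [0.30, 0.25, 0.20, 0.40, 0.45]

rho = spearman(human_calibrated, llm_estimated)
print(f"Spearman correlation: {rho:.3f}")
```

Because the statistic depends only on rank order, it rewards an LLM for ordering items correctly by discrimination even when its raw values are on a different scale from the human calibration.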
Topics
- Large Language Models
- Item Discrimination
- Psychometric Evaluation
- Reading Comprehension Assessment
- Classical Test Theory
- Zero-shot Learning
Best for: AI Scientist, NLP Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.