LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment
Summary
A study evaluated 42 proprietary and open-weight large language models (LLMs) on their ability to measure item discrimination, a key psychometric property indicating how well an assessment item distinguishes students of varying proficiency. The research employed two zero-shot approaches: direct discrimination prediction, where the LLM estimates an item's discrimination value from its content, and response-based Classical Test Theory (CTT) calibration, which treats LLM-generated answers as synthetic student responses and computes discrimination scores from them. Both approaches aligned only weakly with human-calibrated discrimination. For direct prediction, the best model reached a Spearman correlation of only 0.152. Response-based CTT calibration gave a stronger but still limited signal, reaching a Spearman correlation of 0.241 with an all-persona synthetic respondent pool. These findings indicate that while LLMs contain some discrimination-relevant signal, they do not yet reliably capture how assessment items differentiate human students.
Key takeaway
AI scientists developing LLM-based educational assessment tools should recognize the limits of current models in measuring item discrimination. While LLMs can estimate item difficulty, reliably measuring how well an item distinguishes student proficiency levels remains an open challenge. Focus research on improving LLM alignment with human psychometric properties, particularly for nuanced metrics like discrimination, before deploying these models in high-stakes assessment contexts.
Key insights
LLMs struggle to reliably measure item discrimination, a key psychometric property for distinguishing student proficiency.
Principles
- Item discrimination is fundamental for educational assessment quality.
- LLMs show weak correlation with human-calibrated discrimination.
- Direct prediction yields a weaker signal than response-based CTT calibration (best Spearman correlation of 0.152 versus 0.241).
Method
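The study used two zero-shot approaches. In direct discrimination prediction, the LLM estimates an item's discrimination value from the item content alone. In response-based Classical Test Theory (CTT) calibration, LLM-generated answers are treated as synthetic student responses, and discrimination scores are computed from those responses.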
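As a rough illustration of the response-based approach, the sketch below computes a CTT discrimination score per item from a binary response matrix. The summary does not say which CTT statistic the study used, so this assumes the common corrected item-total (item-rest) point-biserial correlation. The function names and the toy matrix are hypothetical and stand in for graded LLM answers.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        # Undefined when either variable is constant
        # (e.g. every respondent answered the item correctly).
        return float("nan")
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (sx * sy)


def item_rest_discrimination(responses):
    """Item-rest point-biserial discrimination for each item.

    responses: one row per (synthetic) respondent, each row a list of
    0/1 scores with one entry per item. ASSUMPTION: this statistic is
    one standard CTT choice, not necessarily the one used in the study.
    """
    n_items = len(responses[0])
    scores = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        # Total score with item j removed, so the item is not
        # correlated with itself.
        rest = [sum(row) - row[j] for row in responses]
        scores.append(pearson(item, rest))
    return scores


# Hypothetical toy data: 6 synthetic respondents, 3 items.
toy_responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
print(item_rest_discrimination(toy_responses))
```

Removing the item from the total score before correlating avoids inflating discrimination on short tests. Items that every respondent answers the same way yield `nan`, which is one practical risk when a synthetic respondent pool lacks variation.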
In practice
- Evaluate LLM psychometric capabilities beyond item difficulty.
- Explore synthetic student responses for assessment analysis.
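To evaluate any LLM-derived discrimination estimate against human-calibrated values, the study reports Spearman correlation. The sketch below is a minimal, dependency-free version of that comparison using average ranks for ties. The function names are hypothetical, and any input lists would need to come from your own item bank.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one side has no rank variation
    return cov / (vx * vy) ** 0.5
```

A call would look like `spearman(llm_discrimination, human_discrimination)`, with both lists ordered by the same items. Rank correlation is a natural fit here because it only asks whether the model orders items by discrimination the way human calibration does, not whether the raw values match.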
Topics
- Large Language Models
- Psychometrics
- Item Discrimination
- Educational Assessment
- Classical Test Theory
- Reading Comprehension
Best for: AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.