LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment

· Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Educational Psychology & Learning Sciences · Depth: Expert, quick

Summary

A study evaluated 42 proprietary and open-weight large language models (LLMs) on their ability to measure item discrimination, a key psychometric property indicating how well an assessment item distinguishes students of varying proficiency. The research employed two zero-shot approaches: direct discrimination prediction, where an LLM estimates an item's discrimination value from its content, and response-based Classical Test Theory (CTT) calibration, which treats LLM-generated answers as synthetic student responses and computes discrimination scores from them. Both approaches aligned only weakly with human-calibrated discrimination. Under direct prediction, the best model reached a Spearman correlation of only 0.152. Response-based CTT calibration gave a stronger but still limited signal, reaching a Spearman correlation of 0.241 with an all-persona synthetic respondent pool. These findings indicate that LLMs carry some discrimination-relevant signal but do not yet reliably capture how assessment items differentiate human students.
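
To make the reported alignment metric concrete, the following is a minimal sketch of Spearman rank correlation between model-estimated and human-calibrated discrimination values. It is not the study's evaluation code: the function names and the example numbers are hypothetical and chosen only for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over all entries tied with the current value.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one side has no variation
    return cov / sqrt(vx * vy)


# Hypothetical values for five items (not taken from the study).
human_discrimination = [0.42, 0.18, 0.35, 0.27, 0.51]
model_discrimination = [0.30, 0.25, 0.20, 0.35, 0.40]
rho = spearman(model_discrimination, human_discrimination)
```

Because the metric depends only on rank order, a model can score well without matching the scale of the human values, which is why it suits comparing estimates produced by very different methods.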

Key takeaway

AI scientists developing LLM-based educational assessment tools should recognize that current models do not yet measure item discrimination accurately. While LLMs can estimate item difficulty, capturing how well an item separates students of different proficiency levels remains an open challenge. Research should focus on improving LLM alignment with human psychometric properties, particularly nuanced metrics like discrimination, before such models are deployed in high-stakes assessment contexts.

Key insights

LLMs struggle to reliably measure item discrimination, a key psychometric property for distinguishing student proficiency.

Principles

Method

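The study used two zero-shot approaches: direct discrimination prediction, in which the LLM estimates discrimination from item content, and response-based Classical Test Theory (CTT) calibration, which treats LLM answers as synthetic student responses and computes discrimination scores from them.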
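
As an illustration of the response-based idea, the sketch below computes one common CTT discrimination index, the corrected item-total (point-biserial) correlation, from a binary response matrix. The source does not state which index the study used, so this choice is an assumption, and the function names and the small response matrix are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined if everyone scored the same
    return cov / sqrt(vx * vy)


def ctt_discrimination(responses):
    """Corrected item-total correlation per item.

    responses[r][i] is 1 if respondent r answered item i correctly, else 0.
    Each item is correlated with the total score on the *other* items,
    so the item does not inflate its own discrimination.
    """
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    scores = []
    for i in range(n_items):
        item = [row[i] for row in responses]
        rest = [total - correct for total, correct in zip(totals, item)]
        scores.append(pearson(item, rest))
    return scores


# Hypothetical matrix: 6 synthetic respondents (rows) x 4 items (columns).
synthetic_responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
discrimination = ctt_discrimination(synthetic_responses)
```

In this toy matrix the first item is answered correctly only by the higher-scoring respondents, so it receives a much higher index than the last item, whose correct answers are spread across strong and weak respondents alike.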

In practice

Topics

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.