LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment

Source: Takara TLDR - Daily AI Papers · Field: Education & Learning (Educational Technology (EdTech), Academic Research & Higher Education, Educational Psychology & Learning Sciences) · Depth: Expert, medium

Summary

A recent study evaluated 42 proprietary and open-weight large language models (LLMs) on their ability to measure item discrimination, a key psychometric property in educational assessment that gauges how well an item differentiates students of varying proficiency. Working in zero-shot settings, the study compared two methods: direct discrimination prediction, in which the LLM estimates an item's discrimination value from the item itself, and response-based Classical Test Theory (CTT) calibration, in which LLM-generated answers serve as synthetic student responses. Direct prediction aligned only weakly with human-calibrated discrimination: the top-performing model reached a Spearman correlation of just 0.152. Response-based CTT calibration gave a stronger signal but remained limited, reaching a Spearman correlation of 0.241 for the all-persona synthetic respondent pool. The results mark item discrimination as a significant open challenge for LLM-based psychometric evaluation.

Key takeaway

Psychometricians and NLP engineers building automated educational assessment tools should be cautious about relying on LLMs to evaluate item discrimination. Current models do not reliably capture how well assessment items differentiate students of varying proficiency. Direct LLM predictions align only weakly with human-calibrated values, and response-based CTT calibration with synthetic responses gives a stronger but still limited signal. Human-calibrated discrimination therefore remains the reference for ensuring the validity and fairness of assessment items.

Key insights

LLMs struggle to reliably measure item discrimination, a fundamental psychometric property in educational assessment.

Method

The study used two zero-shot approaches: direct discrimination prediction from item content and response-based Classical Test Theory (CTT) calibration using LLM-generated answers as synthetic student responses.
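
As a rough illustration of the response-based approach, the sketch below computes a CTT discrimination index for each item from a binary response matrix and then compares it with human-calibrated values using Spearman correlation. The summary does not say which CTT index the study used, so the corrected item-total correlation here is an assumption, and the response matrix and human-calibrated values are made-up toy data, not results from the paper. All function and variable names are hypothetical.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sx * sy)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def ctt_discrimination(responses):
    """Corrected item-total correlation for each item (an assumed index).

    responses[r][i] is 1 if respondent r answered item i correctly, else 0.
    Each item is correlated with the total score on the remaining items.
    """
    n_items = len(responses[0])
    result = []
    for i in range(n_items):
        item = [row[i] for row in responses]
        rest = [sum(row) - row[i] for row in responses]
        result.append(pearson(item, rest))
    return result


# Toy data (hypothetical): 6 synthetic respondents x 4 items.
synthetic_responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
# Hypothetical human-calibrated discrimination values for the same 4 items.
human_discrimination = [0.45, 0.60, 0.30, 0.10]

llm_discrimination = ctt_discrimination(synthetic_responses)
rho = spearman(llm_discrimination, human_discrimination)
print("Synthetic-pool discrimination:", [round(d, 3) for d in llm_discrimination])
print("Spearman correlation with human calibration:", round(rho, 3))
```

In this setup each row would come from one LLM persona answering every item, and the final Spearman value plays the role of the agreement figures quoted in the summary. Using the rest score rather than the full total avoids inflating an item's correlation with a score that already includes it.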

Best for: AI Scientist, NLP Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.