Assessment Design in the AI Era: A Method for Identifying Items Functioning Differentially for Humans and Chatbots

2026-04-16 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

Researchers have developed a statistically principled method to identify assessment items that function differently for human learners and large language model (LLM) chatbots. This method, combining educational data mining and psychometric theory, adapts Differential Item Functioning (DIF) analysis, traditionally used for detecting bias across demographic groups, to compare human and AI responses. It integrates negative control analysis and item-total correlation discrimination analysis to pinpoint where assessments are vulnerable to AI misuse and which task dimensions challenge generative AI. The approach was evaluated using responses from 931 human students and six leading chatbots (ChatGPT-4o & 5.2, Gemini 1.5 & 3 Pro, Claude 3.5 & 4.5 Sonnet) on a 22-item high school chemistry diagnostic test and a 40-item university entrance exam. Subject-matter experts then analyzed DIF-flagged items, revealing that chatbots excel in rule-governed reasoning but struggle with visual interpretation, linguistic nuance, and complex multi-step procedures.

Key takeaway

For AI Scientists and Research Scientists designing educational assessments, this research provides a statistically grounded method for identifying items on which LLMs and humans perform differently at the same overall ability level. You should integrate Logistic Regression DIF (LR-DIF) analysis, filtered by Item-Total Correlation (ITC) $\geq$ 0.2, into your assessment design pipeline. Doing so shows which task dimensions make problems easy or difficult for generative AI, which supports more valid, reliable, and fair assessments in an AI-integrated educational environment.

Key insights

DIF analysis can reliably identify assessment items on which human and chatbot performance diverges systematically after controlling for overall ability.

Principles

DIF analysis controls for overall ability.
LR-DIF (logistic regression) is more stable than MH-DIF (Mantel-Haenszel).
An Item-Total Correlation (ITC) threshold filters out unstable items.
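
As a point of reference, the standard logistic regression formulation of DIF can be written as follows. This is the textbook form, not necessarily the exact specification used in the paper, and the symbols are assumptions introduced here: $X_{ij}$ is respondent $i$'s score (0 or 1) on item $j$, $T_i$ is the total test score used as the ability proxy, and $G_i$ is a group indicator (0 for human, 1 for chatbot).

$$
\operatorname{logit} P(X_{ij} = 1 \mid T_i, G_i) = \beta_{0j} + \beta_{1j} T_i + \beta_{2j} G_i + \beta_{3j} T_i G_i
$$

A nonzero $\beta_{2j}$ indicates uniform DIF (one group is favored at every ability level), and a nonzero $\beta_{3j}$ indicates non-uniform DIF (the gap changes with ability). Both are usually tested together by comparing this model with the reduced model $\beta_{0j} + \beta_{1j} T_i$ through a likelihood-ratio statistic:

$$
G^2_j = 2\left(\ell_j^{\text{full}} - \ell_j^{\text{reduced}}\right) \sim \chi^2_2 \quad \text{under no DIF}
$$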

Method

The method applies Logistic Regression DIF (LR-DIF) to identify items with differential functioning between humans and chatbots, validated by negative control analysis, and filtered by Item-Total Correlation (ITC) values $\geq$ 0.2.
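
The LR-DIF step described above can be sketched in a few lines of Python. This is a minimal illustration on synthetic data, not the authors' implementation: the group sizes, the simulated DIF effect, the Bonferroni threshold, and the helper names `fit_logistic` and `lr_dif` are all assumptions made for the example. It fits an ability-only model and a model with group and ability-by-group terms for each item, then compares them with a 2-df likelihood-ratio test.

```python
import math
import numpy as np


def fit_logistic(X, y, n_iter=50, ridge=1e-8):
    """Fit logistic regression by Newton-Raphson; return (coefficients, log-likelihood)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = np.clip(X @ beta, -30, 30)
        p = 1.0 / (1.0 + np.exp(-eta))
        w = p * (1.0 - p)
        grad = X.T @ (y - p)
        # Tiny ridge term keeps the Hessian invertible in degenerate cases.
        hess = (X * w[:, None]).T @ X + ridge * np.eye(X.shape[1])
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-8:
            break
    eta = np.clip(X @ beta, -30, 30)
    loglik = float(np.sum(y * eta - np.log1p(np.exp(eta))))
    return beta, loglik


def lr_dif(item, total, group):
    """Joint 2-df likelihood-ratio test for uniform and non-uniform DIF on one item."""
    z = (total - total.mean()) / total.std()            # standardised matching score
    ones = np.ones_like(z)
    X0 = np.column_stack([ones, z])                     # ability only
    X1 = np.column_stack([ones, z, group, z * group])   # plus group and interaction
    _, ll0 = fit_logistic(X0, item)
    beta1, ll1 = fit_logistic(X1, item)
    g2 = max(0.0, 2.0 * (ll1 - ll0))
    p_value = math.exp(-g2 / 2.0)   # chi-square survival function, exact for df = 2
    return g2, p_value, beta1


# Synthetic data: all sizes and effects below are hypothetical, not the study's.
rng = np.random.default_rng(0)
n_human, n_chatbot, n_items = 900, 300, 10
group = np.concatenate([np.zeros(n_human), np.ones(n_chatbot)])  # 1 = chatbot
ability = rng.normal(size=group.size)
difficulty = np.linspace(-1.0, 1.0, n_items)
dif_shift = np.zeros(n_items)
dif_shift[0] = -2.0   # item 0 is made harder for chatbots at equal ability
logits = ability[:, None] - difficulty[None, :] + group[:, None] * dif_shift[None, :]
responses = (rng.random(logits.shape) < 1.0 / (1.0 + np.exp(-logits))).astype(float)
total = responses.sum(axis=1)

results = [lr_dif(responses[:, j], total, group) for j in range(n_items)]
p_values = np.array([r[1] for r in results])
alpha = 0.05 / n_items   # Bonferroni correction, an illustrative choice
flagged = [j for j in range(n_items) if p_values[j] < alpha]
print("flagged items:", flagged)
```

The sketch uses the raw total score, including the studied item, as the matching variable. A production analysis would typically consider purifying that score by removing flagged items and rerunning, and would report an effect size alongside the p-value.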

In practice

Use LR-DIF to detect AI-specific item vulnerabilities.
Filter DIF results using ITC $\geq$ 0.2 for reliability.
Analyze DIF-flagged items with domain experts.
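
The ITC filter in the second step above can be sketched as follows. The summary does not say whether the paper uses the corrected item-total correlation (item versus the sum of the remaining items) or the uncorrected one, so the corrected version here is an assumption, as are the synthetic data and the helper names `corrected_itc` and `filter_flags`.

```python
import numpy as np


def corrected_itc(responses):
    """Correlate each item with the sum of the remaining items (rest score)."""
    responses = np.asarray(responses, dtype=float)
    total = responses.sum(axis=1)
    itc = np.full(responses.shape[1], np.nan)
    for j in range(responses.shape[1]):
        item = responses[:, j]
        rest = total - item
        # Correlation is undefined when either variable is constant.
        if item.std() > 0 and rest.std() > 0:
            itc[j] = np.corrcoef(item, rest)[0, 1]
    return itc


def filter_flags(dif_flagged, itc, threshold=0.2):
    """Keep DIF-flagged item indices whose ITC meets the threshold."""
    return [j for j in dif_flagged if itc[j] >= threshold]  # NaN compares False


# Synthetic data: items 0-4 depend on ability, item 5 is a pure coin flip.
rng = np.random.default_rng(1)
n = 1000
ability = rng.normal(size=n)
difficulty = np.linspace(-0.5, 0.5, 5)
logits = 2.0 * (ability[:, None] - difficulty[None, :])
good = (rng.random((n, 5)) < 1.0 / (1.0 + np.exp(-logits))).astype(float)
noise = (rng.random((n, 1)) < 0.5).astype(float)
responses = np.hstack([good, noise])

itc = corrected_itc(responses)
dif_flagged = [0, 5]   # hypothetical output of a DIF analysis
kept = filter_flags(dif_flagged, itc)
print("ITC:", np.round(itc, 2))
print("kept for expert review:", kept)
```

Only the flags that survive this filter would then go to subject-matter experts, on the reasoning that an item weakly related to the rest of the test gives an unstable DIF estimate.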

Topics

Assessment Design
Generative AI
Differential Item Functioning
Psychometric Theory
Large Language Models

Best for: AI Scientist, Research Scientist, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.