Assessment Design in the AI Era: A Method for Identifying Items Functioning Differentially for Humans and Chatbots
Summary
Researchers have developed a statistically principled method to identify assessment items that function differently for human learners and large language model (LLM) chatbots. The method combines educational data mining with psychometric theory: it adapts Differential Item Functioning (DIF) analysis, traditionally used to detect bias across demographic groups, to compare human and AI responses. It integrates negative control analysis and item-total correlation discrimination analysis to pinpoint where assessments are vulnerable to AI misuse and which task dimensions challenge generative AI. The approach was evaluated using responses from 931 human students and six leading chatbots (ChatGPT-4o and 5.2, Gemini 1.5 and 3 Pro, Claude 3.5 and 4.5 Sonnet) on a 22-item high school chemistry diagnostic test and a 40-item university entrance exam. Subject-matter experts then analyzed the DIF-flagged items and found that chatbots excel at rule-governed reasoning but struggle with visual interpretation, linguistic nuance, and complex multi-step procedures.
Key takeaway
For AI Scientists and Research Scientists designing educational assessments, this research provides a robust method to identify items where LLMs and humans perform differently. You should integrate Logistic Regression DIF (LR-DIF) analysis, filtered by Item-Total Correlation (ITC) $\geq$ 0.2, into your assessment design pipeline. This will help you understand specific task dimensions that make problems easy or difficult for generative AI, enabling the creation of more valid, reliable, and fair assessments in an AI-integrated educational environment.
Key insights
DIF analysis can reliably identify assessment items where human and chatbot performance systematically diverges after controlling for overall ability.
Principles
- DIF analysis controls for overall ability.
- LR-DIF is more stable than Mantel-Haenszel DIF (MH-DIF).
- An Item-Total Correlation (ITC) threshold filters out unstable items.
Method
The method applies Logistic Regression DIF (LR-DIF) to identify items that function differently for humans and chatbots, validates the resulting flags with a negative control analysis, and retains only items with Item-Total Correlation (ITC) values $\geq$ 0.2.
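To make the LR-DIF step concrete, here is a minimal sketch of the standard logistic regression DIF test, not the authors' implementation. For each item it compares a model with only the matching variable (total score) against a model that adds a group term and a group-by-score interaction, using a likelihood ratio test with 2 degrees of freedom. The data are synthetic, the group sizes and the assumption of repeated chatbot runs are hypothetical, and the function names `max_loglik` and `lr_dif_statistic` are illustrative only.

```python
import numpy as np

# 0.95 quantile of the chi-square distribution with 2 degrees of freedom
CHI2_CRIT_DF2 = 5.991

def max_loglik(X, y, n_iter=50):
    """Fit a logistic regression by Newton-Raphson; return the maximized log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))
        grad = X.T @ (y - p)
        # Tiny ridge term keeps the Hessian invertible; it does not move the optimum.
        hess = (X * (p * (1 - p))[:, None]).T @ X + 1e-8 * np.eye(X.shape[1])
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < 1e-8:
            break
    p = 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def lr_dif_statistic(item, matching, group):
    """Likelihood ratio statistic: (score + group + score*group) vs. (score only)."""
    ones = np.ones_like(matching)
    reduced = np.column_stack([ones, matching])
    full = np.column_stack([ones, matching, group, matching * group])
    return 2.0 * (max_loglik(full, item) - max_loglik(reduced, item))

# --- Synthetic data (all values below are assumptions for illustration) ---
rng = np.random.default_rng(0)
n_human, n_bot, n_items = 900, 200, 20
group = np.r_[np.zeros(n_human), np.ones(n_bot)]   # 0 = human, 1 = chatbot run
theta = rng.normal(size=group.size)                # latent ability
difficulty = np.linspace(-1.5, 1.5, n_items)
logits = theta[:, None] - difficulty[None, :]
logits[:, 0] -= 2.0 * group                        # item 0 is harder for chatbots at equal ability
responses = (rng.random(logits.shape) < 1.0 / (1.0 + np.exp(-logits))).astype(float)

# Matching variable: standardized total score
total = responses.sum(axis=1)
matching = (total - total.mean()) / total.std()

lr_stats = np.array([
    lr_dif_statistic(responses[:, j], matching, group) for j in range(n_items)
])
flagged = np.flatnonzero(lr_stats > CHI2_CRIT_DF2)
print("DIF-flagged items:", flagged.tolist())
```

In this sketch the matching variable includes the studied item itself. A real analysis might instead use a rest score or a purified total, and would typically correct for multiple comparisons across items.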
In practice
- Use LR-DIF to detect AI-specific item vulnerabilities.
- Filter DIF results using ITC $\geq$ 0.2 for reliability.
- Analyze DIF-flagged items with domain experts.
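The ITC filter in the list above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it uses the corrected item-total correlation (each item against the sum of the remaining items), which may differ from the exact ITC variant in the original study. The helper names `corrected_itc` and `keep_reliable_flags`, the synthetic responses, and the list of DIF-flagged items are all hypothetical.

```python
import numpy as np

def corrected_itc(responses):
    """Correlation of each item with the total score of all other items."""
    responses = np.asarray(responses, dtype=float)
    total = responses.sum(axis=1)
    return np.array([
        np.corrcoef(responses[:, j], total - responses[:, j])[0, 1]
        for j in range(responses.shape[1])
    ])

def keep_reliable_flags(dif_flagged, itc, threshold=0.2):
    """Keep only DIF-flagged items whose ITC meets the threshold."""
    return [int(j) for j in dif_flagged if itc[j] >= threshold]

# --- Synthetic data: nine ability-driven items plus one pure-noise item ---
rng = np.random.default_rng(1)
theta = rng.normal(size=1000)
difficulty = np.linspace(-1.0, 1.0, 9)
p = 1.0 / (1.0 + np.exp(-2.0 * (theta[:, None] - difficulty[None, :])))
good_items = (rng.random(p.shape) < p).astype(float)
noise_item = (rng.random((1000, 1)) < 0.5).astype(float)   # unrelated to ability
responses = np.hstack([good_items, noise_item])            # item 9 is the noise item

itc = corrected_itc(responses)
dif_flagged = [0, 9]   # hypothetical output of a DIF analysis
kept = keep_reliable_flags(dif_flagged, itc)
print("Flags kept after ITC filter:", kept)
```

The noise item carries no information about ability, so its ITC sits near zero and its DIF flag is dropped. The remaining flagged items are the ones to pass to domain experts for review.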
Topics
- Assessment Design
- Generative AI
- Differential Item Functioning
- Psychometric Theory
- Large Language Models
Best for: AI Scientist, Research Scientist, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.