Generative-Evaluative Agreement: A Necessary Validity Criterion for LLM-Enabled Adaptive Assessment
Summary
Generative-Evaluative Agreement (GEA) is introduced as a validity criterion for adaptive assessment systems powered by Large Language Models (LLMs). The criterion addresses the self-referential validation loop that arises when the same LLM generates assessment items, simulates student responses, and scores those responses. The first direct measurement of GEA on a two-stage adaptive assessment found a correlation of r = 0.698 between intended and scored skill levels, meaning the model recovered roughly half of the intended variance (r² ≈ 0.49), alongside a systematic positive bias. GEA is robust (r > 0.7) for skills that are syntactically verifiable but approaches zero for design-level skills. In addition, overestimation of low-skill responses inflates scores near the routing threshold. The primary mechanism proposed for strengthening GEA is the use of granular, skill-decomposed rubrics; additional mitigations are also outlined.
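The two quantities reported above, the correlation and the positive bias, can be illustrated with a minimal sketch. It assumes GEA is measured as the Pearson correlation between the skill level each simulated response was intended to show and the level the scorer assigned, and that bias is the mean signed difference between the two. The function name `gea_summary` and the example numbers are hypothetical and do not come from the original article.

```python
from math import sqrt
from statistics import mean


def gea_summary(intended, scored):
    """Return (pearson_r, r_squared, mean_bias) for paired skill levels.

    intended: skill levels the generator was asked to simulate (assumed numeric)
    scored:   skill levels the evaluator assigned to the same responses
    """
    if len(intended) != len(scored) or len(intended) < 2:
        raise ValueError("need two equally sized samples with at least 2 points")

    mean_i, mean_s = mean(intended), mean(scored)
    cov = sum((i - mean_i) * (s - mean_s) for i, s in zip(intended, scored))
    var_i = sum((i - mean_i) ** 2 for i in intended)
    var_s = sum((s - mean_s) ** 2 for s in scored)
    if var_i == 0 or var_s == 0:
        raise ValueError("correlation is undefined when one sample is constant")

    r = cov / sqrt(var_i * var_s)
    # A positive bias means the scorer rates responses above the intended level.
    bias = mean_s - mean_i
    return r, r * r, bias


# Hypothetical data: the scorer is uniformly one level too generous.
r, r2, bias = gea_summary([1, 2, 3, 4], [2, 3, 4, 5])
```

The example also shows why correlation alone is not enough: a scorer that is consistently too generous can still correlate perfectly with the intended levels, so the bias term has to be reported separately.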
Key takeaway
AI scientists developing LLM-enabled adaptive assessments should treat Generative-Evaluative Agreement (GEA) as a prerequisite for assessment validity. In the reported measurement, LLM scoring recovered only about half of the intended variance and showed a systematic positive bias, overestimating low-skill responses in particular near routing thresholds. Granular, skill-decomposed rubrics are the proposed way to strengthen GEA, especially for complex design-level skills, where agreement is currently near zero. Validation efforts should explicitly measure and mitigate these biases.
Key insights
Measuring Generative-Evaluative Agreement (GEA) is crucial for validating LLM-based adaptive assessments: it exposes systematic scoring bias and skill-dependent accuracy.
Principles
- LLM assessment validity needs external agreement.
- Skill type dictates LLM scoring accuracy.
- Granular rubrics improve LLM assessment validity.
In practice
- Implement skill-decomposed rubrics.
- Scrutinize design-level skill assessments.
- Monitor routing threshold score inflation.
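The practices above can be sketched as two small checks. This is an illustrative sketch only: the record layout `(skill, intended, scored)`, the function names, and the 0.5 routing threshold are assumptions made for the example, not details taken from the original article.

```python
from collections import defaultdict
from statistics import mean


def per_skill_bias(records):
    """Mean signed scoring error (scored - intended) for each rubric skill.

    records: iterable of (skill, intended, scored) tuples, one per rubric
    criterion and response. Positive values indicate overestimation.
    """
    errors = defaultdict(list)
    for skill, intended, scored in records:
        errors[skill].append(scored - intended)
    return {skill: mean(values) for skill, values in errors.items()}


def routing_inflation(intended, scored, threshold):
    """Fraction of responses intended below the threshold but scored at or above it."""
    below = [(i, s) for i, s in zip(intended, scored) if i < threshold]
    if not below:
        return 0.0
    promoted = sum(1 for _, s in below if s >= threshold)
    return promoted / len(below)


# Hypothetical records: syntax is scored accurately, design is overestimated.
records = [
    ("syntax", 0.25, 0.25),
    ("syntax", 0.75, 0.75),
    ("design", 0.25, 0.75),
    ("design", 0.5, 0.75),
]
bias_by_skill = per_skill_bias(records)

# Hypothetical routing check with an assumed threshold of 0.5.
inflation = routing_inflation([0.2, 0.4, 0.45, 0.7], [0.3, 0.55, 0.5, 0.8], 0.5)
```

Breaking the score down by skill keeps a design-level problem from being averaged away by well-behaved syntactic criteria, and the routing check isolates the low-skill responses whose overestimation would move them across the threshold.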
Topics
- Generative-Evaluative Agreement
- LLM Assessment
- Adaptive Learning
- Validity Criteria
- Skill Assessment
- Rubric Design
Best for: AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.