Creating and Evaluating K-12 GenAI Assessment Graders Through Context Engineering
Summary
This study evaluates an LLM-based grading pipeline designed for K-12 assessments, employing context and prompt engineering with commercially available foundation models. Researchers tested Claude Sonnet 4, Haiku 4.5, GPT-5, and GPT-5 Mini against 822 student responses from the Massachusetts Comprehensive Assessment System (MCAS) across mathematics, science, and English Language Arts (ELA). Using Quadratic Weighted Kappa (QWK) and Proportional Reduction in Mean-Squared Error (PRMSE), the pipeline demonstrated substantial agreement with human raters in mathematics and science; for instance, GPT-5 achieved a QWK of 0.951 and PRMSE of 0.946 in mathematics. ELA performance varied significantly by model, with Claude Sonnet 4 showing exceptional utility in reading comprehension (PRMSE 0.941) while GPT-5 Mini struggled. Feedback indicates strong acceptance of AI narrative feedback but skepticism towards numerical scores, positioning LLMs as effective formative tools rather than summative evaluators.
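QWK, the agreement statistic cited above, can be computed directly from paired human and model scores. The block below is a minimal sketch for illustration, not the study's evaluation code. The function name, its parameters, and the sample scores are assumptions made for this example.

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Quadratic Weighted Kappa between two equal-length lists of integer scores.

    Disagreements are penalized by the squared distance between the two
    scores, so a 0-vs-3 mismatch costs far more than a 2-vs-3 mismatch.
    """
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    n_categories = max_score - min_score + 1
    if n_categories < 2:
        raise ValueError("the score scale needs at least two points")

    n = len(human)
    observed = Counter(zip(human, model))  # joint counts of (human, model) pairs
    human_hist = Counter(human)            # marginal counts for the human rater
    model_hist = Counter(model)            # marginal counts for the model

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = (i - j) ** 2 / (n_categories - 1) ** 2
            weighted_observed += weight * observed[(i, j)]
            # Count expected by chance if the two raters were independent.
            weighted_expected += weight * human_hist[i] * model_hist[j] / n

    if weighted_expected == 0:
        # Both raters used one identical score throughout; kappa is undefined.
        return float("nan")
    return 1.0 - weighted_observed / weighted_expected


# Invented scores on a 0-1 scale, chosen so the result can be checked by hand.
human_scores = [0, 0, 1, 1]
model_scores = [0, 1, 1, 1]
kappa = quadratic_weighted_kappa(human_scores, model_scores, 0, 1)
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement.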
Key takeaway
Educators and developers designing K-12 AI grading tools should prioritize context-engineered pipelines for mathematics and science, where LLMs demonstrate high reliability for formative feedback. These systems can significantly reduce workload and improve feedback quality, but they should not be used for high-stakes summative scores, particularly in English Language Arts, where performance varies by model and scoring calls for nuanced human judgment. Implement hybrid workflows that retain human oversight for final grades and continuously monitor system reliability.
Key insights
Context engineering enables LLMs to grade K-12 assessments reliably, particularly in mathematics and science, but human oversight remains vital.
Principles
- Context engineering is critical for LLM grading reliability.
- LLM grading efficacy varies by subject construct.
- PRMSE offers robust evaluation of AI scoring utility.
Method
A grading pipeline systematically bundles assessment text, rubrics, exemplar responses, and instructions to guide LLM evaluation, ensuring consistent application across submissions.
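One way to picture this bundling is a fixed context object that is rendered identically for every submission, with only the student response changing. The following is a minimal sketch under assumptions: the class names, field names, section labels, and sample item are hypothetical and are not taken from the study's pipeline, and the call to a model is deliberately left out.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Exemplar:
    """A scored sample response used to anchor the rubric (hypothetical structure)."""
    response: str
    score: int


@dataclass(frozen=True)
class GradingContext:
    """Everything that stays constant across submissions for one item."""
    item_text: str
    rubric: str
    exemplars: Sequence[Exemplar]
    instructions: str


def build_prompt(context: GradingContext, student_response: str) -> str:
    """Render the shared context followed by a single student response."""
    exemplar_lines = [f"Score {ex.score}: {ex.response}" for ex in context.exemplars]
    sections = [
        ("INSTRUCTIONS", context.instructions),
        ("ASSESSMENT ITEM", context.item_text),
        ("RUBRIC", context.rubric),
        ("EXEMPLAR RESPONSES", "\n".join(exemplar_lines)),
        ("STUDENT RESPONSE", student_response),
    ]
    return "\n\n".join(f"## {title}\n{body}" for title, body in sections)


# Invented item and rubric, used only to show the shape of the bundle.
context = GradingContext(
    item_text="Which is larger, 3/4 or 2/3? Explain your reasoning.",
    rubric="2 = correct answer with valid reasoning; 1 = correct answer only; 0 = incorrect.",
    exemplars=[
        Exemplar("3/4, because 9/12 is more than 8/12.", 2),
        Exemplar("3/4.", 1),
    ],
    instructions="Score the response from 0 to 2 using the rubric, then give brief feedback.",
)
prompt_a = build_prompt(context, "2/3 because 3 is bigger than 2.")
prompt_b = build_prompt(context, "3/4 since 0.75 is greater than 0.67.")
```

Because the context is frozen and rendered before the response, every submission is graded against the same instructions, rubric, and exemplars, which is the consistency property the method describes.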
In practice
- Deploy LLMs for formative feedback, not summative scores.
- Anchor LLM grading with detailed rubrics and answer keys.
Topics
- Generative AI
- Large Language Models
- K-12 Assessment
- Context Engineering
- Automated Grading
- Psychometric Evaluation
Best for: NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.