Creating and Evaluating K-12 GenAI Assessment Graders Through Context Engineering

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Education & Learning — Educational Technology (EdTech), K-12 Education & Child Development · Depth: Expert, extended

Summary

This study evaluates an LLM-based grading pipeline for K-12 assessments, built with context and prompt engineering on commercially available foundation models. The researchers tested Claude Sonnet 4, Haiku 4.5, GPT-5, and GPT-5 Mini on 822 student responses from the Massachusetts Comprehensive Assessment System (MCAS) across mathematics, science, and English Language Arts (ELA). Measured by Quadratic Weighted Kappa (QWK) and Proportional Reduction in Mean-Squared Error (PRMSE), the pipeline showed substantial agreement with human raters in mathematics and science; for instance, GPT-5 achieved a QWK of 0.951 and a PRMSE of 0.946 in mathematics. ELA performance varied widely by model: Claude Sonnet 4 performed strongly in reading comprehension (PRMSE 0.941), while GPT-5 Mini struggled. Feedback on the system indicated strong acceptance of the AI-generated narrative feedback but skepticism toward its numerical scores, positioning LLMs as effective formative tools rather than summative evaluators.

Key takeaway

Educators and developers designing K-12 AI grading tools should prioritize context-engineered pipelines for mathematics and science, where LLMs are reliable enough for formative feedback. These systems can reduce grading workload and improve feedback quality, but they should not be used for high-stakes summative scores, particularly in English Language Arts, where performance varies by model and nuanced human judgment is needed. Use hybrid workflows that keep human oversight of final grades, and monitor system reliability continuously.

Key insights

Context engineering enables LLMs to reliably grade K-12 assessments, particularly for objective subjects, but human oversight is vital.

Principles

Context engineering is critical for LLM grading reliability.
LLM grading efficacy varies by subject construct.
PRMSE offers robust evaluation of AI scoring utility.
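
The agreement statistic cited in the summary, Quadratic Weighted Kappa, can be computed directly from paired human and model scores. What follows is a minimal sketch in plain Python, not the study's evaluation code; the function name, its arguments, and the sample scores are illustrative assumptions. PRMSE is not sketched because it typically also needs an estimate of human rating error (for example from double-scored responses), which this summary does not describe.

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Cohen's kappa with quadratic weights for two lists of integer scores.

    `human` and `model` are hypothetical parallel lists: one score per
    student response, on the same min_score..max_score rubric scale.
    """
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and equally long")
    categories = max_score - min_score + 1
    if categories < 2:
        raise ValueError("the score scale needs at least two points")

    n = len(human)
    observed = Counter(zip(human, model))  # joint counts of (human, model)
    human_hist = Counter(human)            # marginal counts per rater
    model_hist = Counter(model)

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            # Quadratic penalty: disagreeing by 2 points costs 4x a 1-point miss.
            weight = (i - j) ** 2 / (categories - 1) ** 2
            weighted_observed += weight * observed[(i, j)] / n
            # Expected disagreement if the two raters were independent.
            weighted_expected += weight * human_hist[i] * model_hist[j] / (n * n)

    if weighted_expected == 0:
        raise ValueError("kappa is undefined when both raters give one identical score")
    return 1.0 - weighted_observed / weighted_expected


if __name__ == "__main__":
    # Made-up scores on a 0-3 rubric: the model misses one response by one point.
    print(round(quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 2], 0, 3), 3))  # 0.875
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement.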

Method

A grading pipeline systematically bundles assessment text, rubrics, exemplar responses, and instructions to guide LLM evaluation, ensuring consistent application across submissions.
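
The bundling step described above might look roughly like the sketch below. This is an assumption-laden illustration, not the authors' pipeline: the class names, field names, prompt layout, and sample item are all hypothetical, and the call to an actual model is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Exemplar:
    """A scored sample response (hypothetical structure)."""
    response: str
    score: int
    rationale: str = ""


@dataclass
class GradingContext:
    """Everything that stays fixed for one assessment item."""
    item_text: str
    rubric: str
    exemplars: List[Exemplar] = field(default_factory=list)
    instructions: str = (
        "Score the student response using only the rubric. "
        "Return a score and a short explanation written for the student."
    )

    def build_prompt(self, student_response: str) -> str:
        """Assemble one prompt; only the final section changes per student."""
        parts = [
            "INSTRUCTIONS\n" + self.instructions,
            "ASSESSMENT ITEM\n" + self.item_text,
            "RUBRIC\n" + self.rubric,
        ]
        for number, ex in enumerate(self.exemplars, start=1):
            parts.append(
                f"EXEMPLAR {number} (score {ex.score})\n{ex.response}\n"
                f"Rationale: {ex.rationale}"
            )
        parts.append("STUDENT RESPONSE\n" + student_response)
        return "\n\n".join(parts)


if __name__ == "__main__":
    context = GradingContext(
        item_text="Solve 3x + 2 = 11 and show your work.",
        rubric="2 = correct answer with work; 1 = correct method, wrong answer; 0 = otherwise.",
        exemplars=[Exemplar("3x = 9, so x = 3", 2, "Correct method and answer.")],
    )
    print(context.build_prompt("x = 4 because 3 * 4 = 11"))
```

Because the item, rubric, exemplars, and instructions are built once and reused, every submission is scored against an identical context, which is the consistency property the method relies on.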

In practice

Deploy LLMs for formative feedback, not summative scores.
Anchor LLM grading with detailed rubrics and answer keys.
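
One way to keep the model in a formative role is to release its narrative feedback immediately while withholding any score until a teacher confirms it. The sketch below illustrates that idea under stated assumptions; the class, its fields, and both functions are hypothetical names, not part of the study.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GradedResponse:
    """Model output for one student response (hypothetical structure)."""
    narrative_feedback: str
    suggested_score: int
    final_score: Optional[int] = None  # set only by a human reviewer


def student_view(graded: GradedResponse) -> dict:
    """What the student sees: feedback always, a score only once confirmed."""
    return {
        "feedback": graded.narrative_feedback,
        "score": graded.final_score,  # None until a teacher signs off
    }


def confirm_score(graded: GradedResponse, teacher_score: int) -> GradedResponse:
    """Record the teacher's decision; the model's suggestion is advisory only."""
    graded.final_score = teacher_score
    return graded
```

Logging how often `teacher_score` differs from `suggested_score` would give a simple running check on reliability.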

Topics

Generative AI
Large Language Models
K-12 Assessment
Context Engineering
Automated Grading
Psychometric Evaluation

Best for: NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.