LLM-as-Judge in Education: A Curriculum-Grounded Marking Pipeline

Source: cs.SE updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning; Software Development & Engineering) · Depth: Advanced, extended

Summary

A curriculum-grounded, configurable LLM-as-Judge pipeline has been developed for question-level marking, specifically to support university admission exam preparation. This system systematically grounds LLM outputs in authorized curriculum artifacts, including the NSW HSC syllabus, marking guidelines, performance band descriptors, and glossary definitions. It employs a staged LLM workflow to generate question-specific rubrics and derive marking criteria, enhancing consistency, transparency, and alignment with official practices. Co-developed with Studitory, an online study platform serving over 5,700 students, the pipeline underwent preliminary evaluation. Results indicate marking outcomes comparable to human tutors, with justifications more traceable to authorized curriculum artifacts. Initial deployment data from January 31 to March 7, 2026, showed a 2.91% manual override rate across 3,166 processed answers.
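To give the reported override rate a sense of scale, the short calculation below derives the number of overridden answers it implies. Only the 2.91% rate and the 3,166 total come from the summary; the absolute count is an inference from those two figures, not a number reported by the source.

```python
# Reported figures from the summary.
processed_answers = 3166
override_rate = 0.0291  # 2.91%

# Implied number of manually overridden answers (inferred, not reported).
implied_overrides = round(processed_answers * override_rate)

# Sanity check: the integer count should reproduce the reported rate
# when rounded to two decimal places.
recomputed_pct = round(100 * implied_overrides / processed_answers, 2)

print(implied_overrides, recomputed_pct)  # 92 2.91
```

Under that reading, roughly 92 of the 3,166 processed answers were manually overridden during the deployment window.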

Key takeaway

AI scientists and machine learning engineers deploying LLMs in high-stakes educational assessment should prioritize grounding their systems, at the architecture level, in authorized curriculum artifacts. Direct LLM prompting may match human marks numerically, but it often lacks the verifiable alignment and traceable justifications that trustworthiness requires. Structured pipelines with explicit verification points help ensure consistency, transparency, and adherence to official marking standards, which in turn supports auditability and responsible AI deployment.
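One way to picture an explicit verification point is a check that every justification cites at least one authorized artifact and nothing outside the authorized set. The sketch below is illustrative only: the `Justification` type, the `untraceable` helper, and the artifact IDs are hypothetical and are not taken from the source pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Justification:
    """Hypothetical record of one marking decision and what it cites."""
    criterion: str
    text: str
    cited_artifact_ids: tuple[str, ...]


def untraceable(justifications: list[Justification],
                authorized_ids: set[str]) -> list[str]:
    """Return criteria whose justification cites nothing, or cites an ID
    that is not in the authorized artifact set."""
    flagged = []
    for j in justifications:
        cited = set(j.cited_artifact_ids)
        if not cited or not cited <= authorized_ids:
            flagged.append(j.criterion)
    return flagged


# Placeholder IDs; a real system would use its own artifact identifiers.
authorized_ids = {"syllabus-1", "guideline-3"}
justifications = [
    Justification("correct method", "Follows the guideline steps.", ("guideline-3",)),
    Justification("correct final answer", "Looks right.", ()),
    Justification("uses terminology", "Matches an outside source.", ("blog-post-9",)),
]

flagged = untraceable(justifications, authorized_ids)
print(flagged)  # ['correct final answer', 'uses terminology']
```

Flagged criteria could then be routed to a human marker rather than released automatically, which is one plausible place for a manual override to occur.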

Key insights

LLM assessment grounded in official curriculum artifacts improves consistency, transparency, and traceability for high-stakes education.

Method

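The pipeline runs as a staged LLM workflow with retrieval-augmented generation (RAG): it identifies the topics and skills a question targets, assembles verifiable curriculum context, generates a question-specific rubric, derives marking criteria from that rubric, and then evaluates the answer against those criteria.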
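The staged workflow described here can be sketched as a chain of small functions, each consuming the previous stage's output. This is a minimal sketch under stated assumptions, not the authors' implementation: the `LLM` callable, the stage prompts, the tag-based `retrieve` function, and the placeholder corpus are all hypothetical, and a stub stands in for a real model so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out


@dataclass
class MarkingResult:
    topics: list[str]
    context: list[str]
    rubric: str
    criteria: list[str]
    verdicts: dict[str, str]


def retrieve(topics: list[str], corpus: list[tuple[str, str]]) -> list[str]:
    """Toy retrieval step: keep artifacts tagged with an identified topic."""
    return [text for tag, text in corpus if tag in topics]


def mark(question: str, answer: str,
         corpus: list[tuple[str, str]], llm: LLM) -> MarkingResult:
    # Stage 1: identify the topics/skills the question targets.
    topics = [t.strip() for t in llm(f"TOPICS\n{question}").split(",") if t.strip()]
    # Stage 2: assemble curriculum context for those topics.
    context = retrieve(topics, corpus)
    # Stage 3: generate a question-specific rubric grounded in that context.
    rubric = llm("RUBRIC\n" + question + "\n" + "\n".join(context))
    # Stage 4: derive individual marking criteria from the rubric.
    criteria = [c.strip() for c in llm("CRITERIA\n" + rubric).splitlines() if c.strip()]
    # Stage 5: evaluate the answer against each criterion.
    verdicts = {c: llm(f"EVALUATE\n{c}\n{answer}") for c in criteria}
    return MarkingResult(topics, context, rubric, criteria, verdicts)


def stub_llm(prompt: str) -> str:
    """Canned responses keyed on the first prompt line (stand-in for a model)."""
    stage = prompt.split("\n", 1)[0]
    return {
        "TOPICS": "algebra",
        "RUBRIC": "Full marks require a correct method and a correct final answer.",
        "CRITERIA": "correct method\ncorrect final answer",
        "EVALUATE": "met",
    }[stage]


# Placeholder artifacts; real ones would be authorized curriculum documents.
corpus = [
    ("algebra", "Placeholder syllabus text for algebra."),
    ("geometry", "Placeholder syllabus text for geometry."),
]

result = mark("Solve 2x + 3 = 7.", "x = 2", corpus, stub_llm)
print(result.verdicts)
```

Keeping each stage's output in the result object means the rubric, criteria, and retrieved context can all be inspected after the fact, which is what makes a justification traceable back to its sources.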

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.