LLM-as-Judge in Education: A Curriculum-Grounded Marking Pipeline
Summary
A curriculum-grounded, configurable LLM-as-Judge pipeline has been developed for question-level marking, specifically to support university admission exam preparation. This system systematically grounds LLM outputs in authorized curriculum artifacts, including the NSW HSC syllabus, marking guidelines, performance band descriptors, and glossary definitions. It employs a staged LLM workflow to generate question-specific rubrics and derive marking criteria, enhancing consistency, transparency, and alignment with official practices. Co-developed with Studitory, an online study platform serving over 5,700 students, the pipeline underwent preliminary evaluation. Results indicate marking outcomes comparable to human tutors, with justifications more traceable to authorized curriculum artifacts. Initial deployment data from January 31 to March 7, 2026, showed a 2.91% manual override rate across 3,166 processed answers.
Key takeaway
AI Scientists and Machine Learning Engineers deploying LLMs in high-stakes educational assessment should prioritize grounding their systems architecturally in authorized curriculum artifacts. Direct LLM prompting may match human marks numerically, but it often lacks the verifiable alignment and traceable justifications that trustworthiness requires. Structured pipelines with explicit verification points help ensure consistency, transparency, and adherence to official marking standards, which in turn enables auditability and responsible AI deployment.
Key insights
LLM assessment grounded in official curriculum artifacts improves consistency, transparency, and traceability for high-stakes education.
Principles
- Curriculum intent must be operationalized through concrete syllabus artifacts.
- Tacit human judgment can be embedded via authorized documents and constraints.
- Assessment trustworthiness emerges from architectural design, not isolated model behaviors.
Method
The pipeline works in stages: it identifies the topics and skills a question targets, assembles verifiable curriculum context, generates a question-specific rubric, and then derives and evaluates marking criteria. The stages run as an LLM workflow supported by retrieval-augmented generation (RAG).
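The staged flow described above can be pictured roughly as follows. This is a minimal sketch, not the authors' implementation: the `CurriculumContext` fields, the `mark_answer` function, the stage tags, and the prompt wording are all hypothetical, the retrieval step is omitted (the context is passed in already assembled), and "evaluate" is assumed here to mean applying the derived criteria to a student's answer. The model is any callable from prompt to text, so a stub stands in for a real LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical type: any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]


@dataclass
class CurriculumContext:
    """Authorized artifacts used as grounding (field names are illustrative)."""
    syllabus: str
    marking_guidelines: str
    band_descriptors: str
    glossary: str

    def as_prompt(self) -> str:
        # Concatenate the artifacts into one labeled grounding section.
        return (
            f"Syllabus: {self.syllabus}\n"
            f"Marking guidelines: {self.marking_guidelines}\n"
            f"Band descriptors: {self.band_descriptors}\n"
            f"Glossary: {self.glossary}"
        )


def mark_answer(question: str, answer: str,
                context: CurriculumContext, llm: LLM) -> Dict[str, str]:
    """Run four stages in order; each stage consumes the previous stage's output."""
    grounding = context.as_prompt()
    # Stage 1: identify the topics and skills the question targets.
    topics = llm(f"[topics]\n{grounding}\nQuestion: {question}")
    # Stage 2: generate a question-specific rubric from the grounded context.
    rubric = llm(f"[rubric]\n{grounding}\nTopics: {topics}\nQuestion: {question}")
    # Stage 3: derive marking criteria from the rubric.
    criteria = llm(f"[criteria]\n{grounding}\nRubric: {rubric}")
    # Stage 4 (assumed meaning of "evaluate"): apply the criteria to the answer.
    verdict = llm(f"[evaluate]\nCriteria: {criteria}\nAnswer: {answer}")
    # Keep every intermediate output so a reviewer can trace the final mark.
    return {"topics": topics, "rubric": rubric,
            "criteria": criteria, "verdict": verdict}


def echo_llm(prompt: str) -> str:
    """Stand-in model: echoes the stage tag so the data flow is visible."""
    return "output of " + prompt.split("\n", 1)[0]


context = CurriculumContext("(syllabus text)", "(guidelines text)",
                            "(band descriptors)", "(glossary entries)")
result = mark_answer("(question)", "(student answer)", context, echo_llm)
print(result["verdict"])  # prints: output of [evaluate]
```

Returning the intermediate outputs alongside the verdict is a design choice made for this sketch: it keeps each stage inspectable, which is one way to support the traceability the summary emphasizes.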
In practice
- Integrate authorized syllabus documents as structured context.
- Employ staged LLM workflows for rubric and criteria generation.
- Implement verification points for curriculum alignment.
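One possible shape for a verification point is a check that every generated criterion cites at least one authorized artifact and nothing outside the authorized set. The sketch below assumes criteria carry source identifiers; the `Criterion` type, the `verify_alignment` function, and the identifier strings are hypothetical and are not taken from the original article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """A marking criterion plus the artifact IDs it claims as its basis (hypothetical)."""
    text: str
    source_ids: tuple[str, ...]


def verify_alignment(criteria: list[Criterion],
                     authorized_ids: set[str]) -> list[Criterion]:
    """Return the criteria that fail the check.

    A criterion fails if it cites no source at all, or if any cited
    source is not in the authorized set.
    """
    flagged = []
    for criterion in criteria:
        uncited = not criterion.source_ids
        unauthorized = any(s not in authorized_ids for s in criterion.source_ids)
        if uncited or unauthorized:
            flagged.append(criterion)
    return flagged


# Illustrative identifiers only; a real system would use its own artifact keys.
authorized = {"syllabus", "marking_guidelines", "band_descriptors", "glossary"}
criteria = [
    Criterion("Uses the directive verb as the glossary defines it", ("glossary",)),
    Criterion("Matches the top band descriptor", ("band_descriptors", "syllabus")),
    Criterion("Mentions an idea from an external blog", ("blog_post",)),
    Criterion("Has no stated basis", ()),
]
for bad in verify_alignment(criteria, authorized):
    print("needs review:", bad.text)
```

Flagged criteria could be routed to a human reviewer or regenerated before any answer is marked; which of these fits depends on the deployment and is left open here.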
Topics
- LLM-as-Judge
- Automated Assessment
- Curriculum Alignment
- Retrieval-Augmented Generation
- Educational Technology
- Assessment Pipeline
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.