Human-in-the-Loop Benchmarking of Heterogeneous LLMs for Automated Competency Assessment in Secondary Level Mathematics
Summary
A "Human-in-the-Loop" benchmarking framework was developed to assess how well multiple Large Language Models (LLMs) can automate secondary-level mathematics assessment, specifically for Grade 10 Optional Mathematics in Nepal. The framework used a multi-dimensional rubric covering four topics and four competencies: Comprehension, Knowledge, Operational Fluency, and Behavior and Correlation. An ensemble of open-weight models (Eagle Llama 3.1-8B, Orion Llama 3.3-70B) and proprietary frontier models (Nova Gemini 2.5 Flash, Lyra Gemini 3 Pro) was benchmarked against a ground truth established by two senior mathematics faculty members (kappa_w = 0.8652). The findings revealed an "architecture-compatibility gap": the Gemini-based Sparse Mixture-of-Experts (MoE) models achieved "Fair Agreement" with the ground truth (kappa_w ~ 0.38), whereas the larger Orion (70B) model showed "No Agreement" (kappa_w = -0.0261). This suggests that, for rubric-constrained tasks, an architecture's compliance with instruction constraints matters more than raw parameter scale.
Key takeaway
For educators considering LLMs for competency-based assessment, prioritize models whose architectures comply well with instruction constraints (in this study, the Sparse MoE models) over models that are simply larger. LLMs are not yet ready for autonomous certification, but within a "Human-in-the-Loop" framework they can provide assistive support for preliminary evidence extraction and reduce the manual effort of qualitative competency mapping.
Key insights
LLM architecture compatibility with instruction constraints is more critical than model scale for rubric-constrained assessment tasks.
Principles
- Architecture compatibility outweighs parameter scale.
- Human-in-the-loop improves LLM assessment reliability.
Method
A "Human-in-the-Loop" benchmarking framework assesses LLMs using a multi-dimensional rubric and compares their outputs against a human-defined ground truth to measure agreement.
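As an illustration of the agreement step, the following is a minimal sketch of Cohen's weighted kappa between human rubric scores and one model's scores. It is not the authors' code. The summary does not say whether linear or quadratic weights were used, so the weighting scheme is a parameter here, and the function names, the 0-3 rubric scale in the usage note, and the Landis-Koch-style band cut-offs are all assumptions.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, categories, weights="quadratic"):
    """Cohen's weighted kappa for two raters scoring on ordered categories.

    `weights` is "quadratic" or "linear"; the summary does not state which
    scheme the study used, so this choice is an assumption.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty score lists")
    k = len(categories)
    if k < 2:
        raise ValueError("need at least two ordered categories")
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)
    power = 2 if weights == "quadratic" else 1

    # Joint and marginal counts of category indices.
    observed = Counter((index[a], index[b]) for a, b in zip(rater_a, rater_b))
    marg_a = Counter(index[a] for a in rater_a)
    marg_b = Counter(index[b] for b in rater_b)

    obs_disagreement = 0.0
    exp_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            # Disagreement weight grows with distance between categories.
            w = (abs(i - j) / (k - 1)) ** power
            obs_disagreement += w * observed[(i, j)] / n
            exp_disagreement += w * marg_a[i] * marg_b[j] / (n * n)

    if exp_disagreement == 0:
        raise ValueError("kappa is undefined: both raters used one category")
    return 1.0 - obs_disagreement / exp_disagreement


def agreement_band(kappa):
    """Map kappa to a Landis-Koch-style label (assumed cut-offs)."""
    if kappa < 0:
        return "no agreement"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"
```

With a hypothetical 0-3 rubric scale, `weighted_kappa(human_scores, model_scores, categories=[0, 1, 2, 3])` returns one value per model, and `agreement_band` turns it into a label. Under these assumed cut-offs, the values reported in the summary map to "fair" (0.38), "no agreement" (-0.0261), and "almost perfect" (0.8652).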
In practice
- Use Sparse MoE models for rubric-based tasks.
- Integrate human oversight for LLM-based assessments.
Topics
- Competency-Based Education
- LLM Benchmarking
- Secondary Mathematics Assessment
- Human-in-the-Loop
- LLM Architectures
Best for: AI Scientist, Research Scientist, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.