Human-in-the-Loop Benchmarking of Heterogeneous LLMs for Automated Competency Assessment in Secondary Level Mathematics

Source: Artificial Intelligence · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering) · Depth: Expert, quick

Summary

A "Human-in-the-Loop" benchmarking framework was developed to assess how well multiple Large Language Models (LLMs) can automate secondary-level mathematics assessment, specifically for Grade 10 Optional Mathematics in Nepal. The framework used a multi-dimensional rubric covering four topics and four competencies: Comprehension, Knowledge, Operational Fluency, and Behavior and Correlation. An ensemble of open-weight models (Eagle Llama 3.1-8B, Orion Llama 3.3-70B) and proprietary frontier models (Nova Gemini 2.5 Flash, Lyra Gemini 3 Pro) was benchmarked against a ground truth established by two senior mathematics faculty members (inter-rater weighted kappa, kappa_w = 0.8652). The findings revealed an "architecture-compatibility gap": the Gemini-based Sparse Mixture-of-Experts (MoE) models achieved "Fair Agreement" with the human ground truth (kappa_w of about 0.38), whereas the larger of the two open-weight models, Orion (70B), showed "No Agreement" (kappa_w = -0.0261). This suggests that, for rubric-constrained tasks, how well an architecture complies with instruction constraints matters more than raw parameter scale.
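
The kappa_w values quoted in this summary are weighted Cohen's kappa scores. As a reminder of what that statistic measures, a standard textbook definition is sketched below. The summary does not say which weighting scheme was used, so the quadratic weights shown here are an assumption, and the symbols O, E, w, and k are introduced for illustration only.

```latex
\kappa_w = 1 - \frac{\sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\, O_{ij}}
                    {\sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\, E_{ij}},
\qquad
w_{ij} = \frac{(i-j)^2}{(k-1)^2}
```

Here O_ij is the observed proportion of responses scored i by one rater and j by the other, E_ij is the proportion expected by chance from the two raters' marginal score distributions, and k is the number of rubric levels. A value of 1 means perfect agreement, 0 means agreement no better than chance, and a negative value such as the -0.0261 reported for Orion means agreement slightly below chance.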

Key takeaway

For educators considering LLMs for competency-based assessment, prioritize models whose architectures follow instructions reliably, such as the Sparse MoE models in this benchmark, rather than simply choosing the largest model. LLMs are not yet ready for autonomous certification, but within a "Human-in-the-Loop" framework they can provide assistive support for preliminary evidence extraction and reduce the manual effort of qualitative competency mapping.

Key insights

LLM architecture compatibility with instruction constraints is more critical than model scale for rubric-constrained assessment tasks.

Method

A "Human-in-the-Loop" benchmarking framework has each LLM score student work against a multi-dimensional rubric, then compares those scores with a human-defined ground truth and reports the level of agreement as a weighted kappa (kappa_w).
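
The agreement step of this method can be illustrated with a minimal sketch of weighted Cohen's kappa in plain Python. This is not the authors' code. The function name `weighted_kappa`, the integer score encoding (0 to `num_levels - 1`), the quadratic default weighting, and the sample scores are all assumptions made for illustration; the summary does not specify the rubric scale or the weighting scheme.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, num_levels, weighting="quadratic"):
    """Weighted Cohen's kappa for two raters scoring the same items.

    Scores are assumed to be integers in 0 .. num_levels - 1 (hypothetical
    encoding of rubric levels). `weighting` is "quadratic" or "linear".
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must score the same non-empty set of items")
    if num_levels < 2:
        raise ValueError("need at least two rubric levels")
    if weighting not in ("quadratic", "linear"):
        raise ValueError("weighting must be 'quadratic' or 'linear'")

    n = len(rater_a)
    power = 2 if weighting == "quadratic" else 1

    def weight(i, j):
        # Disagreement weight: 0 on the diagonal, 1 at maximum distance.
        return (abs(i - j) / (num_levels - 1)) ** power

    observed = Counter(zip(rater_a, rater_b))  # joint score counts
    marg_a = Counter(rater_a)                  # marginal counts, rater A
    marg_b = Counter(rater_b)                  # marginal counts, rater B

    # Weighted disagreement actually observed.
    obs = sum(weight(i, j) * c / n for (i, j), c in observed.items())
    # Weighted disagreement expected by chance from the marginals.
    exp = sum(
        weight(i, j) * marg_a[i] * marg_b[j] / (n * n)
        for i in marg_a
        for j in marg_b
    )
    if exp == 0:
        raise ValueError("kappa is undefined when both raters use one identical level")
    return 1 - obs / exp


if __name__ == "__main__":
    # Made-up scores on a hypothetical 4-level rubric, for illustration only.
    human = [3, 2, 2, 0, 1, 3, 1, 0]
    model = [3, 1, 2, 1, 1, 2, 0, 0]
    print(f"kappa_w = {weighted_kappa(human, model, num_levels=4):.4f}")
```

In a benchmark like the one described, the same function would be applied once to the two faculty raters (to validate the ground truth) and once per model against that ground truth, so all reported kappa_w values are on the same scale.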

Best for: AI Scientist, Research Scientist, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.