Beyond Grading Accuracy: Exploring Alignment of TAs and LLMs

· Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, short

Summary

A study investigates the efficacy of open-source Large Language Models (LLMs) for grading Unified Modeling Language (UML) class diagrams, focusing on transparency and cost-effectiveness for academic institutions. Researchers developed a grading pipeline that compares evaluations from teaching assistants (TAs) and six open-source LLMs on 92 student-generated UML class diagrams from a software design course. Unlike prior work, this approach assesses performance at the level of individual grading criteria, giving granular insight into where LLMs and humans agree. The quantitative study reports per-criterion accuracy reaching 88.56% and a Pearson correlation coefficient of up to 0.78, outperforming previous efforts with open-source models. These results indicate that open-source LLMs can support UML class diagram grading and offer a viable path toward mixed-initiative grading systems that help TAs manage growing student workloads.
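The Pearson correlation coefficient reported above measures how linearly two sets of scores move together: 1 means perfect positive agreement, 0 means no linear relationship. A minimal sketch of the computation, assuming TA and LLM total scores for each diagram are available as parallel lists. The function name `pearson_r` and the sample scores are hypothetical and are not data from the study.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations (covariance numerator) and the two sums of squares
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical total scores for five diagrams (illustrative only)
ta_scores = [8.0, 6.5, 9.0, 4.0, 7.0]
llm_scores = [7.5, 6.0, 9.0, 5.0, 7.5]
print(f"r = {pearson_r(ta_scores, llm_scores):.2f}")
```

On Python 3.10 and later, `statistics.correlation` from the standard library computes the same Pearson coefficient and can replace the hand-written function.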

Key takeaway

For university educators and software engineering instructors facing increasing student enrollments, integrating open-source LLMs into your grading workflow for UML class diagrams is a practical option. Using LLMs for initial criterion-level assessments could reduce TA workload, freeing TAs to focus on nuanced feedback. Consider piloting a mixed-initiative grading system in which LLMs provide a baseline, with the aim of improving grading efficiency and consistency while retaining the transparency and low cost of open-source models.

Key insights

Open-source LLMs can grade UML class diagrams with high per-criterion accuracy relative to TAs, making them useful aids for TA grading.

Method

The proposed grading pipeline involves independent evaluation of student UML class diagrams by TAs and LLMs, followed by comparison of grades at the individual criterion level.
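The criterion-level comparison can be sketched as follows, assuming each grader (TA or LLM) assigns one score per criterion per diagram and that "accuracy" means the share of diagrams on which the LLM's score for a criterion exactly matches the TA's. The study's exact accuracy definition and rubric are not stated here, so the criterion names, the `Grades` layout, and the data below are hypothetical.

```python
from collections import defaultdict

# Hypothetical layout: (diagram_id, criterion) -> points awarded
Grades = dict[tuple[str, str], float]


def per_criterion_accuracy(ta: Grades, llm: Grades) -> dict[str, float]:
    """Fraction of diagrams where the LLM score equals the TA score, per criterion."""
    matches: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for key, ta_score in ta.items():
        if key not in llm:
            continue  # skip criteria the LLM did not grade
        _, criterion = key
        totals[criterion] += 1
        if llm[key] == ta_score:
            matches[criterion] += 1
    return {c: matches[c] / totals[c] for c in totals}


def overall_accuracy(ta: Grades, llm: Grades) -> float:
    """Agreement pooled over every (diagram, criterion) pair graded by both."""
    shared = [k for k in ta if k in llm]
    if not shared:
        raise ValueError("no overlapping grades")
    return sum(ta[k] == llm[k] for k in shared) / len(shared)


# Hypothetical rubric criteria and grades for two diagrams
ta_grades: Grades = {
    ("d1", "classes"): 2, ("d1", "associations"): 1, ("d1", "multiplicities"): 1,
    ("d2", "classes"): 2, ("d2", "associations"): 2, ("d2", "multiplicities"): 0,
}
llm_grades: Grades = {
    ("d1", "classes"): 2, ("d1", "associations"): 1, ("d1", "multiplicities"): 0,
    ("d2", "classes"): 2, ("d2", "associations"): 1, ("d2", "multiplicities"): 0,
}
print(per_criterion_accuracy(ta_grades, llm_grades))
print(overall_accuracy(ta_grades, llm_grades))
```

Keeping the per-criterion breakdown separate from the pooled figure shows which rubric items the LLM handles well and which ones a TA should double-check. Exact matching is a strict choice; rubrics with partial credit might instead count a match within a tolerance.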

Best for: AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.