From Scoring to Explanations: Evaluating SHAP and LLM Rationales for Rubric-based Teaching Quality Assessment

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Research Methodology & Innovation · Depth: Expert, extended

Summary

A new framework evaluates sentence-level interpretability for rubric-based teaching quality assessment, combining Shapley value attributions (SHAP) with large language model (LLM) rationales. The study instantiates the framework on the Quality of Feedback dimension of the CLASS framework, using 6,005 segments from the NCTE corpus, and compares fine-tuned pretrained language models (PLMs) with prompted LLMs. Fine-tuned PLMs such as DeBERTaV3 large predicted more accurately (MAE 0.96, MSE 1.31) but showed label compression: their predictions stayed in the mid-range (2.03 to 5.89). LLMs were less accurate (best MAE 1.02, MSE 1.78) but produced broader score distributions. In deletion-based tests, SHAP explanations were significantly more faithful, causing larger and more consistent prediction shifts than LLM rationales. Cross-model analysis further showed that SHAP explanations transfer robustly across architectures, whereas LLM rationales had limited and inconsistent influence on PLM predictions.
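To make the reported metrics concrete, here is a minimal sketch of how MAE, MSE, and the predicted score range could be computed for a scorer. The labels and predictions are made-up illustrative numbers, not data from the study, and the helper names are hypothetical.

```python
from typing import Sequence, Tuple


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error between reference and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error between reference and predicted scores."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def score_range(values: Sequence[float]) -> Tuple[float, float]:
    """Smallest and largest value; a narrow predicted range hints at label compression."""
    return (min(values), max(values))


# Illustrative numbers only (assumption): reference scores and model predictions.
y_true = [1, 4, 7, 2]
y_pred = [3, 4, 5, 3]

print("MAE:", mae(y_true, y_pred))
print("MSE:", mse(y_true, y_pred))
print("reference range:", score_range(y_true))
print("predicted range:", score_range(y_pred))
```

In this toy case the predictions span a narrower range than the reference scores, which is the pattern the summary calls label compression: a model can keep its average error low while never predicting the extremes.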

Key takeaway

Research scientists developing automated assessment tools for high-stakes educational settings should prioritize Shapley value attributions (SHAP) when they need faithful and transferable explanations. LLMs offer flexibility, but in this study their rationales were less faithful and less consistent, and they did not reliably reflect the model's decision process. Rigorously validate any LLM-generated explanation before relying on it, to preserve transparency and reduce the risk of unfaithful justifications, especially given regulatory demands for explainable AI.

Key insights

SHAP offers more faithful and transferable explanations for rubric-based scoring than LLM-generated rationales.

Principles

Method

A framework combines SHAP attributions and LLM rationales for sentence-level interpretability, evaluated via deletion-based tests and cross-model transfer analysis.
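A deletion-based test of this kind can be sketched as follows: remove the sentences an explanation points to, re-score the segment, and measure how far the prediction moves. This is a minimal illustration under stated assumptions, not the paper's implementation. The scorer `toy_score`, the cue words, the attribution values, and the rationale indices are all hypothetical stand-ins; in a real setup the scorer would be a fine-tuned PLM and the attributions would come from a SHAP explainer.

```python
from typing import Callable, List, Sequence

Scorer = Callable[[List[str]], float]


def deletion_shift(sentences: Sequence[str], flagged: Sequence[int], score: Scorer) -> float:
    """Absolute change in the predicted score after deleting the flagged sentences."""
    flagged_set = set(flagged)
    kept = [s for i, s in enumerate(sentences) if i not in flagged_set]
    return abs(score(list(sentences)) - score(kept))


def top_k(attributions: Sequence[float], k: int) -> List[int]:
    """Indices of the k sentences with the largest absolute attribution."""
    order = sorted(range(len(attributions)), key=lambda i: abs(attributions[i]), reverse=True)
    return order[:k]


# Hypothetical stand-in scorer: counts sentences containing feedback-like cue
# words and maps the count onto a score capped at 7.
CUES = ("why", "how", "explain")


def toy_score(sentences: List[str]) -> float:
    hits = sum(any(cue in s.lower() for cue in CUES) for s in sentences)
    return float(min(7, 1 + 2 * hits))


segment = [
    "Open your books to page ten.",
    "Can you explain why that works?",
    "Good job.",
    "How did you get that answer?",
]
attributions = [0.0, 2.0, 0.0, 2.0]  # illustrative sentence-level attributions (assumption)
rationale = [2]  # sentence indices cited by a hypothetical LLM rationale

shap_shift = deletion_shift(segment, top_k(attributions, 1), toy_score)
rationale_shift = deletion_shift(segment, rationale, toy_score)
print("shift after deleting top-attributed sentence:", shap_shift)
print("shift after deleting rationale sentence:", rationale_shift)
```

A larger shift means the explanation pointed at sentences the scorer actually depends on. Cross-model transfer can be checked with the same function by passing a different model's scorer while keeping the flagged indices fixed.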

Best for: AI Scientist, Research Scientist, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.