From Scoring to Explanations: Evaluating SHAP and LLM Rationales for Rubric-based Teaching Quality Assessment

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Research Methodology & Innovation · Depth: Expert, extended

Summary

A new framework evaluates sentence-level interpretability for rubric-based teaching quality assessment by combining Shapley value attributions (SHAP) with large language model (LLM) rationales. Instantiated on the Quality of Feedback dimension of the CLASS framework using 6,005 segments from the NCTE corpus, the study compares fine-tuned pretrained language models (PLMs) with prompted LLMs. Fine-tuned PLMs such as DeBERTaV3 large achieved the best prediction accuracy (MAE 0.96, MSE 1.31) but showed label compression, with predictions confined to the mid-range (2.03-5.89). LLMs were less accurate (best MAE 1.02, MSE 1.78) but produced broader score distributions. In deletion-based tests, SHAP explanations were significantly more faithful: removing the sentences they flagged caused larger and more consistent prediction shifts than removing the sentences cited in LLM rationales. Cross-model analysis further showed that SHAP attributions transfer robustly across architectures, whereas LLM rationales had limited and inconsistent influence on PLM predictions.

Key takeaway

Research scientists developing automated assessment tools for high-stakes educational settings should prioritize Shapley value attributions (SHAP) when they need faithful and transferable explanations. LLMs offer flexibility, but their rationales were often unreliable and inconsistent in this study and did not accurately reflect model decision processes. Validate any LLM-generated explanation rigorously before relying on it, both to keep the system transparent and to reduce the risk of unfaithful justifications, especially given regulatory demands for explainable AI.

Key insights

SHAP offers more faithful and transferable explanations for rubric-based scoring than LLM-generated rationales.

Principles

Fine-tuned PLMs prioritize scoring accuracy, while LLMs offer broader score distributions.
Explanation faithfulness is best measured by causal impact on model predictions.
LLM rationales often lack fidelity to underlying model decision processes.
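
The accuracy-versus-spread trade-off in the first principle can be illustrated with a small calculation. The sketch below is illustrative only: the gold labels and both prediction lists are invented toy values on an assumed 1-7 rubric scale, not data from the study, and the helper names are hypothetical. It shows how a model can score well on MAE and MSE while never predicting the extremes of the scale.

```python
def mae(y_true, y_pred):
    """Mean absolute error between gold and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error between gold and predicted scores."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def score_range(values):
    """Smallest and largest predicted score; a narrow range suggests label compression."""
    return min(values), max(values)


# Toy data (assumption, not from the paper): gold scores on a 1-7 scale.
gold = [1, 3, 4, 5, 7]
compressed = [3, 3, 4, 5, 5]  # lower error, but never predicts 1, 2, 6, or 7
spread = [1, 5, 2, 7, 7]      # higher error, but covers the full scale

for name, preds in [("compressed", compressed), ("spread", spread)]:
    print(name, mae(gold, preds), mse(gold, preds), score_range(preds))
```

Reporting the predicted range alongside the error metrics, as the summary does, makes this kind of compression visible where MAE and MSE alone would hide it.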

Method

A framework combines SHAP attributions and LLM rationales for sentence-level interpretability, evaluated via deletion-based tests and cross-model transfer analysis.
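
To make sentence-level Shapley attribution concrete, here is a minimal sketch that computes exact Shapley values by enumerating every subset of sentences. It is not the study's pipeline and does not use the shap package: `toy_score` is a hypothetical stand-in for a fine-tuned scorer, and exact enumeration is only feasible for short segments, which is why practical tools approximate it.

```python
from itertools import combinations
from math import factorial


def toy_score(sentences):
    """Hypothetical scorer (assumption): rewards feedback-like cue words."""
    score = 1.0
    for s in sentences:
        text = s.lower()
        if "why" in text:
            score += 1.5
        if "good" in text:
            score += 0.5
    return score


def shapley_values(sentences, score_fn):
    """Exact Shapley value of each sentence with respect to score_fn."""
    n = len(sentences)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without_i = [sentences[j] for j in sorted(subset)]
                with_i = [sentences[j] for j in sorted(subset + (i,))]
                values[i] += weight * (score_fn(with_i) - score_fn(without_i))
    return values


segment = ["Open your books.", "Why do you think that works?", "Good job."]
attributions = shapley_values(segment, toy_score)
print(attributions)
```

Because the attributions sum to the difference between the full-segment score and the empty-input score, each sentence's value can be read as its share of the predicted rubric score.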

In practice

Employ SHAP for robust, faithful explanations in PLM-based rubric scoring.
Validate LLM rationales rigorously; they may not reflect true model behavior.
Measure explanation faithfulness by observing prediction shifts after removing key sentences.
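
The deletion-based check in the last point can be sketched as follows. This is an illustration under stated assumptions rather than the study's protocol: `toy_score` is a hypothetical scorer, the attribution values are supplied by hand, sentences are ranked by absolute attribution, and a random-deletion baseline is added for comparison. A faithful explanation should shift the prediction more than deleting sentences at random.

```python
import random


def toy_score(sentences):
    """Hypothetical scorer (assumption): rewards feedback-like cue words."""
    score = 1.0
    for s in sentences:
        text = s.lower()
        if "why" in text:
            score += 1.5
        if "good" in text:
            score += 0.5
    return score


def deletion_shift(sentences, attributions, score_fn, k=1):
    """Absolute score change after deleting the k sentences with largest |attribution|."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: abs(attributions[i]), reverse=True)
    drop = set(ranked[:k])
    kept = [s for i, s in enumerate(sentences) if i not in drop]
    return abs(score_fn(sentences) - score_fn(kept))


def random_deletion_shift(sentences, score_fn, k=1, trials=200, seed=0):
    """Mean absolute score change after deleting k randomly chosen sentences."""
    rng = random.Random(seed)
    full = score_fn(sentences)
    total = 0.0
    for _ in range(trials):
        drop = set(rng.sample(range(len(sentences)), k))
        kept = [s for i, s in enumerate(sentences) if i not in drop]
        total += abs(full - score_fn(kept))
    return total / trials


segment = ["Open your books.", "Why do you think that works?", "Good job."]
attributions = [0.0, 1.5, 0.5]  # hand-supplied for illustration
top_shift = deletion_shift(segment, attributions, toy_score, k=1)
baseline = random_deletion_shift(segment, toy_score, k=1)
print(top_shift, baseline)
```

The same function can be applied to sentences named in an LLM rationale by giving those sentences a nonzero attribution, which allows the two explanation types to be compared on one scale.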

Topics

Explainable AI
Shapley Values
Large Language Models
Rubric-based Scoring
Teaching Quality Assessment
Explanation Faithfulness

Best for: AI Scientist, Research Scientist, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.