From Scoring to Explanations: Evaluating SHAP and LLM Rationales for Rubric-based Teaching Quality Assessment
Summary
A new framework evaluates sentence-level interpretability for rubric-based teaching quality assessment, combining Shapley value attributions (SHAP) with large language model (LLM) rationales. Instantiated on the Quality of Feedback dimension of the CLASS framework using 6,005 segments from the NCTE corpus, the study compares fine-tuned pretrained language models (PLMs) with prompted LLMs. Fine-tuned PLMs such as DeBERTaV3 large achieved the best prediction accuracy (MAE 0.96, MSE 1.31) but showed label compression, with predictions confined to mid-range scores (2.03-5.89). LLMs were less accurate (best MAE 1.02, MSE 1.78) but produced broader score distributions. In deletion-based tests, SHAP explanations were significantly more faithful: removing the sentences they flagged caused larger and more consistent prediction shifts than removing the sentences cited in LLM rationales. Cross-model analysis further showed that SHAP explanations transfer robustly across architectures, whereas LLM rationales had limited and inconsistent influence on PLM predictions.
Key takeaway
Research scientists developing automated assessment tools for high-stakes educational settings should prioritize Shapley value attributions (SHAP) when they need faithful and transferable explanations. LLMs offer flexibility, but their rationales are often unreliable and inconsistent and may not reflect the model's actual decision process. Rigorously validate any LLM-generated explanation before relying on it, both to ensure transparency and to reduce the risk of unfaithful justifications, especially given regulatory demands for explainable AI.
Key insights
SHAP offers more faithful and transferable explanations for rubric-based scoring than LLM-generated rationales.
Principles
- Fine-tuned PLMs prioritize scoring accuracy, while LLMs offer broader score distributions.
- Explanation faithfulness is best measured by causal impact on model predictions.
- LLM rationales often lack fidelity to underlying model decision processes.
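The trade-off between accuracy and score spread in the first principle can be illustrated with a minimal Python sketch. The gold scores and both prediction sets below are invented for illustration and are not data from the study; the 1 to 7 scale is an assumption. The sketch only shows how a model can have lower MAE and MSE while covering a narrower range of scores.

```python
def mae(y_true, y_pred):
    """Mean absolute error between gold and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error between gold and predicted scores."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def score_range(y_pred):
    """Smallest and largest predicted score."""
    return min(y_pred), max(y_pred)


# Hypothetical values, assuming a 1-7 rubric scale (not results from the paper).
gold = [1.0, 3.0, 4.0, 5.0, 7.0]
compressed = [3.0, 3.0, 4.0, 5.0, 5.0]  # accurate on average, never extreme
spread = [1.0, 5.0, 2.0, 7.0, 7.0]      # hits the extremes, noisier in the middle

for name, preds in [("compressed", compressed), ("spread", spread)]:
    print(name, mae(gold, preds), mse(gold, preds), score_range(preds))
```

Reporting the predicted range alongside MAE and MSE makes label compression visible, since error metrics alone do not show whether a model ever predicts the lowest or highest rubric scores.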
Method
A framework combines SHAP attributions and LLM rationales for sentence-level interpretability, evaluated via deletion-based tests and cross-model transfer analysis.
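To make sentence-level Shapley attribution concrete, the following is a minimal sketch that computes exact Shapley values by enumerating every subset of sentences. It is not the paper's pipeline and does not use the SHAP library, which relies on approximations for real models; `toy_score` is a hypothetical keyword-based stand-in for a fine-tuned scorer, and exact enumeration is only feasible for a handful of sentences.

```python
from itertools import combinations
from math import factorial


def toy_score(sentences):
    """Hypothetical scorer (an assumption, not a real model): a base score
    plus bonuses for sentences containing certain keywords."""
    score = 1.0
    for s in sentences:
        text = s.lower()
        if "why" in text:
            score += 2.0
        if "good" in text:
            score += 1.0
    return score


def shapley_values(sentences, score_fn):
    """Exact Shapley value of each sentence with respect to score_fn.

    Cost grows as 2^n in the number of sentences.
    """
    n = len(sentences)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude sentence i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                without_i = [sentences[j] for j in subset]
                with_i = [sentences[j] for j in sorted(subset + (i,))]
                values[i] += weight * (score_fn(with_i) - score_fn(without_i))
    return values


segment = ["Good job.", "Why does that work?", "Open your books."]
attributions = shapley_values(segment, toy_score)
for sentence, value in zip(segment, attributions):
    print(f"{value:+.2f}  {sentence}")
```

The attributions sum to the difference between the score of the full segment and the score of an empty segment, which is the property that makes them usable as an accounting of how each sentence contributes to the prediction.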
In practice
- Employ SHAP for robust, faithful explanations in PLM-based rubric scoring.
- Validate LLM rationales rigorously; they may not reflect true model behavior.
- Measure explanation faithfulness by observing prediction shifts after removing key sentences.
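The last recommendation can be sketched as a simple deletion check: remove the sentences an explanation flags, rescore the segment, and record how far the prediction moves. The code below is an illustrative sketch under stated assumptions, not the study's evaluation code; `keyword_scorer` is a hypothetical stand-in for a trained scorer, and the two sets of flagged sentence indices are invented examples of an attribution-based and a rationale-based explanation.

```python
def keyword_scorer(sentences):
    """Hypothetical scorer (an assumption): base score plus a bonus for each
    sentence that asks a 'why' question."""
    return 1.0 + sum(2.0 for s in sentences if "why" in s.lower())


def deletion_shift(sentences, score_fn, flagged):
    """Absolute change in the prediction after deleting the flagged sentences.

    `flagged` is a set of sentence indices named by an explanation.
    """
    kept = [s for idx, s in enumerate(sentences) if idx not in flagged]
    return abs(score_fn(sentences) - score_fn(kept))


transcript = [
    "Good job.",
    "Why does that work?",
    "Open your books.",
]

# Invented explanations: each names the sentence it claims drove the score.
attribution_flagged = {1}
rationale_flagged = {2}

shift_attribution = deletion_shift(transcript, keyword_scorer, attribution_flagged)
shift_rationale = deletion_shift(transcript, keyword_scorer, rationale_flagged)
print("attribution shift:", shift_attribution)
print("rationale shift:", shift_rationale)
```

A larger shift means the flagged sentences actually mattered to the scorer. Running the same check with sentences flagged for one model against a different model's predictions gives a rough analogue of the cross-model transfer analysis described in the summary.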
Topics
- Explainable AI
- Shapley Values
- Large Language Models
- Rubric-based Scoring
- Teaching Quality Assessment
- Explanation Faithfulness
Best for: AI Scientist, Research Scientist, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.