Unsupervised Evaluation of Explanations for Hate Speech Classification in Portuguese
Summary
The paper introduces a framework for the unsupervised evaluation of explanation faithfulness in Portuguese hate speech detection models. It addresses two problems in Explainable AI (XAI): annotated explanation data is scarce, and there is no standardized way to validate explanations. The framework compares model performance on the original inputs with performance after the keywords identified as explanatory have been removed. Experiments used ensemble classifiers, several keyword selection strategies, and XAI methods such as SHAP and LIME. Large Language Models (LLMs) were also investigated as both classifiers and explainers. Results indicate that removing explanatory keywords degrades model performance significantly more than removing random words, which supports the faithfulness of the explanations. SHAP and LIME consistently produced more faithful explanations than LLM-generated or manual alternatives, though the size of the effect varied with the keyword selection strategy.
Key takeaway
If you develop or deploy hate speech classification models in Portuguese, integrate unsupervised evaluation protocols to assess explanation faithfulness. This approach validates XAI methods without relying on scarce annotated data, helps confirm that your explanations reflect the model's actual decision-making, and highlights the current limitations of generative LLM explanations for this task.
Key insights
An unsupervised framework evaluates XAI explanation faithfulness by measuring model performance degradation upon keyword removal.
Principles
- Faithful explanations identify features critical to model performance.
- Performance degradation signals explanation faithfulness.
Method
The method predicts on the original input, removes the explanatory keywords, predicts again on the modified input, and uses the difference in performance as the evaluation signal.
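The remove-and-compare loop can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the lexicon-based `predict` function, the example sentences, the use of accuracy as the performance metric, and the function name `faithfulness_gap` are all assumptions made for the example. A real setup would plug in the trained classifier and the keywords returned by an explainer.

```python
import random

# Hypothetical toy classifier (an assumption for this sketch):
# it flags a text as hateful if it contains any lexicon word.
LEXICON = {"idiota", "lixo"}

def predict(text):
    return int(any(tok in LEXICON for tok in text.lower().split()))

def remove_words(text, words):
    """Drop every token of `text` that appears in `words` (case-insensitive)."""
    drop = {w.lower() for w in words}
    return " ".join(t for t in text.split() if t.lower() not in drop)

def accuracy(texts, labels):
    return sum(predict(t) == y for t, y in zip(texts, labels)) / len(labels)

def faithfulness_gap(texts, labels, keywords_per_text, seed=0):
    """Return (drop after removing explanatory keywords,
               drop after removing the same number of random words)."""
    rng = random.Random(seed)
    base = accuracy(texts, labels)

    # Step 1: remove the keywords the explainer marked as important.
    explained = [remove_words(t, k) for t, k in zip(texts, keywords_per_text)]

    # Step 2: random baseline, removing as many words as the explainer did.
    randomized = []
    for t, k in zip(texts, keywords_per_text):
        toks = t.split()
        n = min(len(k), len(toks))
        randomized.append(remove_words(t, rng.sample(toks, n)))

    return base - accuracy(explained, labels), base - accuracy(randomized, labels)

# Illustrative data; the keyword lists stand in for explainer output.
texts = ["você é um idiota", "bom dia a todos",
         "isso é lixo puro", "gosto muito de café"]
labels = [1, 0, 1, 0]
keywords = [["idiota"], [], ["lixo"], []]

drop_explained, drop_random = faithfulness_gap(texts, labels, keywords)
print(drop_explained, drop_random)
```

Under this reading, an explanation counts as faithful when the first value is clearly larger than the second, meaning the removed keywords mattered more to the model than arbitrary words did.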
In practice
- Use SHAP or LIME for more faithful explanations.
- Account for the keyword selection strategy, since it changes how strongly keyword removal affects performance.
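To make the second point concrete, here is a small sketch of two ways keywords might be selected from per-token attribution scores such as those SHAP or LIME produce. The summary does not say which selection strategies the paper used, so the top-k and threshold rules below, the function names, and the scores are hypothetical and serve only to show why the choice matters.

```python
# Hypothetical per-token attribution scores (illustrative values only,
# not output from a real SHAP or LIME run).
scores = {"você": 0.02, "é": 0.01, "um": 0.00, "idiota": 0.85}

def top_k_keywords(scores, k):
    """Keep the k tokens with the highest attribution score."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:k]]

def threshold_keywords(scores, min_score):
    """Keep every token whose attribution score reaches min_score."""
    return [word for word, s in scores.items() if s >= min_score]

top1 = top_k_keywords(scores, 1)
top2 = top_k_keywords(scores, 2)
above_half = threshold_keywords(scores, 0.5)
print(top1, top2, above_half)
```

A fixed k removes the same number of words from every text, including weakly attributed ones, while a threshold removes only strongly attributed words and may remove none. Either choice changes how much text is deleted, which is one plausible reason measured faithfulness shifts with the selection strategy.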
Topics
- Hate Speech Classification
- Explainable AI
- Unsupervised Explanation Evaluation
- SHAP and LIME
- Large Language Models
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.