Unsupervised Evaluation of Explanations for Hate Speech Classification in Portuguese

2026-04-12 · Source: Paper Index on ACL Anthology · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Natural Language Processing · Depth: Advanced, quick

Summary

This paper introduces a framework for the unsupervised evaluation of explanation faithfulness in Portuguese hate speech detection models. The framework addresses two gaps in Explainable AI (XAI): scarce annotated explanation data and the lack of standardized validation. It compares model performance on the original inputs with performance after the identified explanatory keywords are removed. The experiments used ensemble classifiers, several keyword selection strategies, and XAI methods such as SHAP and LIME. Large Language Models (LLMs) were also investigated as both classifiers and explainers. Results indicate that removing explanatory keywords degrades model performance significantly more than removing random words, which supports the faithfulness of the explanations. SHAP and LIME consistently produced more faithful explanations than LLM-generated or manual alternatives, though the size of the effect varied with the keyword selection strategy.

Key takeaway

Research scientists developing or deploying hate speech classification models in Portuguese should integrate unsupervised evaluation protocols to assess explanation faithfulness. These protocols validate XAI methods without relying on scarce annotated data, check that explanations reflect the model's actual decision-making, and expose the current limitations of generative LLM explanations for this task.

Key insights

An unsupervised framework evaluates XAI explanation faithfulness by measuring model performance degradation upon keyword removal.

Principles

Faithful explanations identify features critical to model performance.
Performance degradation signals explanation faithfulness.

Method

The method predicts on the original input, removes the explanatory keywords, and predicts again on the modified input. The difference in performance between the two runs serves as the evaluation signal.
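The removal-based check described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the classifier interface, the accuracy metric, the toy data, the placeholder token `SLUR`, and the choice to remove the same number of random words as keywords are all hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: any function mapping a text to a label (1 = hate, 0 = not).
Classifier = Callable[[str], int]


def remove_words(text: str, words: Sequence[str]) -> str:
    """Drop every token that matches one of `words` (case-insensitive)."""
    drop = {w.lower() for w in words}
    return " ".join(t for t in text.split() if t.lower() not in drop)


def remove_random_words(text: str, k: int, rng: random.Random) -> str:
    """Baseline: drop k tokens chosen uniformly at random."""
    tokens = text.split()
    k = min(k, len(tokens))
    drop_idx = set(rng.sample(range(len(tokens)), k))
    return " ".join(t for i, t in enumerate(tokens) if i not in drop_idx)


def accuracy(clf: Classifier, texts: List[str], labels: List[int]) -> float:
    """Fraction of texts whose predicted label equals the gold label."""
    return sum(clf(t) == y for t, y in zip(texts, labels)) / len(labels)


def degradation(
    clf: Classifier,
    texts: List[str],
    labels: List[int],
    keywords_per_text: List[List[str]],
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (accuracy drop after keyword removal, drop after random removal)."""
    rng = random.Random(seed)
    base = accuracy(clf, texts, labels)
    no_kw = [remove_words(t, kw) for t, kw in zip(texts, keywords_per_text)]
    no_rand = [
        remove_random_words(t, len(kw), rng)
        for t, kw in zip(texts, keywords_per_text)
    ]
    return base - accuracy(clf, no_kw, labels), base - accuracy(clf, no_rand, labels)


# Toy stand-in classifier (assumption): flags a text if it contains the placeholder token.
def toy_clf(text: str) -> int:
    return int("slur" in text.lower().split())


texts = ["you are a SLUR", "have a nice day"]
labels = [1, 0]
keywords = [["SLUR"], ["nice"]]  # keywords an explainer might return (made up)

kw_drop, rand_drop = degradation(toy_clf, texts, labels, keywords)
print(f"drop after keyword removal: {kw_drop:.2f}")
print(f"drop after random removal:  {rand_drop:.2f}")
```

Under this reading, an explanation looks faithful when the keyword-removal drop is clearly larger than the random-removal drop. On real data the comparison would be averaged over many examples and random seeds.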

In practice

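Prefer SHAP or LIME when faithfulness matters: in the reported experiments they were more faithful than LLM-generated or manual explanations.
Account for the keyword selection strategy, since it changes how strongly the selected keywords affect model performance.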
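To illustrate why the selection strategy matters, the sketch below turns one set of token attributions into three different keyword lists. The summary does not name the strategies the paper used, so the three shown here (top-k by magnitude, absolute threshold, positive-only) are hypothetical examples, and the scores are invented values of the kind a method like SHAP or LIME might assign.

```python
from typing import List, Tuple

# (token, attribution score); the values are invented for illustration.
Attribution = Tuple[str, float]


def top_k(attributions: List[Attribution], k: int) -> List[str]:
    """Keep the k tokens with the largest absolute score."""
    ranked = sorted(attributions, key=lambda a: abs(a[1]), reverse=True)
    return [tok for tok, _ in ranked[:k]]


def above_threshold(attributions: List[Attribution], threshold: float) -> List[str]:
    """Keep every token whose absolute score reaches the threshold."""
    return [tok for tok, score in attributions if abs(score) >= threshold]


def positive_only(attributions: List[Attribution], k: int) -> List[str]:
    """Keep up to k tokens that push the prediction toward the positive class."""
    positive = [a for a in attributions if a[1] > 0]
    ranked = sorted(positive, key=lambda a: a[1], reverse=True)
    return [tok for tok, _ in ranked[:k]]


scores = [("you", 0.05), ("are", -0.02), ("a", 0.01), ("SLUR", 0.80), ("nice", -0.30)]

print(top_k(scores, 2))              # ['SLUR', 'nice']
print(above_threshold(scores, 0.05)) # ['you', 'SLUR', 'nice']
print(positive_only(scores, 2))      # ['SLUR', 'you']
```

The same attributions yield three different sets of words to remove, so the measured performance drop can differ even when the underlying explainer is unchanged.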

Topics

Hate Speech Classification
Explainable AI
Unsupervised Explanation Evaluation
SHAP and LIME
Large Language Models

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.