LLM-Based Visual Explanation Evaluation Framework for Assessing the Explainability of Facial Skin Disease Classification Models
Summary
This study proposes a domain-specific, LLM-based Visual Explanation Evaluation Framework for assessing Grad-CAM explanations produced by facial skin disease classification models. Prior work on data augmentation has focused mainly on classification performance; this study instead examines systematically whether model explanations align with clinically relevant lesion regions. Geometric, color-based, and mixed augmentation strategies were applied to classifiers based on EfficientNet-B0, MobileNetV3, and ResNet18, and Grad-CAM was used to generate visual explanations of each model's decisions. An LLM-as-a-Judge framework using GPT-5.5, Gemini 3.5 Flash, and Claude Sonnet 4.6 then evaluated those explanations for lesion localization and trustworthiness. To improve evaluation consistency and clinical grounding, the prompts were built progressively, adding evaluation rubrics, clinical knowledge, penalty rules, and structured output.
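The progressive prompt strategy can be pictured as a base task that gains one component per stage. The sketch below is only an illustration under stated assumptions: the summary names the four components but not their wording, so `BASE_TASK`, the stage texts, the 1 to 5 scale, and the JSON field names are all hypothetical placeholders.

```python
# Hypothetical base instruction; the source does not give the actual prompt text.
BASE_TASK = (
    "You are shown a facial skin image overlaid with a Grad-CAM heatmap. "
    "Judge how well the heatmap explains the model's prediction."
)

# The four components named in the summary, in the order listed there.
# Every text below is a placeholder, not the authors' wording.
STAGES = [
    ("rubric",
     "Score lesion localization and trustworthiness from 1 (poor) to 5 (excellent)."),
    ("clinical_knowledge",
     "<clinical description of the target disease and its typical lesion sites>"),
    ("penalty_rules",
     "Lower the score when the heatmap concentrates on regions outside the lesion."),
    ("structured_output",
     'Reply with JSON only: {"localization": int, "trustworthiness": int, "rationale": str}'),
]


def build_prompt(level: int) -> str:
    """Return the base task followed by the first `level` components."""
    if not 0 <= level <= len(STAGES):
        raise ValueError(f"level must be between 0 and {len(STAGES)}")
    parts = [BASE_TASK]
    for name, text in STAGES[:level]:
        parts.append(f"[{name}]\n{text}")
    return "\n\n".join(parts)


def progressive_prompts() -> list:
    """Return every prompt variant, from the bare task to the full prompt."""
    return [build_prompt(level) for level in range(len(STAGES) + 1)]
```

Running the same Grad-CAM image through each variant would show how much each added component changes the judges' scores.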
Key takeaway
If you are developing explainable AI for medical imaging, consider an LLM-as-a-Judge framework for validating visual explanations such as Grad-CAM. It checks whether a model's decisions rest on clinically relevant regions, which classification metrics alone do not reveal. Progressive prompt engineering with clinical rubrics and penalty rules can make the evaluation more consistent and trustworthy, which matters most when models are deployed for sensitive diagnostic applications.
Key insights
An LLM-based framework evaluates Grad-CAM explanations from facial skin disease models for clinical relevance and trustworthiness.
Principles
- Model explanations need clinical grounding.
- LLMs can act as expert judges.
- Prompt engineering improves LLM evaluation.
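When several LLMs act as judges, their structured replies have to be validated and combined. The following is a minimal sketch, assuming each judge returns JSON with integer `localization` and `trustworthiness` scores from 1 to 5; the field names, the scale, and the mean/range summary are assumptions for illustration, not details taken from the study.

```python
import json
from statistics import mean

# Hypothetical score fields and scale; the source only states that the
# judges rate lesion localization and trustworthiness.
REQUIRED = ("localization", "trustworthiness")
MIN_SCORE, MAX_SCORE = 1, 5


def parse_reply(raw: str) -> dict:
    """Parse one judge's JSON reply and validate the required scores."""
    data = json.loads(raw)
    scores = {}
    for key in REQUIRED:
        value = data.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{key} must be an integer, got {value!r}")
        if not MIN_SCORE <= value <= MAX_SCORE:
            raise ValueError(f"{key}={value} is outside {MIN_SCORE}..{MAX_SCORE}")
        scores[key] = value
    return scores


def aggregate(replies: dict) -> dict:
    """Combine replies keyed by judge name into a mean and range per criterion."""
    parsed = [parse_reply(raw) for raw in replies.values()]
    summary = {}
    for key in REQUIRED:
        values = [p[key] for p in parsed]
        summary[key] = {"mean": mean(values), "range": max(values) - min(values)}
    return summary
```

A large range on one criterion flags an image where the judges disagree and a human review may be worthwhile.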
Method
Apply geometric, color-based, and mixed augmentation strategies to classifiers based on EfficientNet-B0, MobileNetV3, and ResNet18. Generate Grad-CAM explanations for their predictions. Use an LLM-as-a-Judge framework (GPT-5.5, Gemini 3.5 Flash, Claude Sonnet 4.6) with progressive prompt engineering to assess lesion localization and trustworthiness.
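For readers unfamiliar with Grad-CAM, the core computation is small: average each channel's gradient over the spatial grid to get a channel weight, take the weighted sum of the activation maps, and clip negatives with ReLU. The sketch below shows that step in dependency-free Python, assuming the activations and gradients of the chosen convolutional layer have already been extracted as nested lists shaped `[channels][height][width]`. The function name `grad_cam` and the final rescaling to 0..1 are illustrative choices, not details from the study.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM heatmap from one layer's activations and gradients.

    Both arguments are nested lists shaped [channels][height][width];
    `gradients` holds the gradient of the target class score with respect
    to `activations`.
    """
    channels = len(activations)
    height = len(activations[0])
    width = len(activations[0][0])

    # Channel weights: global average of each channel's gradient map.
    weights = [
        sum(sum(row) for row in gradients[c]) / (height * width)
        for c in range(channels)
    ]

    # Weighted sum of activation maps, followed by ReLU.
    cam = [
        [
            max(0.0, sum(weights[c] * activations[c][i][j] for c in range(channels)))
            for j in range(width)
        ]
        for i in range(height)
    ]

    # Rescale to 0..1 for display; an all-zero map is returned unchanged.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam
```

In a real pipeline the heatmap would then be upsampled to the input resolution and overlaid on the face image before being shown to the LLM judges.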
In practice
- Use LLMs for XAI evaluation.
- Integrate clinical knowledge into prompts.
- Test various augmentation strategies.
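The three augmentation families can be compared with a small dispatcher like the one below. This is a sketch only: the summary does not say which transforms the study used, so horizontal flipping (geometric) and brightness scaling (color-based) are stand-ins chosen for illustration, and images are represented as nested lists of RGB tuples to keep the example dependency-free.

```python
def hflip(image):
    """Geometric stand-in: mirror each row left to right."""
    return [list(reversed(row)) for row in image]


def scale_brightness(image, factor):
    """Color-based stand-in: multiply each channel by `factor`, clamped to 0..255."""
    return [
        [tuple(min(255, max(0, round(channel * factor))) for channel in pixel)
         for pixel in row]
        for row in image
    ]


def augment(image, strategy, factor=1.2):
    """Apply one augmentation strategy to an image of RGB tuples."""
    if strategy == "geometric":
        return hflip(image)
    if strategy == "color":
        return scale_brightness(image, factor)
    if strategy == "mixed":
        # Mixed: apply the geometric and the color-based transform together.
        return scale_brightness(hflip(image), factor)
    raise ValueError(f"unknown strategy: {strategy!r}")
```

Training one model per strategy and comparing the resulting Grad-CAM maps shows whether an augmentation changes where the model looks, not only how accurately it classifies.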
Topics
- LLM-as-a-Judge
- Explainable AI
- Grad-CAM
- Facial Skin Disease
- Medical Imaging
- Prompt Engineering
Best for: AI Scientist, Computer Vision Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.