LLM-Based Visual Explanation Evaluation Framework for Assessing the Explainability of Facial Skin Disease Classification Models

2026-06-15 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition, Health & Medical Research · Depth: Expert, quick

Summary

This study proposes a domain-specific, LLM-based Visual Explanation Evaluation Framework for assessing Grad-CAM explanations in facial skin disease diagnosis models. Prior work has focused on improving classification performance through data augmentation; this study instead examines systematically whether model explanations align with clinically relevant lesion regions. The authors applied geometric, color-based, and mixed augmentation strategies to facial skin disease classifiers based on EfficientNet-B0, MobileNetV3, and ResNet18, and used Grad-CAM to generate visual explanations of each model's decisions. An LLM-as-a-Judge framework using GPT-5.5, Gemini 3.5 Flash, and Claude Sonnet 4.6 then evaluated those explanations for lesion localization and trustworthiness. A progressive prompt engineering strategy, which adds evaluation rubrics, clinical knowledge, penalty rules, and structured output, was used to improve evaluation consistency and clinical grounding.

Key takeaway

If you are an AI scientist developing explainable AI for medical imaging, consider integrating an LLM-as-a-Judge framework to check visual explanations such as Grad-CAM. Doing so helps verify that a model's decisions rest on clinically relevant regions, which classification performance alone cannot show. Progressive prompt engineering with clinical rubrics and penalty rules can improve evaluation consistency and trustworthiness, which matters most when deploying models for sensitive diagnostic applications.

Key insights

An LLM-based framework evaluates Grad-CAM explanations for facial skin disease models, checking whether they are clinically relevant and trustworthy.

Principles

Model explanations need clinical grounding.
LLMs can act as expert judges.
Prompt engineering improves LLM evaluation.
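
As a rough illustration of the third principle, the sketch below assembles a "progressive" prompt by adding one component at a time, so each level can be compared against the previous one. It assumes the four components are added in the order the summary lists them (rubric, clinical knowledge, penalty rules, structured output). The stage texts, the 1 to 5 scale, the JSON field names, and the function names are all hypothetical placeholders; the source does not publish its prompts.

```python
# Hypothetical stage texts: the source names these four components but not
# their wording, so every string below is a placeholder.
STAGES = [
    ("rubric",
     "Score lesion localization and trustworthiness from 1 (poor) to 5 (excellent)."),
    ("clinical_knowledge",
     "Placeholder: describe where lesions of the predicted class typically appear."),
    ("penalty_rules",
     "Placeholder: lower the score when the heatmap highlights regions outside the lesion."),
    ("structured_output",
     'Reply only with JSON: {"localization": int, "trustworthiness": int, "reason": str}.'),
]

BASE = "You are evaluating a Grad-CAM heatmap overlaid on a facial skin image."


def build_prompt(level: int) -> str:
    """Return the base instruction plus the first `level` stages (0 to 4)."""
    if not 0 <= level <= len(STAGES):
        raise ValueError(f"level must be between 0 and {len(STAGES)}")
    parts = [BASE]
    for name, text in STAGES[:level]:
        parts.append(f"[{name}]\n{text}")
    return "\n\n".join(parts)


def prompt_ladder() -> list[str]:
    """One prompt per level, each adding a single component to the previous one."""
    return [build_prompt(level) for level in range(len(STAGES) + 1)]
```

Running the same Grad-CAM image through every prompt in `prompt_ladder()` would make it possible to see which added component changes the judge's scores.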

Method

Apply geometric, color-based, and mixed augmentation strategies to three classifiers (EfficientNet-B0, MobileNetV3, ResNet18). Generate Grad-CAM explanations for each. Use an LLM-as-a-Judge framework (GPT-5.5, Gemini 3.5 Flash, Claude Sonnet 4.6) with progressive prompt engineering to assess lesion localization and trustworthiness.
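
For readers less familiar with Grad-CAM, the core combination step can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the study's code: it assumes the feature maps and the gradients of the class score with respect to them have already been extracted from a convolutional layer, and it uses nested lists where a real pipeline would use a tensor library. The function name `grad_cam` and the final rescaling to [0, 1] (a common visualization convention) are choices made here.

```python
def grad_cam(activations, gradients):
    """Combine K feature maps into one heatmap.

    activations[k][i][j]: feature map k of the chosen convolutional layer.
    gradients[k][i][j]: gradient of the class score w.r.t. that activation.
    """
    height, width = len(activations[0]), len(activations[0][0])

    # Channel weight = global average of that channel's gradients.
    weights = [
        sum(sum(row) for row in grad) / (height * width)
        for grad in gradients
    ]

    # Weighted sum of the feature maps.
    heatmap = [[0.0] * width for _ in range(height)]
    for weight, fmap in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += weight * fmap[i][j]

    # ReLU keeps only positive evidence for the class.
    heatmap = [[max(0.0, value) for value in row] for row in heatmap]

    # Rescale to [0, 1] for display; skipped if the map is all zeros.
    peak = max(max(row) for row in heatmap)
    if peak > 0:
        heatmap = [[value / peak for value in row] for row in heatmap]
    return heatmap
```

The resulting map is normally upsampled to the input resolution and overlaid on the face image; that overlay is what the LLM judge is asked to assess.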

In practice

Use LLMs for XAI evaluation.
Integrate clinical knowledge into prompts.
Test various augmentation strategies.
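
In practice, using several LLM judges means turning each reply into numbers and combining them. The sketch below shows one possible way to do that, assuming each judge returns JSON with integer `localization` and `trustworthiness` scores on a 1 to 5 scale. The field names, the scale, the judge labels, and the choice to average are assumptions for illustration; the source does not describe how it aggregates scores.

```python
import json
from statistics import mean

FIELDS = ("localization", "trustworthiness")  # assumed field names


def parse_reply(raw: str) -> dict:
    """Extract integer scores from one judge's JSON reply, or raise ValueError."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("reply is not a JSON object")
    scores = {}
    for field in FIELDS:
        value = data.get(field)
        if type(value) is not int or not 1 <= value <= 5:
            raise ValueError(f"bad or missing score for {field!r}")
        scores[field] = value
    return scores


def aggregate(replies: dict) -> dict:
    """Average each score over the judges whose replies parsed cleanly."""
    parsed = []
    for raw in replies.values():
        try:
            parsed.append(parse_reply(raw))
        except ValueError:
            continue  # skip malformed replies instead of guessing a score
    if not parsed:
        raise ValueError("no judge returned a usable reply")
    return {field: mean(p[field] for p in parsed) for field in FIELDS}


# Made-up replies, for illustration only.
example = {
    "judge_a": '{"localization": 4, "trustworthiness": 3}',
    "judge_b": '{"localization": 5, "trustworthiness": 4}',
    "judge_c": "not json",
}
summary = aggregate(example)
```

Skipping malformed replies keeps one judge's formatting failure from distorting the average, at the cost of averaging over fewer judges for that image.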

Topics

LLM-as-a-Judge
Explainable AI
Grad-CAM
Facial Skin Disease
Medical Imaging
Prompt Engineering

Best for: AI Scientist, Computer Vision Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.