When Confidence Lacks Concepts: Interpretable OOD Detection via Representation Perturbations
Summary
This work proposes an interpretable Out-of-Distribution (OOD) detection framework to address the overgeneralization of deep neural networks in medical imaging, a critical issue for safe clinical deployment. The framework probes the stability of model predictions by applying class-conditioned semantic perturbations to intermediate representations. Sparse autoencoders (SAEs) trained on in-distribution data disentangle dense representations into sparse, semantically meaningful components, from which class-specific concept vectors are learned. During inference, deeper-layer representations are perturbed using the concept vectors linked to the model's predicted class, and the stability of the class logits is measured. The core hypothesis is that in-distribution samples exhibit low sensitivity to these perturbations because they are already aligned with class-specific semantic directions, while OOD samples show amplified deviations caused by representational misalignment. The approach yields both a discriminative OOD signal and an interpretable view into model uncertainty, making it suitable for high-stakes medical applications.
Key takeaway
For AI Scientists and Machine Learning Engineers developing models for safety-critical medical applications, this interpretable OOD detection framework offers a way to enhance trust. Consider integrating concept-conditioned stability analysis to identify out-of-distribution samples. The method provides both a clear OOD signal and an interpretable view into why your model might be uncertain, both of which matter for clinical deployment and regulatory compliance.
Key insights
Interpretable OOD detection can be achieved by analyzing prediction stability under concept-conditioned semantic perturbations.
Principles
- OOD samples show amplified deviations under concept perturbations.
- In-distribution samples exhibit low sensitivity to concept perturbations.
- Sparse autoencoders disentangle dense representations into sparse, semantically meaningful concept vectors.
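To make the disentangling principle concrete, here is a minimal sketch of a sparse autoencoder forward pass. It assumes the common formulation (a ReLU encoder and a linear decoder whose directions act as concept vectors); the summary does not specify the paper's exact architecture or training loss, and the names `sae_encode`, `sae_decode`, `W_enc`, and `W_dec` as well as the toy weights are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def sae_encode(h: Vector, W_enc: Matrix, b_enc: Vector) -> Vector:
    """Map a dense representation h to non-negative latent activations (ReLU)."""
    latents = []
    for row, bias in zip(W_enc, b_enc):
        pre_activation = sum(w * x for w, x in zip(row, h)) + bias
        latents.append(pre_activation if pre_activation > 0 else 0.0)
    return latents


def sae_decode(z: Vector, W_dec: Matrix, b_dec: Vector) -> Vector:
    """Reconstruct h as a weighted sum of decoder directions (one per latent)."""
    out = list(b_dec)
    for z_k, direction in zip(z, W_dec):
        for i, d in enumerate(direction):
            out[i] += z_k * d
    return out


# Toy weights (assumed, not learned): three latents over a 2-D representation.
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # each row is one concept direction
b_dec = [0.0, 0.0]

h = [2.0, 0.5]
z = sae_encode(h, W_enc, b_enc)           # the third latent stays inactive
h_reconstructed = sae_decode(z, W_dec, b_dec)
```

In a trained SAE, a sparsity penalty on `z` encourages only a few latents to fire per sample, so each active decoder direction can be inspected as a candidate concept.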
Method
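Learn class-specific concept vectors from in-distribution data using sparse autoencoders (SAEs). At inference, perturb deeper-layer representations with the concept vectors of the predicted class and measure the stability of the class logits; large logit shifts indicate an OOD sample.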
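The perturb-and-measure step can be sketched as follows. This is an illustrative instability score under stated assumptions, not the paper's exact formulation: the score is taken to be the mean absolute change in the predicted-class logit, the perturbation is additive with a fixed strength `alpha`, and `stability_score`, `toy_head`, and the concept vectors are all hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def stability_score(
    head: Callable[[Vector], List[float]],
    h: Vector,
    concepts_by_class: Dict[int, List[Vector]],
    alpha: float = 1.0,
) -> float:
    """Mean absolute shift of the predicted-class logit under concept perturbations.

    Larger values mean less stable predictions, read here as more OOD-like.
    """
    base_logits = head(h)
    predicted = max(range(len(base_logits)), key=base_logits.__getitem__)
    shifts = []
    for concept in concepts_by_class[predicted]:
        # Assumed perturbation: move h along the concept direction by alpha.
        h_perturbed = [x + alpha * d for x, d in zip(h, concept)]
        shifts.append(abs(head(h_perturbed)[predicted] - base_logits[predicted]))
    return sum(shifts) / len(shifts)


def toy_head(h: Vector) -> List[float]:
    """Stand-in for the network layers after the perturbed representation."""
    return [5.0 * math.tanh(x) for x in h]


# Hypothetical concept vectors: one direction per class.
concepts = {0: [[1.0, 0.0]], 1: [[0.0, 1.0]]}

in_dist_score = stability_score(toy_head, [3.0, 0.0], concepts)  # aligned with class 0
ood_score = stability_score(toy_head, [0.2, 0.1], concepts)      # weakly aligned
```

In this toy setup the aligned sample sits where the head is already saturated along its class direction, so the perturbation barely moves the logit, while the weakly aligned sample shifts much more. A detection threshold on the score would have to be chosen on held-out data.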
In practice
- Enhance trust in safety-critical AI systems.
- Improve reliability of medical imaging diagnostics.
- Gain insight into model uncertainty mechanisms.
Topics
- Out-of-Distribution Detection
- Interpretable AI
- Sparse Autoencoders
- Medical Imaging
- Model Uncertainty
- Deep Neural Networks
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.