When Confidence Lacks Concepts: Interpretable OOD Detection via Representation Perturbations

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

This work proposes an interpretable out-of-distribution (OOD) detection framework that targets the overgeneralization of deep neural networks in medical imaging, a critical obstacle to safe clinical deployment. The framework probes the stability of model predictions by applying class-conditioned semantic perturbations to intermediate representations. Sparse autoencoders (SAEs) trained on in-distribution data disentangle dense representations into sparse, semantically meaningful components, from which class-specific concept vectors are learned. During inference, deeper-layer representations are perturbed with the concept vectors linked to the model's predicted class, and the stability of the class logits is measured. The core hypothesis is that in-distribution samples are largely insensitive to these perturbations because their representations already align with class-specific semantic directions, whereas OOD samples show amplified logit deviations because their representations are misaligned with those directions. The approach yields both a discriminative OOD signal and an interpretable view into model uncertainty, which suits it to high-stakes medical applications.
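One way to make the stability measurement concrete is the score below. It is only an illustrative formalization: the summary does not give the paper's exact definition, so the symbols $h(x)$ (intermediate representation), $f$ (the remaining layers, mapping a representation to logits), $\hat{y}$ (predicted class), $v_k^{(\hat{y})}$ (the $K$ concept vectors for that class), $\alpha$ (perturbation magnitude), and $\tau$ (decision threshold) are all assumptions introduced here.

```latex
\hat{y} = \arg\max_{c} f_c\big(h(x)\big), \qquad
S(x) = \frac{1}{K} \sum_{k=1}^{K}
  \left| f_{\hat{y}}\big(h(x) + \alpha\, v_k^{(\hat{y})}\big) - f_{\hat{y}}\big(h(x)\big) \right|, \qquad
x \text{ is flagged as OOD if } S(x) > \tau .
```

Under the stated hypothesis, $S(x)$ would stay small for in-distribution inputs and grow for OOD inputs.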

Key takeaway

For AI scientists and machine learning engineers developing models for safety-critical medical applications, this interpretable OOD detection framework offers a way to strengthen trust in model predictions. Consider integrating concept-conditioned stability analysis to identify out-of-distribution samples. The method provides both an OOD signal and an interpretable view into why the model might be uncertain, both of which matter for clinical deployment and regulatory compliance.

Key insights

Interpretable OOD detection can be achieved by analyzing prediction stability under concept-conditioned semantic perturbations.

Principles

Method

Learn class-specific concept vectors from in-distribution data using sparse autoencoders (SAEs). At inference, perturb deeper-layer representations with the concept vectors of the predicted class and measure the stability of the class logits to detect OOD samples.
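The two steps of the method can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `class_concept_vectors` and `instability_score`, the rule of picking the `top_k` most active SAE latents per class, the additive perturbation with magnitude `alpha`, and the mean absolute logit change as the score are all hypothetical choices. The SAE codes and decoder are assumed to be precomputed, and `head` stands in for whatever layers map the perturbed representation to logits.

```python
import numpy as np

def class_concept_vectors(codes, labels, decoder, top_k=2):
    """Assumed selection rule: per class, keep the SAE decoder directions
    whose latents have the highest mean activation on that class.

    codes:   (n, m) sparse SAE activations for in-distribution samples
    labels:  (n,) integer class labels
    decoder: (m, d) SAE decoder matrix; row j is the direction of latent j
    """
    vectors = {}
    for c in np.unique(labels):
        mean_act = codes[labels == c].mean(axis=0)       # (m,) mean activation per latent
        top = np.argsort(mean_act)[::-1][:top_k]         # indices of the most active latents
        dirs = decoder[top]                              # (top_k, d)
        # Normalize so that alpha alone controls the perturbation size.
        vectors[int(c)] = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
    return vectors

def instability_score(h, head, concept_vectors, alpha=1.0):
    """Mean absolute change of the predicted-class logit when the
    representation h is shifted along that class's concept vectors.

    h:    (d,) intermediate representation of one input
    head: callable mapping a (d,) representation to a (num_classes,) logit vector
    """
    logits = head(h)
    pred = int(np.argmax(logits))
    deltas = [
        abs(head(h + alpha * v)[pred] - logits[pred])
        for v in concept_vectors[pred]
    ]
    return pred, float(np.mean(deltas))
```

In use, a threshold on the returned score would be calibrated on held-out in-distribution data, and inputs scoring above it would be flagged as OOD. Inspecting which concept vectors produce the largest logit change is one way to read the interpretable side of the signal.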

Topics

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.