Comparing Linear Probes with Mahalanobis Cosine Similarity
Summary
This research introduces Mahalanobis Cosine Similarity (MCS) as a task-aware refinement for comparing linear probes, commonly used in interpretability studies. Building on Ying et al.'s 2026 finding that a probe's MCS to a reference probe near-perfectly predicts its out-of-distribution (OOD) AUROC (R^2 = 0.98), this work extends the empirical validation across various models, layers, and concept domains. The authors provide a closed-form proof demonstrating that for balanced classes with Gaussian projections, OOD AUROC and MCS exhibit a linear relationship because both are sigmoid functions of the probe's signal-to-noise ratio (SNR) on test data. The study also theoretically predicts and empirically verifies conditions under which this linearity breaks down, positioning MCS as a theoretically grounded and effective alternative to Euclidean cosine similarity.
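The Gaussian setting behind the proof can be checked numerically. The sketch below is illustrative only and is not taken from the paper: it assumes SNR is defined as the gap between class means divided by the root of the summed class variances, and it uses the standard result that AUROC for two Gaussian score distributions is the normal CDF (a sigmoid-shaped function) of that ratio. The function names `gaussian_auroc` and `empirical_auroc` are hypothetical.

```python
import math
import numpy as np

def gaussian_auroc(mu_neg, mu_pos, sigma_neg, sigma_pos):
    """Closed-form AUROC when both classes project to Gaussians.

    Assumption: SNR = (mu_pos - mu_neg) / sqrt(sigma_neg^2 + sigma_pos^2).
    """
    snr = (mu_pos - mu_neg) / math.sqrt(sigma_neg ** 2 + sigma_pos ** 2)
    # Standard normal CDF evaluated at the SNR, written with erf.
    return 0.5 * (1.0 + math.erf(snr / math.sqrt(2.0)))

def empirical_auroc(neg_scores, pos_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    neg_sorted = np.sort(neg_scores)
    strictly_below = np.searchsorted(neg_sorted, pos_scores, side="left")
    below_or_equal = np.searchsorted(neg_sorted, pos_scores, side="right")
    wins = strictly_below + 0.5 * (below_or_equal - strictly_below)
    return float(wins.sum() / (len(pos_scores) * len(neg_scores)))

# Balanced synthetic classes with Gaussian projections (made-up parameters).
rng = np.random.default_rng(0)
neg = rng.normal(0.0, 1.0, size=20000)
pos = rng.normal(1.5, 1.0, size=20000)

closed_form = gaussian_auroc(0.0, 1.5, 1.0, 1.0)
sampled = empirical_auroc(neg, pos)
print(f"closed form: {closed_form:.4f}, sampled: {sampled:.4f}")
```

With these synthetic parameters the two numbers should agree closely. How the paper connects this quantity to MCS is not reproduced here.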
Key takeaway
For AI scientists evaluating linear probes in interpretability research, consider Mahalanobis Cosine Similarity (MCS) instead of Euclidean cosine similarity. MCS is task-aware: a probe's MCS to a reference probe empirically predicts its out-of-distribution AUROC almost perfectly (R^2 = 0.98), and the linear relationship is proven in closed form for balanced classes with Gaussian projections. Outside those conditions the linearity can break down, so check them before relying on MCS as a proxy for OOD performance.
Key insights
Mahalanobis Cosine Similarity (MCS) offers a theoretically grounded, empirically effective method for comparing linear probes, linearly predicting OOD AUROC.
Principles
- MCS reweights inner products using test data covariance.
- OOD AUROC and MCS are linearly related for balanced classes with Gaussian projections.
- The linearity arises because both are sigmoid functions of the probe's SNR on test data.
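The reweighting described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes MCS is the cosine similarity under the inner product induced by a covariance matrix Sigma estimated from test activations, that is, u^T Sigma v / sqrt((u^T Sigma u)(v^T Sigma v)). The summary does not say which covariance estimate the paper uses (pooled, within-class, or otherwise), so the plain sample covariance here is an assumption, and all function names are hypothetical.

```python
import numpy as np

def covariance_from_activations(activations):
    """Sample covariance of test activations, shape (n_samples, d) -> (d, d)."""
    return np.cov(activations, rowvar=False)

def euclidean_cosine(u, v):
    """Ordinary cosine similarity between two probe directions."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def mahalanobis_cosine(u, v, cov):
    """Cosine similarity under the covariance-weighted inner product (assumed form)."""
    numerator = u @ cov @ v
    denominator = np.sqrt((u @ cov @ u) * (v @ cov @ v))
    return float(numerator / denominator)

# Toy example: the first feature carries far more variance than the second.
cov = np.diag([100.0, 1.0])
reference_probe = np.array([1.0, 0.0])
candidate_probe = np.array([1.0, 1.0])

print("Euclidean cosine:  ", euclidean_cosine(reference_probe, candidate_probe))
print("Mahalanobis cosine:", mahalanobis_cosine(reference_probe, candidate_probe, cov))
```

In this toy case the two probes look only moderately aligned under Euclidean cosine, but nearly identical once directions are weighted by how much the data varies along them, because their disagreement lies in a low-variance direction.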
In practice
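- Use MCS rather than Euclidean cosine similarity when comparing linear probes in interpretability work.
- Use a probe's MCS to a reference probe to predict its OOD AUROC, most reliably for balanced classes with Gaussian projections.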
Topics
- Linear Probes
- Mahalanobis Cosine Similarity
- Interpretability Research
- Out-of-Distribution Detection
- AUROC
- Signal-to-Noise Ratio
Best for: Research Scientist, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.