Comparing Linear Probes with Mahalanobis Cosine Similarity

Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

This research presents Mahalanobis Cosine Similarity (MCS) as a task-aware refinement of Euclidean cosine similarity for comparing linear probes, which are widely used in interpretability studies. It builds on Ying et al.'s 2026 finding that a probe's MCS to a reference probe near-perfectly predicts its out-of-distribution (OOD) AUROC (R^2 = 0.98), and extends the empirical validation across models, layers, and concept domains. The authors give a closed-form proof that, for balanced classes with Gaussian projections, OOD AUROC and MCS are linearly related because both are sigmoid functions of the probe's signal-to-noise ratio (SNR) on test data. The study also predicts theoretically, and verifies empirically, the conditions under which this linearity breaks down. Together these results position MCS as a theoretically grounded and effective alternative to Euclidean cosine similarity.
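
The summary does not spell out how MCS is computed, so the following is only a minimal sketch under a stated assumption: MCS is taken to be the cosine between two probe weight vectors under the inner product u^T Σ v, where Σ is the covariance of the activations. The paper's exact definition may differ (for example, in which data Σ is estimated on). The function names, the toy activations, and the probe vectors are all hypothetical.

```python
import numpy as np

def euclidean_cosine(u, v):
    """Ordinary cosine similarity between two probe weight vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def mahalanobis_cosine(u, v, cov):
    """Cosine under the inner product <u, v> = u^T cov v (assumed form of MCS)."""
    num = u @ cov @ v
    den = np.sqrt((u @ cov @ u) * (v @ cov @ v))
    return float(num / den)

rng = np.random.default_rng(0)

# Hypothetical activations: feature 0 has large variance, feature 1 almost none.
X = rng.normal(size=(5000, 2)) * np.array([10.0, 0.1])
cov = np.cov(X, rowvar=False)

# Two hypothetical probes that differ only along the low-variance feature.
w_ref = np.array([1.0, 0.0])
w_probe = np.array([1.0, 5.0])

euc = euclidean_cosine(w_ref, w_probe)
mcs = mahalanobis_cosine(w_ref, w_probe, cov)

# Under this assumed form, MCS equals the Pearson correlation between the
# two probes' scores on the same activations.
corr = float(np.corrcoef(X @ w_ref, X @ w_probe)[0, 1])

print(f"Euclidean cosine:   {euc:.3f}")
print(f"Mahalanobis cosine: {mcs:.3f}")
print(f"Score correlation:  {corr:.3f}")
```

In this toy setup the two probes look dissimilar under Euclidean cosine, yet they score the data almost identically because they differ only along a direction with negligible variance. That is the sense in which a covariance-weighted cosine is data-aware where the Euclidean one is not.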

Key takeaway

If you evaluate linear probes in interpretability research, consider using Mahalanobis Cosine Similarity (MCS) instead of Euclidean cosine similarity. MCS is a task-aware metric: a probe's MCS to a reference probe has been found to predict its out-of-distribution AUROC almost perfectly (R^2 = 0.98), and the linear relationship is proven for balanced classes with Gaussian projections. Outside those conditions the linearity can break down, so check them before treating MCS as a proxy for OOD performance.
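
The Gaussian condition can be made concrete with a standard result that does not depend on the paper: if positive and negative probe scores are Gaussian with equal variance σ^2 and means μ1 and μ0, then AUROC = Φ(d / sqrt(2)), where d = (μ1 − μ0) / σ and Φ is the standard normal CDF. The sketch below checks this numerically. Treating d as "the SNR" is an assumption made for illustration; the paper's SNR definition may differ, and the helper names are hypothetical.

```python
import numpy as np
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def empirical_auroc(pos, neg):
    """AUROC via the Mann-Whitney U statistic (assumes no tied scores)."""
    scores = np.concatenate([pos, neg])
    ranks = scores.argsort().argsort() + 1  # 1-based ranks over all scores
    u = ranks[: len(pos)].sum() - len(pos) * (len(pos) + 1) / 2
    return float(u / (len(pos) * len(neg)))

rng = np.random.default_rng(0)
n = 20000  # balanced classes: n positives and n negatives

results = []
for snr in [0.5, 1.0, 2.0]:
    pos = rng.normal(loc=snr, scale=1.0, size=n)  # positive-class scores
    neg = rng.normal(loc=0.0, scale=1.0, size=n)  # negative-class scores
    results.append((snr, empirical_auroc(pos, neg), normal_cdf(snr / sqrt(2.0))))

for snr, emp, closed in results:
    print(f"SNR={snr:.1f}  empirical AUROC={emp:.3f}  closed form={closed:.3f}")
```

The closed form is a sigmoid-shaped function of d, consistent with the summary's description of AUROC as a sigmoid function of the probe's SNR. When the scores are clearly non-Gaussian or the classes are heavily imbalanced, this check is a quick way to see whether the assumptions behind the linear AUROC-MCS relationship are plausible for your data.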

Key insights

Mahalanobis Cosine Similarity (MCS) is a theoretically grounded, empirically effective way to compare linear probes: a probe's MCS to a reference probe linearly predicts its OOD AUROC.

Best for: Research Scientist, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.