From Isolation to Entanglement: When Do Interpretability Methods Identify and Disentangle Known Concepts?

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

A study investigated whether common interpretability featurization methods, including sparse autoencoders (SAEs) and sparse probes, recover concept representations that are disentangled and independently manipulable in neural networks. Using a natural language dataset of 382,884 sentences with controlled correlations among four concepts (voice, tense, sentiment, domain), the study evaluated Pythia-70M and Gemma-2-2B. These methods achieved high scores on correlational disentanglement metrics such as MCC and DCI-ES, but those scores did not predict selective manipulability in steering experiments: steering one concept often affected unrelated concepts, indicating widespread non-independence. Features operated on disjoint subspaces, yet that did not guarantee functional independence. Supervised probes consistently outperformed unsupervised SAEs, and correlations above 0.5 in the training data degraded concept identification for most methods. The study also noted that a single feature dimension is often insufficient to control a concept.
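To make "controlled correlations" between concepts concrete, here is a minimal Python sketch that samples labels for two balanced binary concepts (say, voice and tense) at a target Pearson correlation. It is an illustration under stated assumptions, not the study's data pipeline: the function name `sample_correlated_labels` is hypothetical, and it relies on the fact that for two balanced binary variables the correlation equals 2 × P(labels agree) − 1.

```python
import random


def sample_correlated_labels(n, rho, seed=0):
    """Sample n pairs of balanced binary labels with Pearson correlation near rho.

    Assumption (not from the study): both concepts are balanced (P = 0.5),
    in which case correlation = 2 * P(agree) - 1.
    """
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    rng = random.Random(seed)
    p_agree = (1.0 + rho) / 2.0
    pairs = []
    for _ in range(n):
        first = rng.randint(0, 1)
        # Copy the first label with probability p_agree, otherwise flip it.
        second = first if rng.random() < p_agree else 1 - first
        pairs.append((first, second))
    return pairs


# A correlation of 0.5 corresponds to the labels agreeing about 75% of the time.
pairs = sample_correlated_labels(20000, rho=0.5)
agreement = sum(a == b for a, b in pairs) / len(pairs)
```

At `rho=0.5`, the threshold the summary flags, one concept label already predicts the other three times out of four, which gives some intuition for why a featurizer trained on such data may struggle to separate the two.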

Key takeaway

For AI scientists and ML engineers evaluating interpretability methods, correlational disentanglement metrics such as MCC or DCI-ES are not sufficient on their own. Steering experiments may still reveal widespread non-independence, even when features appear to occupy disjoint subspaces. Incorporate multi-concept evaluations and counterfactual interventions to assess causal independence and selective manipulability, especially when concept correlations in your training data exceed 0.5. Also consider that a single feature dimension may not fully control a concept.
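One way to act on this is to tabulate, for each steered concept, how much every other concept moved. The sketch below is a hypothetical bookkeeping helper in Python, not the study's evaluation code: `steering_leakage` and the `effects` matrix layout are assumptions, and the numbers are made up purely for illustration.

```python
def steering_leakage(effects, concepts):
    """Summarize off-target effects of steering.

    effects[i][j] is the measured shift in concept j when steering concept i
    (hypothetical units, e.g. change in a probe's predicted probability).
    """
    report = {}
    for i, steered in enumerate(concepts):
        on_target = abs(effects[i][i])
        off_target = {
            concepts[j]: abs(effects[i][j])
            for j in range(len(concepts))
            if j != i
        }
        worst = max(off_target, key=off_target.get)
        report[steered] = {
            "on_target": on_target,
            "worst_off_target": worst,
            # Ratio near 0 means selective steering; near 1 means heavy leakage.
            "leakage_ratio": off_target[worst] / on_target if on_target else float("inf"),
        }
    return report


concepts = ["voice", "tense", "sentiment", "domain"]
# Made-up numbers for illustration only; these are NOT results from the study.
effects = [
    [0.80, 0.10, 0.40, 0.05],
    [0.05, 0.70, 0.10, 0.35],
    [0.20, 0.05, 0.90, 0.10],
    [0.10, 0.30, 0.05, 0.60],
]
report = steering_leakage(effects, concepts)
```

In this toy matrix, steering voice shifts sentiment by half as much as it shifts voice itself. A high correlational score could not expose that kind of leakage, because it never looks at interventions.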

Key insights

Correlational disentanglement metrics for interpretability methods do not guarantee independent concept manipulability during steering.

Principles

Method

The study introduced a multi-concept evaluation framework built on natural language data generated from a probabilistic context-free grammar (PCFG), with adjustable correlations between concepts. The framework measures identifiability with MCC and DCI-ES, and causal independence with steering interventions.
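As a rough illustration of the correlational side of such an evaluation, the following Python sketch computes an MCC-style score, reading MCC as the mean correlation coefficient (an assumption about the abbreviation): the average absolute Pearson correlation between each ground-truth concept and its best one-to-one matched feature. The function names are hypothetical, and the brute-force search over permutations is a stand-in for a proper assignment solver; it is only practical for a handful of concepts.

```python
from itertools import permutations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def mean_correlation_coefficient(features, concepts):
    """Return (score, assignment); assignment[k] is the feature matched to concept k.

    features and concepts are lists of columns, one list of values per dimension.
    """
    k = len(concepts)
    if len(features) < k:
        raise ValueError("need at least as many features as concepts")
    corr = [[abs(pearson(f, c)) for f in features] for c in concepts]
    best_score, best_assignment = -1.0, None
    # Brute force over one-to-one matchings; fine for a few concepts only.
    for assignment in permutations(range(len(features)), k):
        score = sum(corr[c][assignment[c]] for c in range(k)) / k
        if score > best_score:
            best_score, best_assignment = score, assignment
    return best_score, best_assignment


# Toy check: features are a sign-flipped and a rescaled copy of the concepts,
# in swapped order, so a perfect matching exists.
concepts = [[0, 0, 1, 1, 0, 1], [1, 0, 1, 0, 0, 1]]
features = [[-2 * y for y in concepts[1]], [3 * x + 1 for x in concepts[0]]]
score, assignment = mean_correlation_coefficient(features, concepts)
```

Because the score ignores sign, scale, and ordering, it rewards features that track the concepts in held-out data. It says nothing about what happens to the other concepts when one feature is intervened on.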

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.