Probing for Representation Manifolds in Superposition
Summary
The Manifold Probe is a supervised method for identifying representation manifolds in neural networks that exhibit superposition. It extends traditional linear regression probes in two steps: it first learns a feature space for a concept that can be linearly predicted from the network's internal representations, and then identifies the specific directions the network uses to encode those features. Researchers applied the Manifold Probe to representations of time and space in the Llama 2-7b model. In both cases the probe uncovered manifolds that linearly represent an interpretable set of features. For time, steering along the discovered manifold changed the model's predictions of release years for songs, movies, and books, which suggests a causal link between the manifold and model behavior.
Key takeaway
For research scientists investigating neural network interpretability, the Manifold Probe offers a way to study how models encode complex concepts. Consider applying it to probe for specific feature manifolds in your own models, especially when you want to establish causal links between internal representations and observable model behaviors such as factual recall or generation.
Key insights
The Manifold Probe generalizes linear regression probes to discover representation manifolds, and the steering results suggest that the manifolds it finds are causally relevant to model behavior.
Principles
- Superposition encodes multiple features in shared dimensions.
- Manifolds can represent interpretable feature sets.
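To make the first principle concrete, here is a minimal toy sketch of superposition, not taken from the original article. It assumes three hypothetical feature directions packed into a two-dimensional space at 120-degree spacing; the names `directions`, `encode`, and `readout` are illustrative only.

```python
import numpy as np

# Assumption: three feature directions share a 2-dimensional space,
# spaced 120 degrees apart so no two are orthogonal.
angles = np.deg2rad([0.0, 120.0, 240.0])
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (3, 2)

def encode(features):
    """Map 3 feature activations to a 2-d hidden vector by summing scaled directions."""
    return np.asarray(features, dtype=float) @ directions

def readout(hidden):
    """Linear readout: project the hidden vector onto each feature direction."""
    return directions @ hidden

hidden = encode([1.0, 0.0, 0.0])  # only feature 0 is active
scores = readout(hidden)          # feature 0 scores highest; the others pick up interference
```

Because the directions overlap, the inactive features receive a nonzero readout. That interference is the cost of storing more features than there are dimensions.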
Method
The Manifold Probe learns a feature space linearly predictable from representations, then identifies encoding directions. This generalizes linear regression probes to discover representation manifolds.
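For orientation, the sketch below shows the plain linear regression probe that the summary says the Manifold Probe generalizes. It is not the Manifold Probe itself, whose details are not given here. The data is synthetic, and `true_direction`, `fit_linear_probe`, and the noise level are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, d_model = 500, 16

# Assumption: a single unit direction encodes a scalar concept (for example, a year).
true_direction = rng.normal(size=d_model)
true_direction /= np.linalg.norm(true_direction)

# Synthetic "activations": concept value times the direction, plus small noise.
concept = rng.uniform(-1.0, 1.0, size=n_samples)
noise = 0.05 * rng.normal(size=(n_samples, d_model))
activations = concept[:, None] * true_direction[None, :] + noise

def fit_linear_probe(X, y):
    """Least-squares probe: returns weights w and bias b such that y ~ X @ w + b."""
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    coef, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    return coef[:-1], coef[-1]

w, b = fit_linear_probe(activations, concept)

# In-sample fit quality and alignment of the probe with the planted direction.
pred = activations @ w + b
r2 = 1.0 - np.sum((concept - pred) ** 2) / np.sum((concept - concept.mean()) ** 2)
cosine = float(w @ true_direction / np.linalg.norm(w))
```

A probe of this kind recovers one direction for one scalar target. According to the summary, the Manifold Probe goes further by learning the feature space first and then the directions that encode it.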
In practice
- Apply to Llama 2-7b for time/space representations.
- Steer along manifolds to influence model outputs.
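As a rough illustration of steering, the sketch below shifts a hidden vector along a single direction and checks how a linear readout changes. It is a simplification: the article describes steering along a manifold inside Llama 2-7b, while this example uses one straight direction and made-up numbers. `steer`, `probe_w`, and `hidden` are hypothetical.

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Shift a hidden vector by alpha along the unit-normalized direction."""
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

# Hypothetical probe weights and hidden state, chosen only for illustration.
probe_w = np.array([2.0, 0.0, 0.0, 0.0])
hidden = np.array([0.5, 1.0, -1.0, 0.25])

before = float(hidden @ probe_w)           # probe readout before steering
steered = steer(hidden, direction=probe_w, alpha=1.5)
after = float(steered @ probe_w)           # probe readout after steering
shift = after - before                     # equals alpha * ||probe_w|| for this setup
```

In a real model, the shifted vector would be written back into the network at the probed layer, and the effect would be judged by the change in the model's outputs rather than by the probe readout alone.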
Topics
- Manifold Probe
- Representation Manifolds
- Superposition
- Llama 2-7b
- Model Interpretability
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.