Probing for Representation Manifolds in Superposition

2026-05-18 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

The Manifold Probe is a supervised method for identifying representation manifolds in neural networks that exhibit superposition. It generalizes linear regression probes in two steps: it first learns a feature space for a concept that can be linearly predicted from the network's internal representations, then identifies the directions the network uses to encode those features. The researchers applied the probe to representations of time and space in Llama 2-7b and recovered manifolds that linearly represent an interpretable set of features for both. For time, steering along the discovered manifold shifted the model's predicted release years for songs, movies, and books, which suggests the manifold is causally linked to model behavior.

Key takeaway

For research scientists investigating neural network interpretability, the Manifold Probe offers a way to study how models encode complex concepts. Consider applying it to probe for feature manifolds in your own models, especially when you want to establish a causal link between internal representations and observable behavior such as factual recall or generation.

Key insights

The Manifold Probe generalizes linear regression probes to discover representation manifolds, and steering along the time manifold indicates that what it finds is causally relevant to model behavior.

Principles

Superposition encodes multiple features in shared dimensions.
Manifolds can represent interpretable feature sets.
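
The first principle can be illustrated with a toy example. This is a minimal sketch, not taken from the original article: the sizes, the random directions, and the choice of active features are all illustrative assumptions. It packs more feature directions than dimensions into one vector and shows that a linear read-out returns each activation plus interference from the other, non-orthogonal directions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 64, 16          # more features than dimensions (illustrative sizes)

# One unit-norm direction per feature; they cannot all be orthogonal in 16 dimensions.
directions = rng.normal(size=(n_features, n_dims))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Sparse input: only two features are active (indices chosen arbitrarily).
activations = np.zeros(n_features)
activations[[3, 40]] = 1.0

# Superposed representation: a sum of the active features' directions.
hidden = activations @ directions

# Linear read-out along each direction = true activation + interference.
readout = directions @ hidden
interference = readout - activations
```

The interference term is what makes probing under superposition harder than probing a model with one feature per dimension.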

Method

The Manifold Probe learns a feature space that is linearly predictable from the model's representations, then identifies the directions that encode those features. This generalizes linear regression probes to discover representation manifolds.
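
The two steps can be sketched on synthetic data. This is a rough illustration under stated assumptions, not the authors' procedure: the summary does not say how the feature space is learned, so a fixed polynomial basis (the hypothetical `concept_features`) stands in for it, and the function names, ridge penalty, and synthetic activations are all invented for the example.

```python
import numpy as np

def concept_features(y, degree=2):
    """Hypothetical fixed basis: centered powers of the standardized concept value."""
    z = (y - y.mean()) / y.std()
    F = np.stack([z ** p for p in range(1, degree + 1)], axis=1)
    return F - F.mean(axis=0)

def fit_probe(X, F, ridge=1e-3):
    """Step 1: ridge regression from activations to features (read-out weights)."""
    Xc = X - X.mean(axis=0)
    gram = Xc.T @ Xc + ridge * np.eye(Xc.shape[1])
    return np.linalg.solve(gram, Xc.T @ F)          # shape (d, k)

def fit_encoding_directions(X, F):
    """Step 2: least-squares regression from features back to activations."""
    Xc = X - X.mean(axis=0)
    D, *_ = np.linalg.lstsq(F, Xc, rcond=None)      # shape (k, d)
    return D

# Synthetic stand-in for cached activations (assumption: no real model is loaded).
rng = np.random.default_rng(0)
n, d, k = 500, 16, 2
years = rng.uniform(1950, 2020, size=n)
F = concept_features(years, degree=k)
true_directions = rng.normal(size=(k, d))
X = F @ true_directions + 0.01 * rng.normal(size=(n, d))

W = fit_probe(X, F)                   # read-out: activations -> features
D = fit_encoding_directions(X, F)     # encoding: features -> activations
F_hat = (X - X.mean(axis=0)) @ W
r2 = 1.0 - ((F - F_hat) ** 2).sum() / (F ** 2).sum()
```

With `degree=1` the sketch reduces to an ordinary linear regression probe on the concept value, which is the sense in which a richer feature space generalizes it.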

In practice

Apply to Llama 2-7b for time/space representations.
Steer along manifolds to influence model outputs.
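
Steering along a manifold can be sketched as moving a hidden state from the manifold point for one concept value to the point for another. This is a minimal sketch under assumptions: the feature map `year_features`, the random `directions`, the 8-dimensional hidden state, and the `alpha` scale are hypothetical, and the summary does not describe the exact intervention used on Llama 2-7b.

```python
import numpy as np

def manifold_point(value, directions, feature_fn):
    """Point on the manifold for a concept value: features(value) @ directions."""
    return feature_fn(value) @ directions

def steer(hidden, source_value, target_value, directions, feature_fn, alpha=1.0):
    """Shift a hidden state from the source value's manifold point toward the target's."""
    delta = (manifold_point(target_value, directions, feature_fn)
             - manifold_point(source_value, directions, feature_fn))
    return hidden + alpha * delta

def year_features(year):
    """Hypothetical feature map for a release year (illustrative scaling)."""
    t = (year - 1990.0) / 20.0
    return np.array([t, t ** 2])

# Toy setup: random encoding directions and a hidden state near the 1975 point.
rng = np.random.default_rng(0)
directions = rng.normal(size=(2, 8))
hidden = manifold_point(1975, directions, year_features) + 0.1 * rng.normal(size=8)

steered = steer(hidden, 1975, 2005, directions, year_features)
```

The shift changes only the on-manifold component, so the off-manifold remainder of the hidden state is left as it was. In a real model this would be applied to the activations at the probed layer during the forward pass.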

Topics

Manifold Probe
Representation Manifolds
Superposition
Llama 2-7b
Model Interpretability

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.