“Features” aren’t always the true computational primitives of a model, but that might be fine anyways
Summary
The concept of a "feature" in mechanistic interpretability is often debated, and three primary definitions have emerged: computational primitives, salient properties of the input, and objects assumed to exist for the sake of theoretical discussion. Computational primitives are challenging to reverse-engineer, as illustrated by the modular addition network example, where understanding of the MLP's function progressed from sin/cos factors to a trigonometric integral. Salient properties are the practical definition: input characteristics that the model represents and that can be manipulated to alter its behavior, as seen in SAEs. The article proposes that "features" exist on a spectrum from pure memorization, through case analysis (equivalence partitioning), to true computational primitives. This spectrum is demonstrated using a toy vision-language model classifying "bleggs vs rubes": the model could represent two properties (redness, cubeness), four object types, or even memorize individual data points, and each is a valid "feature" interpretation depending on the context.
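To make the spectrum concrete, here is a minimal Python sketch of the blegg/rube example described above. It is not taken from the original article: the dataset, the labeling rule (an object counts as a "rube" only if it is both red and a cube), and every function name are assumptions chosen for illustration. It shows three feature sets of different sizes that all reproduce the same classification.

```python
from itertools import product

# Hypothetical toy dataset (an assumption, not the article's data):
# eight objects, each with two binary properties.
objects = [
    {"id": i, "red": r, "cube": c}
    for i, (r, c) in enumerate(list(product([0, 1], repeat=2)) * 2)
]

def label(obj):
    # Assumed ground-truth rule: a rube is both red and a cube.
    return "rube" if obj["red"] and obj["cube"] else "blegg"

# View 1: two property features (redness, cubeness) plus a threshold.
def property_features(obj):
    return (obj["red"], obj["cube"])

def classify_by_properties(obj):
    red, cube = property_features(obj)
    return "rube" if red + cube > 1.5 else "blegg"

# View 2: case analysis, one one-hot feature per object type.
TYPES = [(0, 0), (0, 1), (1, 0), (1, 1)]
TYPE_LABELS = ("blegg", "blegg", "blegg", "rube")

def type_features(obj):
    return tuple(int((obj["red"], obj["cube"]) == t) for t in TYPES)

def classify_by_type(obj):
    return TYPE_LABELS[type_features(obj).index(1)]

# View 3: pure memorization, one "feature" per individual data point.
MEMORIZED = {obj["id"]: label(obj) for obj in objects}

def classify_by_memory(obj):
    return MEMORIZED[obj["id"]]

def feature_counts():
    # Number of features each view needs on this dataset.
    return {
        "properties": 2,
        "object_types": len(TYPES),
        "memorized": len(objects),
    }

def all_views_agree():
    # Every view reproduces the assumed labels on every object.
    return all(
        classify_by_properties(o)
        == classify_by_type(o)
        == classify_by_memory(o)
        == label(o)
        for o in objects
    )
```

On this toy data the three views are behaviorally identical, so which one counts as "the" feature set depends on the goal: the two-property view is the most compact description, while the memorization view grows with the dataset.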
Key takeaway
If you are a research scientist grappling with feature definitions in mechanistic interpretability, recognize that "features" are not monolithic but exist on a spectrum from memorized data points to abstract computational primitives. This understanding should guide your interpretability efforts, letting you select the feature interpretation best suited to your analytical goal, whether that is debugging, formal verification, or understanding model generalization. Avoid rigid definitions; focus instead on what computations neural networks implement and how each component contributes to a compact description of that computation.
Key insights
Features in mechanistic interpretability exist on a spectrum from memorization to computational primitives.
Principles
- No single "true" feature definition exists.
- Features are useful if they reduce loss.
- Interpretability benefits from case analysis.
In practice
- Consider the feature spectrum when interpreting models.
- Use case analysis for debugging model behavior.
- Recognize memorized facts as valid features.
Topics
- Mechanistic Interpretability
- Neural Network Features
- Sparse Autoencoders
- Feature Spectrum
- Model Behavior Analysis
Best for: Research Scientist, AI Researcher, AI Scientist, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by AI Alignment Forum.