Query Lens: Interpreting Sparse Key-Value Features with Indirect Effects
Summary
Query Lens is a method for interpreting sparse features learned by sparse autoencoders, which are often hard to characterize reliably. It extends Logit Lens by jointly analyzing encoder-side key features and decoder-side value features, so it identifies both the inputs that activate a feature and the outputs the feature promotes. Query Lens also accounts for indirect, module-mediated effects that arise when subsequent modules process a feature, whereas Logit Lens captures only direct effects. Experimental results indicate that Query Lens produces coherent token signatures for features that Logit Lens could not interpret. The paper also introduces the Subspace Channel Hypothesis, which proposes that downstream modules access features via layer-specific subspaces.
Key takeaway
For machine learning engineers who use sparse autoencoders to study model internals, Query Lens offers a more complete interpretation framework than Logit Lens alone. Consider using it to characterize features that remain opaque under Logit Lens, especially when indirect, module-mediated effects are suspected. Doing so can yield more coherent token signatures and a clearer view of how features influence downstream processing, which helps when debugging and refining models.
Key insights
Query Lens improves sparse feature interpretability by analyzing both the direct effects of sparse autoencoder features and their indirect effects through downstream modules.
Principles
- Jointly consider key and value features.
- Account for module-mediated indirect effects.
- Hypothesis: downstream modules access features via layer-specific subspaces.
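One way to picture the subspace principle is to ask how much of a feature's value direction falls inside the subspace a given layer reads from. The sketch below is an illustration only, not the paper's procedure: the bases, the feature vector, and the helper names `project` and `visible_fraction` are all hypothetical, and the bases are assumed to be orthonormal.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def project(vector, basis):
    """Project vector onto the span of an orthonormal basis (assumption)."""
    projected = [0.0] * len(vector)
    for direction in basis:
        coeff = dot(vector, direction)
        projected = [p + coeff * d for p, d in zip(projected, direction)]
    return projected

def visible_fraction(vector, basis):
    """Fraction of the vector's squared norm that lies inside the subspace."""
    projected = project(vector, basis)
    return dot(projected, projected) / dot(vector, vector)

# Hypothetical orthonormal bases for the subspaces two layers read from.
layer_a_basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
layer_b_basis = [[0.0, 0.0, 1.0]]

# Made-up decoder-side (value) direction of one sparse feature.
feature_value = [3.0, 4.0, 0.0]

print(visible_fraction(feature_value, layer_a_basis))  # fully inside layer A's subspace
print(visible_fraction(feature_value, layer_b_basis))  # orthogonal to layer B's subspace
```

In this toy setup the same feature is fully visible to one layer and invisible to the other, which is the kind of layer-specific access the hypothesis describes.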
Method
Query Lens extends Logit Lens in two ways. It jointly considers encoder-side key features and decoder-side value features, identifying the inputs that activate a feature and the outputs it promotes. It also incorporates the indirect, module-mediated effects of downstream processing.
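The three views described above can be sketched on a toy model. This is a minimal illustration under strong assumptions, not the paper's implementation: the embedding, unembedding, and `downstream` matrices are made up, the downstream module is treated as a single linear map, and the helper names `matvec` and `top_tokens` are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def top_tokens(scores, vocab, k=2):
    """Return the k vocabulary entries with the highest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [vocab[i] for i in ranked[:k]]

# Toy model: hidden size 3, vocabulary of 4 tokens. All values are made up.
vocab = ["cat", "dog", "run", "jump"]
embed = [  # one row per token (hypothetical input embedding)
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
]
unembed = [  # one row per token (hypothetical output unembedding)
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.9],
]
downstream = [  # hypothetical linear stand-in for a later module
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]

key = [1.0, 0.0, 0.0]    # encoder-side direction of one sparse feature
value = [0.0, 1.0, 0.0]  # decoder-side direction of the same feature

# Key side: which input tokens align with the feature's encoder direction.
key_scores = matvec(embed, key)
# Value side, direct effect (Logit Lens style): project value onto the vocabulary.
direct_scores = matvec(unembed, value)
# Value side, indirect effect: pass value through the module, then project.
indirect_scores = matvec(unembed, matvec(downstream, value))

print("activated by:", top_tokens(key_scores, vocab))
print("directly promotes:", top_tokens(direct_scores, vocab, k=1))
print("indirectly promotes:", top_tokens(indirect_scores, vocab))
```

In this toy case the direct projection and the module-mediated projection favor different tokens, which shows why a direct-only reading can miss part of a feature's effect.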
In practice
- Characterize previously uninterpretable features.
- Gain deeper insights into sparse autoencoder behavior.
Topics
- Sparse Autoencoders
- Feature Interpretability
- Query Lens
- Logit Lens
- Model Understanding
- Subspace Channel Hypothesis
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.