Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders
Summary
This study of Sparse Autoencoders (SAEs) investigates how reproducible learned features are across training runs. It introduces "feature stability", a metric that estimates the probability that a similar feature reappears in an independently trained SAE. In a large-scale analysis spanning seeds, models, layers, dictionary sizes, and SAE variants, the research identifies a functional asymmetry: stable features are crucial for reconstruction and prediction, while unstable features contribute minimally and are linked to low-frequency surface-form triggers. Geometrically, unstable features are individually non-reproducible but lie within reproducible lower-rank subspaces, which indicates that seed dependence stems from basis ambiguity rather than random noise. A synthetic model demonstrates that low-rank ground-truth features are recoverable at the subspace level even when individual SAE latents are not identifiable across seeds. The work also shows that pooling unique cross-seed features can create more stable SAEs while maintaining explained variance.
Key takeaway
Machine Learning Engineers who use Sparse Autoencoders to interpret neural network representations should check feature stability before drawing conclusions. Prioritize stable features, since they carry the primary functional signal for reconstruction and prediction. Treat individually unstable features as a likely sign of basis ambiguity within reproducible subspaces rather than random noise. Consider pooling unique cross-seed features to construct more stable SAEs, so that interpretations stay consistent across training runs.
Key insights
Unstable SAE features are individually non-reproducible, yet they reflect shared low-dimensional structure: their seed dependence comes from basis ambiguity, not noise.
Principles
- Feature stability separates functional from weak-impact SAE features.
- Seed dependence in SAEs often reflects basis ambiguity.
- Low-rank features can be recovered at the subspace level.
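The subspace-level claim can be illustrated with a small NumPy sketch. It is not the study's procedure: the overlap score (mean squared cosine of the principal angles), the toy dimensions, and the names `subspace_overlap`, `run_a`, and `run_b` are assumptions made for illustration. Two simulated "runs" each pick a different basis for the same rank-3 subspace, so their individual vectors differ while their spans coincide.

```python
import numpy as np

def subspace_overlap(a, b):
    """Mean squared cosine of the principal angles between span(a) and span(b).

    a, b: (k, d) arrays whose rows span each subspace.
    Returns 1.0 for identical subspaces and roughly k/d for unrelated ones.
    """
    qa, _ = np.linalg.qr(a.T)  # (d, k) orthonormal basis for span(a)
    qb, _ = np.linalg.qr(b.T)  # (d, k) orthonormal basis for span(b)
    # Singular values of qa^T qb are the cosines of the principal angles.
    cosines = np.linalg.svd(qa.T @ qb, compute_uv=False)
    return float(np.mean(cosines ** 2))

rng = np.random.default_rng(0)
basis = rng.normal(size=(3, 32))          # hypothetical ground-truth rank-3 subspace
run_a = rng.normal(size=(3, 3)) @ basis   # each "seed" lands on its own basis
run_b = rng.normal(size=(3, 3)) @ basis
unrelated = rng.normal(size=(3, 32))      # directions outside the shared subspace

same = subspace_overlap(run_a, run_b)
different = subspace_overlap(run_a, unrelated)
print(f"same subspace: {same:.3f}, unrelated: {different:.3f}")
```

A per-vector comparison of `run_a` and `run_b` would report mismatches, while the overlap score shows the two runs span the same subspace. That is the sense in which features can be recoverable at the subspace level without being identifiable individually.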
Method
Estimate the stability of each feature as the probability that a similar feature reappears in an independently trained SAE, for example one trained with a different random seed. Construct more stable SAEs by pooling the unique features found across seeds.
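One way to make the stability estimate concrete is sketched below. The summary does not specify how similarity is measured, so this sketch assumes cosine similarity between decoder directions and a cutoff of 0.7; both choices, the function name `feature_stability`, and the toy data are assumptions rather than the study's definition. Stability is taken as the fraction of other runs that contain at least one matching feature.

```python
import numpy as np

def unit_rows(w):
    """Scale each row (one feature direction) to unit length."""
    return w / np.linalg.norm(w, axis=1, keepdims=True)

def feature_stability(reference, others, threshold=0.7):
    """Fraction of other runs that contain a match for each reference feature.

    reference: (n_features, d_model) decoder directions from one run.
    others: list of (m_features, d_model) decoders from other seeds.
    threshold: assumed cosine-similarity cutoff for calling two features similar.
    """
    ref = unit_rows(reference)
    matched = []
    for other in others:
        sims = ref @ unit_rows(other).T            # pairwise cosine similarities
        matched.append(sims.max(axis=1) >= threshold)
    return np.mean(matched, axis=0)

# Toy data: 4 directions shared by every run plus 4 run-specific random ones.
shared = np.random.default_rng(0).normal(size=(4, 64))

def toy_run(seed):
    r = np.random.default_rng(seed)
    noisy_shared = shared + 0.01 * r.normal(size=shared.shape)
    return np.vstack([noisy_shared, r.normal(size=(4, 64))])

runs = [toy_run(seed) for seed in (1, 2, 3)]
stability = feature_stability(runs[0], runs[1:])
print(stability)
```

In this toy setup the first four features score high and the run-specific ones score low. With real SAEs the score depends on the chosen threshold and on how many seeds are compared, so both should be reported alongside it.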
In practice
- Identify and filter unstable SAE features for clearer interpretation.
- Improve SAE stability by aggregating features across runs.
- Focus interpretation on stable features for reliable insights.
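Aggregating features across runs could start with a deduplication step like the sketch below. The summary does not describe the pooling procedure, so this is only one plausible reading: decoder directions from several seeds are merged greedily, and a direction is skipped when it is too similar to one already kept. The cosine cutoff of 0.7, the name `pool_unique_features`, and the toy data are assumptions. How the pooled dictionary is then turned into a working SAE is not covered here.

```python
import numpy as np

def pool_unique_features(decoders, threshold=0.7):
    """Greedily merge decoder directions from several runs, skipping near-duplicates.

    decoders: list of (n_features, d_model) arrays, one per seed.
    threshold: assumed cosine-similarity cutoff above which two directions
               count as the same feature.
    """
    pooled = []
    for decoder in decoders:
        normalized = decoder / np.linalg.norm(decoder, axis=1, keepdims=True)
        for direction in normalized:
            # Keep the direction only if nothing similar was kept earlier.
            if all(direction @ kept < threshold for kept in pooled):
                pooled.append(direction)
    return np.stack(pooled)

# Toy data: 2 directions shared by all runs plus 2 run-specific ones per run.
shared = np.random.default_rng(0).normal(size=(2, 64))

def toy_run(seed):
    r = np.random.default_rng(seed)
    noisy_shared = shared + 0.01 * r.normal(size=shared.shape)
    return np.vstack([noisy_shared, r.normal(size=(2, 64))])

pooled = pool_unique_features([toy_run(seed) for seed in (1, 2, 3)])
print(pooled.shape)
```

The shared directions enter the pool once and the run-specific directions are all retained, so the pooled dictionary is larger than any single run without containing near-duplicates.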
Topics
- Sparse Autoencoders
- Feature Stability
- Neural Network Interpretation
- Model Reproducibility
- Basis Ambiguity
- Machine Learning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.