Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

Source: Artificial Intelligence · Field: Technology & Digital, Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A study of sparse autoencoders (SAEs) asks how reproducibly learned features recur across training runs. It introduces "feature stability", a metric that estimates the probability that a similar feature reappears in an independently trained SAE. In a large-scale analysis spanning seeds, models, layers, dictionary sizes, and SAE variants, the research identifies a functional asymmetry: stable features are crucial for reconstruction and prediction, while unstable features contribute minimally and are linked to low-frequency surface-form triggers. Geometrically, unstable features are not individually reproducible, yet they fall within reproducible lower-rank subspaces, indicating that seed dependence stems from basis ambiguity rather than random noise. A synthetic model makes this explicit: low-rank ground-truth features are recoverable at the subspace level even when individual SAE latents are not identifiable across seeds. The work also shows that pooling unique cross-seed features can yield more stable SAEs while maintaining explained variance.

Key takeaway

For machine learning engineers who use sparse autoencoders to interpret neural network representations, feature stability indicates which features can be trusted across training runs. Prioritize stable features in your analysis, since they carry the primary functional signal for reconstruction and prediction. Treat individually unstable features as a likely sign of basis ambiguity within reproducible subspaces rather than as pure noise. To make interpretations more consistent across runs, consider pooling unique cross-seed features to construct more stable SAEs.

Key insights

Unstable SAE features are not individually reproducible, yet they lie in shared low-dimensional subspaces: their seed dependence reflects basis ambiguity, not noise.
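
The distinction between feature-level and subspace-level reproducibility can be illustrated with a toy NumPy sketch. It is not the study's experiment: the two "seeds" below are simply two random orthonormal bases of one shared subspace, and all names (`best_match_cosines`, `principal_angle_cosines`, the dimensions 64 and 8) are assumptions made for illustration.

```python
import numpy as np

def random_orthonormal_basis(rng, dim, rank):
    """Return a (dim, rank) matrix with orthonormal columns."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, rank)))
    return q

def best_match_cosines(a, b):
    """For each column of a, the largest |cosine| with any column of b."""
    a = a / np.linalg.norm(a, axis=0, keepdims=True)
    b = b / np.linalg.norm(b, axis=0, keepdims=True)
    return np.abs(a.T @ b).max(axis=1)

def principal_angle_cosines(a, b):
    """Cosines of the principal angles between the column spans of a and b."""
    qa, _ = np.linalg.qr(a)
    qb, _ = np.linalg.qr(b)
    return np.linalg.svd(qa.T @ qb, compute_uv=False)

rng = np.random.default_rng(0)
dim, rank = 64, 8  # hypothetical sizes, chosen only for the demo
subspace = random_orthonormal_basis(rng, dim, rank)  # shared low-rank structure

# Two hypothetical "seeds": each expresses the same subspace in a different
# rotated basis, mimicking basis ambiguity (an assumption of this toy setup).
seed_a = subspace @ random_orthonormal_basis(rng, rank, rank)
seed_b = subspace @ random_orthonormal_basis(rng, rank, rank)

feature_match = best_match_cosines(seed_a, seed_b)         # per-feature agreement
subspace_match = principal_angle_cosines(seed_a, seed_b)   # subspace agreement

print("mean best-match cosine per feature:", feature_match.mean())
print("smallest principal-angle cosine:", subspace_match.min())
```

In this setup the principal-angle cosines are all numerically 1 because both bases span the same subspace, while the per-feature best-match cosines are typically well below 1. Matching individual directions therefore understates how much the two runs agree.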

Principles

Method

Estimate a feature's stability as the probability that a similar feature reappears in an independently trained SAE, that is, one trained with a different seed. More stable SAEs can then be constructed by pooling the unique features found across seeds.
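
A minimal sketch of both steps, under stated assumptions, might look like the following. The summary does not specify how "similar" is defined or how pooling is performed, so the choices here are hypothetical: similarity is cosine similarity between decoder directions, the cutoff `threshold=0.7` is arbitrary, and pooling is a simple greedy de-duplication. The function names `feature_stability` and `pool_unique_features` are illustrative, not the authors' API.

```python
import numpy as np

def unit_rows(w):
    """Normalize each row (one decoder direction per row) to unit length."""
    return w / np.linalg.norm(w, axis=1, keepdims=True)

def feature_stability(reference, others, threshold=0.7):
    """Fraction of other runs that contain a match for each reference feature.

    reference: (n_features, dim) decoder directions from one run.
    others: list of (n_features_k, dim) arrays from independently seeded runs.
    threshold: assumed cosine cutoff for calling two features "similar".
    """
    ref = unit_rows(reference)
    # For each other run, flag reference features whose best cosine passes the cutoff.
    hits = [(ref @ unit_rows(o).T).max(axis=1) >= threshold for o in others]
    return np.mean(hits, axis=0)

def pool_unique_features(runs, threshold=0.7):
    """Greedily pool directions across runs, skipping near-duplicates."""
    pooled = []
    for w in runs:
        for v in unit_rows(w):
            # Keep v only if no already-pooled direction is similar to it.
            if not pooled or (np.stack(pooled) @ v).max() < threshold:
                pooled.append(v)
    return np.stack(pooled)

# Toy data: three hypothetical "seeds" in a 4-dimensional space. Every run
# learns (roughly) the direction e[0]; each also learns one run-specific direction.
e = np.eye(4)
runs = [
    np.stack([e[0], e[1]]),
    np.stack([e[0] + 0.1 * e[1], e[2]]),
    np.stack([e[0], e[3]]),
]

stability = feature_stability(runs[0], runs[1:])
pooled = pool_unique_features(runs)
print("stability of run 0 features:", stability)
print("pooled dictionary shape:", pooled.shape)
```

On this toy data the shared direction gets stability 1.0 and the run-specific one gets 0.0, and pooling keeps four distinct directions. The signed cosine is used because SAE latents are typically sign-sensitive. A real pipeline would also need to retrain or refit the pooled dictionary and check explained variance, which this sketch does not do.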

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.