Pre-Intervention Prediction of Sparse Autoencoder Steering Side Effects
Summary
A new pre-intervention screening framework forecasts the side effects of sparse autoencoder (SAE) feature steering in language models. It predicts steering modularity, defined by effect stability and collateral spread, from feature statistics computed before any intervention. The framework was evaluated on GPT-2-small, Pythia-70M-deduped, Gemma-2-2B, and Llama-3.1-8B, using ReLU, JumpReLU, and TopK SAE dictionaries. Predictors built from decoder geometry, activation statistics, co-activation structure, and direct-logit footprint consistently outperformed frequency-only and activation-magnitude baselines. The predictive signal was strongest in GPT-2-small, Pythia-70M, and Llama-3.1-8B, and weaker in Gemma-2-2B. Held-out screening showed that ranking features by predicted cleanliness can yield cleaner steering on new contexts, although the modularity axis on which this succeeds varies by model. The predictive signal persisted under a 32K-to-128K dictionary-width change in Llama Scope. Overall, SAE steering side effects are predictable, but which predictors work and which modularity axes transfer depend on the specific model and dictionary settings.
Key takeaway
Machine learning engineers implementing sparse autoencoder (SAE) feature steering should add pre-intervention screening to predict and mitigate side effects. Analyzing feature statistics such as decoder geometry and co-activation structure before steering lets you select features likely to give cleaner, more stable interventions with less collateral spread. Be aware that the most effective predictors, and the modularity axis that actually improves, vary with the language model and the SAE dictionary settings.
Key insights
SAE steering side effects are predictable from pre-intervention feature statistics, though specific predictors vary by model.
Principles
- Steering modularity has two axes: effect stability and collateral spread.
- Decoder geometry and co-activation predict steering cleanliness.
- Predictor signatures are model- and dictionary-dependent.
Method
A pre-intervention screening framework forecasts SAE steering side effects. It uses feature statistics like decoder geometry, activation statistics, co-activation structure, and direct-logit footprint to predict effect stability and collateral spread.
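As a rough illustration of what such pre-intervention statistics could look like, the sketch below computes one toy proxy per predictor family. The summary does not give the paper's exact definitions, so every name and formula here is an assumption: the function `feature_statistics` is hypothetical, decoder geometry is approximated by the largest absolute cosine similarity to another decoder direction, co-activation by mean Jaccard overlap of firing sets, and direct-logit footprint by the norm of the decoder direction projected through the unembedding matrix.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu > 0 and nv > 0 else 0.0


def feature_statistics(decoder, activations, unembed):
    """Toy pre-intervention statistics for each SAE feature (illustrative only).

    decoder:     n_features rows, each a d_model-long decoder direction
    activations: n_tokens rows, each holding n_features SAE activations
    unembed:     d_model rows, each holding vocab_size unembedding weights
    """
    n_features, n_tokens = len(decoder), len(activations)
    vocab_size = len(unembed[0]) if unembed else 0
    # Tokens on which each feature fires (activation strictly positive).
    active = [
        {t for t in range(n_tokens) if activations[t][f] > 0}
        for f in range(n_features)
    ]
    stats = []
    for f in range(n_features):
        others = [g for g in range(n_features) if g != f]
        # Decoder geometry (assumed proxy): closest other decoder direction.
        max_cos = max(
            (abs(cosine(decoder[f], decoder[g])) for g in others), default=0.0
        )
        # Activation statistics: firing frequency and mean value when active.
        freq = len(active[f]) / n_tokens if n_tokens else 0.0
        mean_act = (
            sum(activations[t][f] for t in active[f]) / len(active[f])
            if active[f]
            else 0.0
        )
        # Co-activation (assumed proxy): mean Jaccard overlap of firing sets.
        overlaps = []
        for g in others:
            union = active[f] | active[g]
            overlaps.append(len(active[f] & active[g]) / len(union) if union else 0.0)
        mean_jaccard = sum(overlaps) / len(overlaps) if overlaps else 0.0
        # Direct-logit footprint (assumed proxy): norm of decoder @ unembed.
        logits = [
            dot(decoder[f], [row[v] for row in unembed]) for v in range(vocab_size)
        ]
        stats.append(
            {
                "max_decoder_cosine": max_cos,
                "frequency": freq,
                "mean_activation": mean_act,
                "mean_coactivation": mean_jaccard,
                "logit_footprint": math.sqrt(dot(logits, logits)),
            }
        )
    return stats
```

In a real pipeline these per-feature dictionaries would be the inputs to whatever model predicts effect stability and collateral spread; the nested Python loops are for readability only and would need vectorizing for dictionaries with tens of thousands of features.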
In practice
- Screen SAE features before steering for cleaner interventions.
- Prioritize features with high predicted effect stability.
- Prioritize features with low predicted collateral spread.
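The screening steps above can be sketched as a simple ranking. This is a minimal example under stated assumptions, not the article's procedure: the function `screen_features` and its `axis` parameter are hypothetical, the predicted scores are assumed to come from some already-fitted predictor, and the rank-sum used to combine the two axes is just one plausible choice.

```python
def screen_features(predictions, top_k, axis="both"):
    """Return the top_k feature ids that look cleanest to steer.

    predictions: dict mapping feature id -> (predicted_stability, predicted_spread),
                 where higher stability and lower spread are both better.
    axis:        "stability", "spread", or "both" (sum of the two ranks).
    """
    ids = list(predictions)
    # Rank 0 is the cleanest feature on each axis.
    by_stability = sorted(ids, key=lambda f: -predictions[f][0])
    by_spread = sorted(ids, key=lambda f: predictions[f][1])
    stability_rank = {f: r for r, f in enumerate(by_stability)}
    spread_rank = {f: r for r, f in enumerate(by_spread)}

    if axis == "stability":
        score = stability_rank
    elif axis == "spread":
        score = spread_rank
    elif axis == "both":
        score = {f: stability_rank[f] + spread_rank[f] for f in ids}
    else:
        raise ValueError(f"unknown axis: {axis!r}")

    # Lower combined rank means a cleaner predicted intervention.
    return sorted(ids, key=lambda f: score[f])[:top_k]
```

Because the summary reports that the axis on which screening succeeds differs by model, keeping `axis` selectable lets you check on held-out contexts which ranking actually produces cleaner steering for your model and dictionary.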
Topics
- Sparse Autoencoders
- Language Model Steering
- Feature Modularity
- Side Effect Prediction
- GPT-2-small
- Llama-3.1-8B
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.