Pre-Intervention Prediction of Sparse Autoencoder Steering Side Effects

2026-06-06 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A new pre-intervention screening framework forecasts the side effects of sparse autoencoder (SAE) feature steering in language models. It predicts steering modularity, defined along two axes (effect stability and collateral spread), from feature statistics computed before any intervention. The framework was evaluated on GPT-2-small, Pythia-70M-deduped, Gemma-2-2B, and Llama-3.1-8B with ReLU, JumpReLU, and TopK SAE dictionaries. Predictors built from decoder geometry, activation statistics, co-activation structure, and direct-logit footprint consistently outperformed frequency-only and activation-magnitude baselines. The predictive signal was strongest in GPT-2-small, Pythia-70M-deduped, and Llama-3.1-8B, and weaker in Gemma-2-2B. In held-out screening, ranking features by predicted cleanliness can lead to cleaner steering on new contexts, although the modularity axis that transfers successfully varies by model. The predictive signal persisted under a 32K-to-128K dictionary-width change in Llama Scope. Overall, SAE steering side effects are predictable, but the effective predictors and the modularity axes that transfer depend on the specific model and dictionary settings.

Key takeaway

Machine learning engineers implementing sparse autoencoder (SAE) feature steering should add pre-intervention screening to predict and mitigate side effects. Feature statistics such as decoder geometry and co-activation structure can be used to select features that are likely to give cleaner, more stable interventions with less collateral spread. The most effective predictors, and the modularity axis that actually improves, vary with the chosen language model and SAE dictionary settings.

Key insights

SAE steering side effects are predictable from pre-intervention feature statistics, though specific predictors vary by model.

Principles

Steering modularity has two axes: effect stability and collateral spread.
Decoder geometry and co-activation predict steering cleanliness.
Predictor signatures are model- and dictionary-dependent.
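
To make the two axes concrete, here is a minimal sketch of how they could be scored once steering has been run. The summary does not give the article's exact definitions, so both formulas are assumptions chosen for illustration: effect stability is taken as the mean pairwise cosine similarity of the output shift across contexts, and collateral spread as the share of the absolute shift that lands outside a set of intended target coordinates. The function names and the toy numbers are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def effect_stability(deltas):
    """Assumed definition: mean pairwise cosine of per-context output shifts.

    deltas: list of vectors, one per context, each the change in model
    output caused by steering the same feature in that context.
    """
    if len(deltas) < 2:
        raise ValueError("need at least two contexts to measure stability")
    sims = [cosine(a, b) for a, b in combinations(deltas, 2)]
    return sum(sims) / len(sims)


def collateral_spread(delta, target_idx):
    """Assumed definition: fraction of absolute shift outside target coordinates."""
    total = sum(abs(x) for x in delta)
    if total == 0:
        return 0.0
    off_target = sum(abs(x) for k, x in enumerate(delta) if k not in target_idx)
    return off_target / total


# Toy example: three contexts, three output coordinates, coordinate 0 is the target.
shifts = [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0], [1.1, 0.0, 0.0]]
stability = effect_stability(shifts)                   # close to 1: consistent effect
spread = collateral_spread([1.0, 0.5, 0.5], {0})       # half the shift is off-target
```

Under these assumed definitions, a clean feature scores high on stability and low on spread. Screening aims to predict such scores without running the steering intervention itself.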

Method

A pre-intervention screening framework forecasts SAE steering side effects. It uses feature statistics like decoder geometry, activation statistics, co-activation structure, and direct-logit footprint to predict effect stability and collateral spread.
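
The four predictor families named above can be sketched from quantities available before any steering is run. This is an illustrative sketch and not the article's implementation: the specific statistics (maximum decoder cosine to another feature, firing frequency and mean active magnitude, mean Jaccard co-activation, and top-k share of direct-logit mass) are assumptions, as are all function names and the toy matrices.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def decoder_max_cosine(decoder, f):
    """Decoder geometry: largest cosine between feature f's direction and any other."""
    return max(cosine(decoder[f], d) for g, d in enumerate(decoder) if g != f)


def activation_stats(acts, f):
    """Activation statistics: (firing frequency, mean magnitude when active)."""
    active = [row[f] for row in acts if row[f] > 0]
    freq = len(active) / len(acts)
    mean_active = sum(active) / len(active) if active else 0.0
    return freq, mean_active


def coactivation(acts, f):
    """Co-activation: mean Jaccard overlap of f's active tokens with each other feature's."""
    on_f = {t for t, row in enumerate(acts) if row[f] > 0}
    scores = []
    for g in range(len(acts[0])):
        if g == f:
            continue
        on_g = {t for t, row in enumerate(acts) if row[g] > 0}
        union = on_f | on_g
        scores.append(len(on_f & on_g) / len(union) if union else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


def logit_footprint(direction, unembed, top_k=1):
    """Direct-logit footprint: share of absolute direct-logit mass in the top_k tokens."""
    logits = [abs(sum(a * b for a, b in zip(direction, row))) for row in unembed]
    total = sum(logits)
    return sum(sorted(logits, reverse=True)[:top_k]) / total if total else 0.0


# Toy inputs: 3 features in a 2-d residual space, 4 tokens, 3-token vocabulary.
decoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]          # one row per feature
acts = [[1.0, 0.0, 0.5], [0.0, 2.0, 0.0], [3.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
unembed = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]           # one row per vocab token

geom = decoder_max_cosine(decoder, 0)
freq, mean_active = activation_stats(acts, 0)
coact = coactivation(acts, 0)
footprint = logit_footprint(decoder[0], unembed, top_k=1)
```

In a real pipeline these per-feature statistics would be stacked into a feature matrix and fed to a regressor that predicts the two modularity axes. Frequency and activation magnitude alone correspond to the baselines the richer predictors are reported to outperform.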

In practice

Screen SAE features before steering for cleaner interventions.
Prioritize features with high predicted effect stability.
Prioritize features with low predicted collateral spread.
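
The screening steps above can be sketched as a simple ranking over predicted scores. The scoring rule here (predicted stability minus predicted spread) and the `axis` switch are assumptions for illustration. The switch reflects the summary's point that the axis worth ranking on varies by model. The feature ids and predicted values are made up.

```python
def screen_features(predictions, k, axis="both"):
    """Return the k feature ids ranked cleanest by predicted modularity.

    predictions: dict mapping feature id -> (predicted_stability, predicted_spread).
    axis: "stability", "spread", or "both" (hypothetical combined score).
    """
    if axis not in ("stability", "spread", "both"):
        raise ValueError(f"unknown axis: {axis!r}")

    def score(item):
        _, (stability, spread) = item
        if axis == "stability":
            return stability            # higher predicted stability is better
        if axis == "spread":
            return -spread              # lower predicted spread is better
        return stability - spread       # assumed equal weighting of both axes

    ranked = sorted(predictions.items(), key=score, reverse=True)
    return [feature_id for feature_id, _ in ranked[:k]]


# Hypothetical predicted scores for four SAE features.
predicted = {
    101: (0.90, 0.20),
    202: (0.60, 0.05),
    303: (0.95, 0.70),
    404: (0.40, 0.50),
}

top_both = screen_features(predicted, k=2)                         # [101, 202]
top_stability = screen_features(predicted, k=2, axis="stability")  # [303, 101]
top_spread = screen_features(predicted, k=2, axis="spread")        # [202, 101]
```

Note how feature 303 ranks first on stability alone but drops out once spread is considered. Checking both rankings on held-out contexts is one way to find which axis transfers for a given model and dictionary.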

Topics

Sparse Autoencoders
Language Model Steering
Feature Modularity
Side Effect Prediction
GPT-2-small
Llama-3.1-8B

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer


Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.