Non-linear Interventions on Large Language Models
Summary
This work introduces a framework for non-linear interventions on Large Language Models (LLMs), addressing a limitation of existing linear methods, which struggle with features encoded along non-linear manifolds. The approach generalizes linear interventions with an invertible non-linear feature map, $f_{\theta}:\mathbb{R}^{d}\to\mathbb{R}^{d}$, which transforms hidden states into a feature space where perturbations are applied along coordinate axes and then mapped back to the hidden-state space. The map is learned with interchange interventions and a self-supervised loss designed for implicit features that lack a direct output signature. The framework was validated on refusal bypass steering with Llama-3-8B-Instruct and Qwen2.5-7B-Instruct, where it achieved steering effectiveness comparable to linear baselines with an intervention magnitude more than two orders of magnitude smaller. The analysis indicates that the non-linearity of $f_{\theta}$ is crucial and that interventions are most effective at the model's middle layers.
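The intervention described above can be written compactly. This is a sketch only: the symbols $h$ (a hidden state), $\alpha$ (a scalar step size), and $e_i$ (the $i$-th standard basis vector in feature space) are notation assumed here for illustration, not taken from the paper.

$$h' = f_{\theta}^{-1}\big(f_{\theta}(h) + \alpha\, e_i\big)$$

If $f_{\theta}$ is an invertible linear map, $f_{\theta}(h) = Wh$, this reduces to $h' = h + \alpha W^{-1} e_i$, which is the addition of a fixed steering vector. That is the sense in which the framework generalizes linear interventions.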
Key takeaway
For research scientists developing LLM steering techniques, this non-linear intervention framework offers a path to more precise and efficient control over model behavior. Consider implementing invertible non-linear feature maps, such as i-ResNets, to target concepts like refusal that may be represented non-linearly. In the reported experiments, this approach achieved steering performance comparable to linear methods with an intervention magnitude more than two orders of magnitude smaller, which could improve model stability and interpretability.
Key insights
Non-linear interventions can more precisely steer LLM behavior by targeting features encoded along non-linear manifolds.
Principles
- Not all LLM features are linearly represented.
- Implicit features can be learned via correlated downstream signals.
- Non-linear interventions can achieve a similar effect with a much smaller perturbation.
Method
Learn an invertible non-linear feature map $f_{\theta}$ using interchange interventions and a self-supervised loss that enforces causal influence over features correlated with the target implicit feature.
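The interchange step named above can be illustrated with a toy example. This is a minimal sketch, not the paper's implementation: it uses an additive coupling map as a stand-in for $f_{\theta}$ because its inverse is exact and easy to read. The functions `shift`, `f`, `f_inv`, and `interchange` are hypothetical names, and the fixed `shift` function stands in for learned parameters.

```python
import math

def shift(x1):
    # Hypothetical fixed non-linear shift; a trained map would learn this.
    return [math.tanh(2.0 * a) + 0.5 * a * a for a in x1]

def f(h):
    """Additive coupling map (assumes even length): the first half is
    kept unchanged, the second half is shifted by a function of the first."""
    mid = len(h) // 2
    x1, x2 = h[:mid], h[mid:]
    return x1 + [b + s for b, s in zip(x2, shift(x1))]

def f_inv(z):
    """Exact inverse of f: subtract the same shift from the second half."""
    mid = len(z) // 2
    z1, z2 = z[:mid], z[mid:]
    return z1 + [b - s for b, s in zip(z2, shift(z1))]

def interchange(h_base, h_source, axis):
    """Copy one feature coordinate from the source state into the base
    state in feature space, then map back to hidden-state space."""
    z_base, z_source = f(h_base), f(h_source)
    z_base[axis] = z_source[axis]
    return f_inv(z_base)

# Toy 4-dimensional hidden states from a "base" and a "source" input.
h_base = [0.2, -0.5, 1.0, 0.3]
h_source = [-1.0, 0.8, 0.1, -0.7]
h_patched = interchange(h_base, h_source, axis=3)
```

In a typical interchange-intervention setup, the patched state would replace the base hidden state and the forward pass would continue from there, with the loss computed on the resulting output. The specific self-supervised loss is described in the paper and is not reproduced here.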
In practice
- Apply non-linear interventions for fine-grained LLM steering.
- Focus intervention efforts on middle layers for optimal effect.
- Consider i-ResNet for implementing invertible feature maps.
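To make the i-ResNet suggestion concrete, the sketch below builds a single invertible residual block and uses it to perturb one feature axis. An i-ResNet block has the form $x + g(x)$ and is invertible when $g$ has Lipschitz constant below 1, with the inverse computed by fixed-point iteration. Everything specific here is an assumption for illustration: the weights `W`, `B`, and the scale `C` are hand-picked rather than learned, the Lipschitz bound is enforced in the max-norm rather than by spectral normalization, and `steer` is a hypothetical helper name.

```python
import math

# Hypothetical fixed weights standing in for learned parameters (d = 3).
# Each row of W has absolute sum 0.9, so C * 0.9 = 0.81 < 1.
W = [[0.5, -0.3, 0.1],
     [0.2, 0.4, -0.3],
     [-0.1, 0.3, 0.5]]
B = [0.1, -0.2, 0.05]
C = 0.9

def residual(x):
    """g(x) = C * tanh(W x + B). tanh is 1-Lipschitz, so g is a
    contraction in the max-norm with constant at most 0.81."""
    return [C * math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W, B)]

def forward(x):
    """Invertible residual block: z = x + g(x)."""
    return [xi + gi for xi, gi in zip(x, residual(x))]

def inverse(z, n_iter=200):
    """Solve x = z - g(x) by fixed-point iteration, which converges
    because g is a contraction."""
    x = list(z)
    for _ in range(n_iter):
        x = [zi - gi for zi, gi in zip(z, residual(x))]
    return x

def steer(h, axis, alpha):
    """Shift one feature-space coordinate by alpha and map back."""
    z = forward(h)
    z[axis] += alpha
    return inverse(z)

h = [0.3, -1.2, 0.7]
h_steered = steer(h, axis=1, alpha=0.5)
```

Because the map is non-linear, the resulting change `h_steered - h` in hidden-state space depends on `h` itself, unlike a linear intervention that adds the same vector everywhere.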
Topics
- Non-linear Interventions
- Large Language Models
- Feature Representation
- Refusal Bypass Steering
- Interchange Interventions
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.