Non-linear Interventions on Large Language Models

Source: cs.CL updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

This work introduces a framework for non-linear interventions on Large Language Models (LLMs). It addresses a limitation of existing linear methods, which struggle when features are encoded along non-linear manifolds. The approach generalizes linear interventions by using an invertible non-linear feature map, $f_{\theta}:\mathbb{R}^{d}\to\mathbb{R}^{d}$, that transforms hidden states into a feature space; a perturbation is applied along a coordinate axis of that space, and the result is mapped back to the hidden-state space. The map is learned with interchange interventions, together with a self-supervised loss designed for implicit features that lack a direct output signature. The framework was validated on refusal bypass steering with Llama-3-8B-Instruct and Qwen2.5-7B-Instruct, where it achieved steering effectiveness comparable to linear baselines with an intervention magnitude over two orders of magnitude smaller. Analysis indicates that the non-linearity of $f_{\theta}$ is crucial and that the most effective interventions occur at the model's middle layers.
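
The relationship between the two kinds of intervention can be written out as a short formula. This is a sketch in notation chosen here for illustration: the step size $\alpha$, the steering direction $v$, the standard basis vector $e_k$, and the matrix $Q$ are assumptions, not symbols taken from the paper. A linear intervention shifts a hidden state $h\in\mathbb{R}^{d}$ along a fixed direction, while the non-linear version shifts one coordinate in feature space and maps back:

$$h'_{\text{lin}} = h + \alpha v, \qquad h'_{\text{nonlin}} = f_{\theta}^{-1}\big(f_{\theta}(h) + \alpha e_k\big).$$

If $f_{\theta}$ happens to be an invertible linear map, $f_{\theta}(h) = Qh$, the second expression reduces to the first with $v = Q^{-1}e_k$:

$$f_{\theta}^{-1}\big(Qh + \alpha e_k\big) = Q^{-1}Qh + \alpha Q^{-1}e_k = h + \alpha Q^{-1}e_k.$$

This is one way to see why the non-linear framework contains linear steering as a special case.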

Key takeaway

For research scientists developing LLM steering techniques, this non-linear intervention framework offers a path to more precise and efficient control over model behavior. Consider implementing invertible non-linear feature maps, such as i-ResNets, to target concepts like refusal that may be represented non-linearly. In the reported experiments, the approach achieves steering performance comparable to linear methods with an intervention magnitude over two orders of magnitude smaller, which could improve model stability and interpretability.

Key insights

Non-linear interventions can more precisely steer LLM behavior by targeting features encoded along non-linear manifolds.
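
A toy example may help make the manifold point concrete. The sketch below is an illustration constructed here, not an experiment from the paper: it assumes a feature stored as the angle of a 2-D point on the unit circle, and uses polar coordinates as a hand-picked (not learned) feature map that is invertible away from the origin. All function names are hypothetical.

```python
import math

def to_polar(h):
    """Hand-picked feature map: (x, y) -> (radius, angle)."""
    x, y = h
    return [math.hypot(x, y), math.atan2(y, x)]

def from_polar(z):
    """Inverse of to_polar: (radius, angle) -> (x, y)."""
    r, theta = z
    return [r * math.cos(theta), r * math.sin(theta)]

def linear_steer(h, direction, alpha):
    """Additive steering in the original space: h + alpha * direction."""
    return [hi + alpha * di for hi, di in zip(h, direction)]

def angular_steer(h, delta):
    """Shift only the angle coordinate in feature space, then map back."""
    r, theta = to_polar(h)
    return from_polar([r, theta + delta])

h = [1.0, 0.0]                                # feature value: angle 0
lin = linear_steer(h, [0.0, 1.0], alpha=1.0)  # reaches angle pi/4 but leaves the circle
ang = angular_steer(h, math.pi / 4)           # reaches angle pi/4 and stays on the circle

print("linear radius :", math.hypot(*lin))
print("angular radius:", math.hypot(*ang))
```

Both edits move the feature to the same angle, but the additive edit also changes the radius, a coordinate that was not meant to be touched. An edit made along the feature's own coordinate changes nothing else.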

Principles

Method

Learn an invertible non-linear feature map $f_{\theta}$ using interchange interventions and a self-supervised loss that enforces causal influence over features correlated with the target implicit feature.
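
The mechanics of intervening through an invertible map can be sketched in a few lines of dependency-free Python. This is a minimal sketch under stated assumptions, not the paper's implementation: it uses a single i-ResNet-style residual block $f(x) = x + g(x)$ with random, untrained weights, makes $g$ a contraction by bounding the weight matrix's Frobenius norm, and inverts $f$ by fixed-point iteration. The function names, the `scale` parameter, and the choice of axis are all hypothetical, and the training loss is not shown because the summary does not specify it.

```python
import math
import random

def make_residual_block(dim, scale=0.5, seed=0):
    """Return (f, f_inv) for f(x) = x + scale * tanh(W x).

    W is normalized to Frobenius norm 1, which upper-bounds its spectral
    norm, so g(x) = scale * tanh(W x) is scale-Lipschitz. With scale < 1
    the block is invertible. Weights are random here, not learned.
    """
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(dim)]
    fro = math.sqrt(sum(w * w for row in W for w in row))
    W = [[w / fro for w in row] for row in W]

    def g(x):
        return [scale * math.tanh(sum(w * xi for w, xi in zip(row, x)))
                for row in W]

    def f(x):
        return [xi + gi for xi, gi in zip(x, g(x))]

    def f_inv(y, n_iter=60):
        # Fixed-point iteration x <- y - g(x); converges since g is a contraction.
        x = list(y)
        for _ in range(n_iter):
            x = [yi - gi for yi, gi in zip(y, g(x))]
        return x

    return f, f_inv

def steer(h, f, f_inv, axis, alpha):
    """Shift one feature-space coordinate by alpha, then map back."""
    z = f(h)
    z[axis] += alpha
    return f_inv(z)

def interchange(h_base, h_source, f, f_inv, axis):
    """Copy one feature coordinate from the source state into the base state."""
    z_base, z_source = f(h_base), f(h_source)
    z_base[axis] = z_source[axis]
    return f_inv(z_base)

f, f_inv = make_residual_block(dim=4)
h = [0.3, -1.2, 0.7, 2.0]            # stand-in for a hidden state
h_src = [1.5, 0.4, -0.9, 0.1]        # stand-in for a source hidden state
h_steered = steer(h, f, f_inv, axis=0, alpha=0.1)
h_swapped = interchange(h, h_src, f, f_inv, axis=0)
```

In a real setting the hidden states would come from a chosen transformer layer and the map would be trained so that the selected axis carries the target feature; here the block only demonstrates that edits made in feature space can be mapped back exactly.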

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.