Causal Interventions on Continuous Variables: A Case Study on Verb Bias in Steering Vectors for In-Context Learning
Summary
A new method enables causal intervention on continuous variables within language model representations, extending previous work that focused on discrete features such as grammatical number. The technique localizes a low-dimensional direction for a graded target variable in activation vectors, then uses that direction to edit the vectors towards counterfactual values. The researchers applied it to verb bias, a continuous psycholinguistic feature that captures which syntactic structure is preferred after a verb. The study shows that verb bias is causally encoded in steering vectors extracted from large language models: counterfactual edits to verb bias systematically alter downstream structural preferences. Steering vectors also contain error signals that could potentially drive the error-driven updates of in-context learning, but these signals are not causally used in downstream production. The findings support the applicability of causal interventions to continuous variables, though fully linking them to in-context learning remains complex.
Key takeaway
For AI Scientists investigating language model interpretability, this research shows you can causally intervene on continuous internal representations. Consider applying the method to other graded linguistic or semantic features to understand their influence on model behavior. Connecting these interventions to in-context learning remains challenging, but the ability to precisely manipulate features such as verb bias offers new avenues for probing model mechanisms and may improve control over model outputs.
Key insights
Causal interventions can effectively manipulate continuous variables like verb bias in language model steering vectors.
Principles
- Verb bias is causally represented in LLM steering vectors.
- Counterfactual edits shift downstream structural preferences.
Method
Localize a low-dimensional direction for a graded target variable from activation vectors, then edit vectors towards counterfactual target values using this direction.
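The two steps above can be illustrated with a minimal sketch. This is not the original article's implementation: it assumes the direction is one-dimensional and found by ordinary least squares, and it uses synthetic activations and synthetic "verb bias" scores. The names `fit_direction`, `predict`, and `edit_towards` are hypothetical helpers introduced only for illustration.

```python
import numpy as np

def fit_direction(H, y):
    """Least-squares fit of y ~ (H - mu) @ w + b (assumed localization step)."""
    mu = H.mean(axis=0)
    b = y.mean()
    w, *_ = np.linalg.lstsq(H - mu, y - b, rcond=None)
    return w, b, mu

def predict(h, w, b, mu):
    """Read out the graded variable from one activation vector."""
    return (h - mu) @ w + b

def edit_towards(h, target, w, b, mu):
    """Move h along w just far enough that the linear readout equals target."""
    current = predict(h, w, b, mu)
    return h + (target - current) * w / (w @ w)

# Synthetic data standing in for activation vectors and graded scores.
rng = np.random.default_rng(0)
d = 16
true_dir = rng.normal(size=d)
H = rng.normal(size=(200, d))
y = H @ true_dir + 0.01 * rng.normal(size=200)

w, b, mu = fit_direction(H, y)
h_edited = edit_towards(H[0], target=0.5, w=w, b=b, mu=mu)
readout_after = predict(h_edited, w, b, mu)
```

Under these assumptions the edit changes only the component of the vector along `w`, so everything orthogonal to the fitted direction is left untouched. Whether the edited vector then shifts the model's downstream behavior is the causal question, and it has to be tested on the model itself.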
In practice
- Manipulate continuous linguistic features in LLMs.
- Investigate causal links between internal states and output.
Topics
- Causal Intervention
- Continuous Variables
- Language Models
- Steering Vectors
- Verb Bias
- In-Context Learning
Best for: NLP Engineer, AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.