Causal Interventions on Continuous Variables: A Case Study on Verb Bias in Steering Vectors for In-Context Learning

Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computational Linguistics · Depth: Expert, quick

Summary

A new method enables causal interventions on continuous variables in language model representations, extending earlier work that targeted discrete features such as grammatical number. The technique localizes a low-dimensional direction for a graded target variable in activation vectors, then uses that direction to edit the vectors toward counterfactual values of the variable. The researchers applied it to verb bias, a continuous psycholinguistic feature that captures which syntactic structures are preferred after a given verb. The study shows that verb bias is causally encoded in steering vectors extracted from large language models: counterfactual edits to verb bias systematically shift downstream structural preferences. The steering vectors also contain error signals of the kind that could drive error-driven updates in in-context learning, but those signals are not causally used in downstream production. The findings confirm that causal interventions apply to continuous variables, though fully linking them to in-context learning remains complex.

Key takeaway

For AI scientists investigating language model interpretability, this research shows that you can causally intervene on continuous internal representations, not only discrete ones. Consider applying the method to other graded linguistic or semantic features to test how they influence model behavior. Connecting these interventions to in-context learning remains challenging, but the ability to edit a feature like verb bias toward a chosen value offers a new way to probe model mechanisms and may improve control over model outputs.

Key insights

Causal interventions can effectively manipulate continuous variables like verb bias in language model steering vectors.

Principles

Method

Localize a low-dimensional direction for a graded target variable from activation vectors, then edit vectors towards counterfactual target values using this direction.
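
The localize-then-edit recipe above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the summary does not say how the direction is localized, so a one-dimensional ridge-regularized linear probe stands in for that step. The function names `fit_direction` and `edit_toward`, the `ridge` parameter, and the synthetic activations are all hypothetical.

```python
import numpy as np


def fit_direction(X, y, ridge=1e-3):
    """Localize one direction for a graded variable (assumed: ridge linear probe).

    X: (n, d) activation vectors; y: (n,) graded target values (e.g. verb bias).
    Returns w, mu, b such that y is approximated by (x - mu) @ w + b.
    """
    mu = X.mean(axis=0)
    Xc = X - mu
    b = y.mean()
    d = X.shape[1]
    # Closed-form ridge solution: (Xc^T Xc + ridge * I) w = Xc^T (y - b)
    w = np.linalg.solve(Xc.T @ Xc + ridge * np.eye(d), Xc.T @ (y - b))
    return w, mu, b


def edit_toward(x, target, w, mu, b):
    """Move x along w only, so that the probe reads `target` afterwards.

    Works for a single vector of shape (d,) or a batch of shape (n, d).
    Components of x orthogonal to w are left unchanged.
    """
    current = (x - mu) @ w + b
    step = np.asarray((target - current) / (w @ w))
    return x + step[..., None] * w


if __name__ == "__main__":
    # Synthetic stand-in for steering vectors; real activations are not available here.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 16))
    true_dir = rng.normal(size=16)
    y = X @ true_dir + 0.05 * rng.normal(size=200)

    w, mu, b = fit_direction(X, y)
    x_cf = edit_toward(X[0], 2.0, w, mu, b)  # counterfactual target value 2.0
    print("probe reading before:", (X[0] - mu) @ w + b)
    print("probe reading after: ", (x_cf - mu) @ w + b)
```

In an actual intervention study, the edited vector would be patched back into the model and the downstream structural preference measured. The sketch only checks that the probe reading moves to the counterfactual value.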

Best for: NLP Engineer, AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.