Leverage Is Not Reach: A Control-Window Law for Single-Neuron Steering in Language Models
Summary
Aligned language models gate behaviors like refusal and language routing through sparse feed-forward neurons, but a theory predicting coherent single-neuron control without output collapse has been lacking. This work introduces a budget-normalized control-window framework for single-neuron steering. It defines a control coordinate based on the alignment between the residual stream and the write direction, driven by a universal saturation curve. Coherent control exists when a behavior trigger remains below a "collapse ceiling." On fifteen held-out neurons, the predicted ceiling achieved a mean absolute error of 0.14 (0.07 in bulk layers), with eleven out of fifteen verdicts holding. The framework identifies three failure modes: collapse before trigger, insufficient propagation depth, or normalization capping neuron influence. It also explains why local gradient attribution anti-predicts control, proposing a forward-only contrastive screen to recover true controllers. For refusal, intervention success is typed, separating coherent bypass from strict actionable reach, with genuine actionable reach appearing only for three of six audited Llama pivots at later rollout horizons.
Key takeaway
NLP engineers auditing or steering language model behaviors such as refusal or language routing should adopt a budgeted, typed approach to single-neuron interventions. Local gradient attribution alone is misleading: according to the paper, it anti-predicts control. Instead, use the control-window law to predict whether coherent control is possible, and a forward-only contrastive screen to identify true controllers. This guards against output collapse and helps distinguish actionable, specific interventions from generic bypasses.
Key insights
A "control-window law" predicts single-neuron steering coherence in LLMs, distinguishing effective interventions from output collapse.
Principles
- Coherent control requires behavior triggers below a collapse ceiling.
- Local gradient attribution anti-predicts true neuron controllers.
- Intervention success is typed, not scalar, for complex behaviors.
Method
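The control-window framework normalizes the intervention dose into a control coordinate, defined by the alignment between the residual stream and the neuron's write direction and governed by a saturation curve. Coherent control is predicted when the behavior trigger falls below the collapse ceiling on that coordinate. A forward-only contrastive screen then recovers the true controllers.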
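To make the idea of a saturating alignment coordinate concrete, here is a minimal geometric sketch. It is an illustration only, not the paper's definition: it assumes the intervention adds `alpha` times a unit write direction to the residual stream, and it measures the cosine between the steered residual and that direction. The names `control_coordinate`, `has_control_window`, `c0`, and `beta` are hypothetical, and treating the trigger and ceiling as values on this same coordinate is also an assumption.

```python
import math


def control_coordinate(c0: float, beta: float) -> float:
    """Cosine between the steered residual (h + alpha * w_hat) and w_hat.

    c0   : cosine between the unsteered residual h and the unit write
           direction w_hat (assumed notation, not from the paper).
    beta : dose normalized by the residual norm, alpha / ||h||.
    """
    # ||h + alpha * w_hat|| / ||h|| = sqrt(1 + 2*beta*c0 + beta^2)
    denom = math.sqrt(1.0 + 2.0 * beta * c0 + beta * beta)
    if denom == 0.0:
        return 0.0  # the steered residual vanished; alignment is undefined
    return (c0 + beta) / denom


def has_control_window(trigger: float, ceiling: float) -> bool:
    """Coherent control is predicted only if the behavior triggers
    before the collapse ceiling is reached (both on the same coordinate)."""
    return trigger < ceiling
```

Under these assumptions the coordinate rises with the normalized dose and flattens as it approaches 1, which is one simple way a saturation curve can arise. A neuron whose trigger sits at or above its ceiling would be classed as "collapse before trigger."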
In practice
- Audit neuron controllability using a budgeted, typed approach.
- Identify failure modes like collapse or insufficient depth.
- Use contrastive screens to find true controllers.
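The last point can be sketched as a forward-only ranking. The summary does not describe the paper's actual screen, so this sketch substitutes a generic contrast: a standardized mean-activation gap between prompts that show the behavior and prompts that do not, computed from forward passes only. The function names and the input layout (a dict from neuron id to recorded activations) are hypothetical.

```python
from statistics import mean, pstdev


def contrastive_scores(pos_acts, neg_acts, eps=1e-8):
    """Score each neuron by its activation gap between two prompt sets.

    pos_acts / neg_acts: dict mapping neuron id -> list of activations
    recorded on behavior-present / behavior-absent prompts. No gradients
    are used; activations are assumed to come from forward passes only.
    """
    scores = {}
    for neuron, pos in pos_acts.items():
        neg = neg_acts[neuron]
        gap = mean(pos) - mean(neg)
        spread = pstdev(pos + neg)  # pooled spread over both prompt sets
        scores[neuron] = gap / (spread + eps)
    return scores


def top_candidates(scores, k):
    """Return the k neuron ids with the largest absolute contrast."""
    return sorted(scores, key=lambda n: abs(scores[n]), reverse=True)[:k]
```

Neurons ranked highly by a screen like this are only candidates. Each would still need a budgeted intervention to check whether its trigger falls below its collapse ceiling.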
Topics
- Language Models
- Neuron Steering
- Control Window Law
- Behavioral Control
- Gradient Attribution
- LLM Refusal
Best for: Research Scientist, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.