Leverage Is Not Reach: A Control-Window Law for Single-Neuron Steering in Language Models

Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

Aligned language models gate behaviors such as refusal and language routing through sparse feed-forward neurons, but a theory that predicts when a single neuron can be steered coherently, without collapsing the output, has been lacking. This work introduces a budget-normalized control-window framework for single-neuron steering. It defines a control coordinate from the alignment between the residual stream and the neuron's write direction, and this coordinate follows a universal saturation curve. Coherent control exists when the behavior's trigger stays below a "collapse ceiling." On fifteen held-out neurons, the predicted ceiling had a mean absolute error of 0.14 (0.07 in bulk layers), and eleven of the fifteen verdicts held. The framework identifies three failure modes: the output collapses before the trigger is reached, propagation depth is insufficient, or normalization caps the neuron's influence. It also explains why local gradient attribution anti-predicts control, and it proposes a forward-only contrastive screen to recover the true controllers. For refusal, intervention success is typed: coherent bypass is separated from strict actionable reach, and genuine actionable reach appears for only three of six audited Llama pivots, at later rollout horizons.

Key takeaway

NLP engineers who audit or steer language model behaviors such as refusal or language routing should adopt a budgeted, typed approach to single-neuron interventions. Relying solely on local gradient attribution is misleading; instead, use the control-window law to predict where coherent control is possible and identify true controllers with a forward-only contrastive screen. This helps avoid output collapse and separates actionable, specific results from generic bypasses.
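The summary does not say how the forward-only contrastive screen is computed, so the following is only a minimal sketch of one plausible reading: score each neuron by how differently it activates on two contrasting prompt sets, using recorded forward-pass activations and no gradients. The function names, the standardized mean-difference score, and the toy activations are all assumptions made for illustration, not the paper's procedure.

```python
from statistics import mean, pstdev

def contrastive_scores(acts_pos, acts_neg, eps=1e-8):
    """Hypothetical screen: standardized mean activation gap per neuron.

    acts_pos / acts_neg are lists of rows, one row per prompt, one column
    per neuron, recorded from forward passes only (no backward pass).
    """
    n_neurons = len(acts_pos[0])
    scores = []
    for j in range(n_neurons):
        pos = [row[j] for row in acts_pos]
        neg = [row[j] for row in acts_neg]
        spread = pstdev(pos + neg)  # pooled spread over both prompt sets
        scores.append((mean(pos) - mean(neg)) / (spread + eps))
    return scores

def top_candidates(scores, k):
    """Indices of the k neurons with the largest absolute score."""
    order = sorted(range(len(scores)), key=lambda j: abs(scores[j]), reverse=True)
    return order[:k]

# Toy activations (invented): 2 prompts per set, 3 neurons.
acts_pos = [[1.0, 0.0, 5.0], [1.1, 0.1, 5.2]]
acts_neg = [[1.0, 0.1, 0.0], [0.9, 0.0, 0.2]]
scores = contrastive_scores(acts_pos, acts_neg)
candidates = top_candidates(scores, k=2)
```

Candidates from a screen like this would still need to be checked by actually intervening on them, since a large activation gap alone does not show that a neuron controls the behavior.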

Key insights

A "control-window law" predicts single-neuron steering coherence in LLMs, distinguishing effective interventions from output collapse.

Method

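The control-window framework normalizes the intervention dose by a budget and maps it to a control coordinate, defined by the alignment between the residual stream and the neuron's write direction. This coordinate follows a universal saturation curve, and coherent control is predicted when the behavior trigger falls below the collapse ceiling. A forward-only contrastive screen is then used to recover the true controllers.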
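The summary gives no formulas, so the sketch below is an illustration under stated assumptions rather than the paper's definitions. It assumes the control coordinate is the cosine between the steered residual vector and the write direction, that the dose is expressed as a fraction (the "budget") of the residual norm, and that the window verdict is simply "trigger below ceiling". All function names and numbers are hypothetical.

```python
import math

def _norm(vec):
    return math.sqrt(sum(x * x for x in vec))

def cosine_alignment(residual, write_dir):
    """Cosine between a residual-stream vector and a write direction."""
    norm_r, norm_w = _norm(residual), _norm(write_dir)
    if norm_r == 0.0 or norm_w == 0.0:
        return 0.0
    dot = sum(r * w for r, w in zip(residual, write_dir))
    return dot / (norm_r * norm_w)

def control_coordinate(residual, write_dir, budget):
    """Assumed coordinate: alignment after a budget-normalized push.

    The push has size budget * ||residual|| along the unit write direction,
    so the same budget means the same relative dose at every layer.
    """
    norm_w = _norm(write_dir)
    if norm_w == 0.0:
        return 0.0
    dose = budget * _norm(residual)
    steered = [r + dose * w / norm_w for r, w in zip(residual, write_dir)]
    return cosine_alignment(steered, write_dir)

def has_control_window(trigger, ceiling):
    """Assumed verdict: coherent control only if trigger sits below ceiling."""
    return trigger < ceiling

# Toy example (invented numbers): the coordinate rises with budget
# and saturates below 1.
residual, write_dir = [3.0, 4.0], [1.0, 0.0]
coords = [control_coordinate(residual, write_dir, b) for b in (0.0, 1.0, 10.0)]
```

Under these assumptions the coordinate grows monotonically with the budget and flattens as it approaches 1, which is one simple way a saturation curve can arise; the paper's actual curve and its collapse ceiling may be defined differently.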

Best for: Research Scientist, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.