Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control
Summary
A new method, Activation-LQR (A-LQR), improves inference-time LLM alignment by modeling transformer blocks as a locally linear, time-varying dynamical system. It adapts the classical linear quadratic regulator (LQR) to compute feedback controllers from layer-wise Jacobians, steering activations toward desired semantic setpoints without offline training. The method introduces an adaptive semantic feature setpoint (LFS) signal, enabling robust, fine-grained behavior control across LLM architectures and scales. The reported results are state of the art: toxicity reduced by up to 50x, truthfulness improved by up to 17% in T*I score for Llama-3-8B, and arbitrary concept elicitation. The method is also shown to be useful for mechanistic jailbreaking, particularly in its A-LQR+ variant, which intervenes across all token positions.
Key takeaway
AI engineers and research scientists developing or deploying LLMs should consider A-LQR for fine-grained, inference-time behavior control. The method is training-free: it reduces toxicity and improves truthfulness while preserving model coherence. Because it steers activations adaptively from online feedback, it offers a robust alternative to traditional fine-tuning, especially in high-stakes applications where dynamic alignment is critical.
Key insights
LLM transformer dynamics are locally linear, enabling efficient, model-based optimal control for activation steering.
Principles
- Layer-wise dynamics are well approximated by locally linear models.
- Optimal control can minimize interventions while preserving LLM reasoning.
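To make the local-linearity principle concrete, here is a minimal sketch that compares a layer's true response to a small perturbation with the first-order prediction from its Jacobian. The `toy_block` function is a hypothetical stand-in for a transformer layer, not the model or procedure used in the paper, and the Jacobian is estimated by finite differences purely for illustration.

```python
import numpy as np

def toy_block(x, W1, W2):
    """Hypothetical residual block standing in for a transformer layer."""
    return x + W2 @ np.tanh(W1 @ x)

def numerical_jacobian(f, x, eps=1e-6):
    """Central finite-difference estimate of the Jacobian of f at x."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for i in range(x.size):
        step = np.zeros(x.size)
        step[i] = eps
        J[:, i] = (f(x + step) - f(x - step)) / (2 * eps)
    return J

def linearization_error(f, x, delta):
    """Relative gap between f(x + delta) and the prediction f(x) + J @ delta."""
    J = numerical_jacobian(f, x)
    predicted = f(x) + J @ delta
    actual = f(x + delta)
    return np.linalg.norm(actual - predicted) / np.linalg.norm(actual - f(x))

rng = np.random.default_rng(0)
d = 8  # illustrative hidden size
W1 = rng.normal(scale=0.3, size=(d, d))
W2 = rng.normal(scale=0.3, size=(d, d))
f = lambda x: toy_block(x, W1, W2)

x0 = rng.normal(size=d)
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

err_small = linearization_error(f, x0, 1e-2 * direction)  # small perturbation
err_large = linearization_error(f, x0, 1.0 * direction)   # large perturbation
print(f"relative error, small step: {err_small:.2e}")
print(f"relative error, large step: {err_large:.2e}")
```

The linear model is only trusted near the point where the Jacobian was taken, which is why the error grows with the size of the perturbation.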
Method
Model LLM inference as a linear time-varying dynamical system and adapt the linear quadratic regulator (LQR) with layer-wise Jacobians to compute feedback controllers, steering activations toward adaptive semantic feature setpoints (LFS).
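As a rough illustration of the control step, the sketch below runs the textbook finite-horizon, time-varying LQR backward Riccati recursion over a stack of layers. It is not the paper's implementation: the matrices `A_list` are random placeholders for layer-wise Jacobians, `B_list` assumes the intervention is added directly to the activation, and the error `e_t` (activation minus setpoint) is assumed to follow `e_{t+1} = A_t e_t + B_t u_t` with any affine drift ignored. The cost weights `Q`, `R`, and `Q_final` are likewise illustrative.

```python
import numpy as np

def lqr_gains(A_list, B_list, Q, R, Q_final):
    """Backward Riccati recursion for e_{t+1} = A_t e_t + B_t u_t.

    Returns gains K_t such that u_t = -K_t e_t minimizes
    sum_t (e_t' Q e_t + u_t' R u_t) + e_T' Q_final e_T.
    """
    P = Q_final
    gains = []
    for A, B in zip(reversed(A_list), reversed(B_list)):
        S = R + B.T @ P @ B
        K = np.linalg.solve(S, B.T @ P @ A)   # K = S^{-1} B' P A
        P = Q + A.T @ P @ (A - B @ K)         # Riccati update
        P = 0.5 * (P + P.T)                   # keep P symmetric numerically
        gains.append(K)
    return gains[::-1]                        # reorder from first layer to last

def rollout(A_list, B_list, gains, e0):
    """Propagate the error through all layers under u_t = -K_t e_t."""
    e = e0
    for A, B, K in zip(A_list, B_list, gains):
        e = A @ e + B @ (-K @ e)
    return e

rng = np.random.default_rng(0)
d, n_layers = 6, 4  # illustrative sizes
A_list = [np.eye(d) + 0.1 * rng.normal(size=(d, d)) for _ in range(n_layers)]
B_list = [np.eye(d) for _ in range(n_layers)]
Q = np.zeros((d, d))          # no penalty on intermediate error (assumption)
R = np.eye(d)                 # penalize intervention size
Q_final = 100.0 * np.eye(d)   # penalize distance to the setpoint at the last layer

gains = lqr_gains(A_list, B_list, Q, R, Q_final)
e0 = rng.normal(size=d)
e_steered = rollout(A_list, B_list, gains, e0)
e_free = rollout(A_list, B_list, [np.zeros((d, d))] * n_layers, e0)
print("final error norm, steered:  ", np.linalg.norm(e_steered))
print("final error norm, unsteered:", np.linalg.norm(e_free))
```

Raising `R` relative to `Q_final` trades setpoint accuracy for smaller interventions, which is the lever an optimal-control formulation offers for limiting how much the activations are disturbed.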
In practice
- Apply A-LQR for toxicity reduction and truthfulness improvement.
- Use A-LQR+ for more invasive control like jailbreaking.
- Precompute LQR gains for efficient online inference.
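The point about precomputed gains can be sketched as follows: once the gains are stored, the online cost per layer is one matrix-vector product and one subtraction. Everything here is hypothetical scaffolding rather than the paper's code. `PrecomputedController` and `run_layers` are made-up names, the layers are toy residual maps, and the gains are placeholder matrices (`0.5 * I`) instead of gains obtained from an actual LQR solve.

```python
import numpy as np

class PrecomputedController:
    """Holds per-layer feedback gains and a setpoint, both computed offline."""

    def __init__(self, gains, setpoint):
        self.gains = np.asarray(gains, dtype=float)        # shape (n_layers, d, d)
        self.setpoint = np.asarray(setpoint, dtype=float)  # shape (d,)

    def steer(self, layer_idx, h):
        # Feedback law u = -K (h - setpoint), added to the activation.
        return h - self.gains[layer_idx] @ (h - self.setpoint)

def run_layers(layers, h, controller=None):
    """Toy forward pass; steering is applied to the input of each layer."""
    for idx, layer in enumerate(layers):
        if controller is not None:
            h = controller.steer(idx, h)
        h = layer(h)
    return h

rng = np.random.default_rng(0)
d, n_layers = 6, 4  # illustrative sizes
weights = [rng.normal(scale=0.3, size=(d, d)) for _ in range(n_layers)]
layers = [lambda h, W=W: h + 0.05 * np.tanh(W @ h) for W in weights]

setpoint = np.ones(d)
gains = np.stack([0.5 * np.eye(d)] * n_layers)  # placeholder gains, not LQR output
controller = PrecomputedController(gains, setpoint)

h0 = -0.5 * np.ones(d)
dist_free = np.linalg.norm(run_layers(layers, h0) - setpoint)
dist_steered = np.linalg.norm(run_layers(layers, h0, controller) - setpoint)
print("distance to setpoint, unsteered:", dist_free)
print("distance to setpoint, steered:  ", dist_steered)
```

In a real model the `steer` call would typically live in a per-layer forward hook, with the gain table loaded once at startup.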
Topics
- Activation Steering
- Linear Quadratic Regulator
- LLM Dynamics
- Local Linearity
- LLM Alignment
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.