Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control

Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

A new method, Activation-LQR (A-LQR), improves inference-time LLM alignment by modeling transformer blocks as a locally linear, time-varying dynamical system. It adapts the classical linear quadratic regulator (LQR) to compute feedback controllers from layer-wise Jacobians, steering activations toward desired semantic setpoints without offline training. The method introduces an adaptive semantic feature setpoint (LFS) signal, which enables robust, fine-grained behavior control across LLM architectures and scales. A-LQR achieves state-of-the-art results in modulating toxicity (up to a 50x reduction), improving truthfulness (up to a 17% increase in T*I score for Llama-3-8B), and eliciting arbitrary concepts. It is also useful for mechanistic jailbreaking, particularly in its A-LQR+ variant, which intervenes across all token positions.

Key takeaway

AI engineers and research scientists who develop or deploy LLMs should consider A-LQR for fine-grained, inference-time behavior control. The method is training-free: it reduces toxicity (by up to 50x in the reported results) and improves truthfulness while preserving model coherence. Because it steers activations adaptively from online feedback, it offers a robust alternative to traditional fine-tuning, especially in high-stakes applications where dynamic alignment is critical.

Key insights

LLM transformer dynamics are locally linear, enabling efficient, model-based optimal control for activation steering.
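
To make the local-linearity claim concrete, the following is a minimal sketch on a toy residual block, not on a real transformer layer. The names `toy_block`, `toy_block_jacobian`, and `linearization_error` are hypothetical and chosen for illustration only. The sketch compares the block's true output at a perturbed input with its first-order (Jacobian) approximation; for a smooth map, the gap should shrink roughly quadratically as the perturbation shrinks.

```python
import numpy as np


def toy_block(x, W1, W2):
    """Toy residual MLP block: x + W2 @ tanh(W1 @ x).

    Assumption: a stand-in for one transformer block, used only to
    illustrate first-order (Jacobian) linearization.
    """
    return x + W2 @ np.tanh(W1 @ x)


def toy_block_jacobian(x, W1, W2):
    """Analytic Jacobian of toy_block: I + W2 @ diag(1 - tanh(W1 x)^2) @ W1."""
    h = np.tanh(W1 @ x)
    return np.eye(x.size) + W2 @ ((1.0 - h ** 2)[:, None] * W1)


def linearization_error(x, dx, W1, W2):
    """Norm of the gap between f(x + dx) and f(x) + J(x) @ dx."""
    exact = toy_block(x + dx, W1, W2)
    approx = toy_block(x, W1, W2) + toy_block_jacobian(x, W1, W2) @ dx
    return np.linalg.norm(exact - approx)


# Illustrative sizes and random weights (hypothetical values).
rng = np.random.default_rng(0)
d, hidden = 16, 32
W1 = rng.standard_normal((hidden, d)) / np.sqrt(d)
W2 = rng.standard_normal((d, hidden)) / np.sqrt(hidden)
x = rng.standard_normal(d)
direction = rng.standard_normal(d)

# Shrinking the perturbation should shrink the linearization gap faster.
for scale in (1e-1, 1e-2, 1e-3):
    err = linearization_error(x, scale * direction, W1, W2)
    print(f"perturbation scale {scale:g}: linearization error {err:.3e}")
```

In a real model, the layer-wise Jacobians would come from automatic differentiation through each transformer block rather than from a closed-form expression like the one used here.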

Method

Model LLM inference as a linear time-varying dynamical system and adapt the linear quadratic regulator (LQR) with layer-wise Jacobians to compute feedback controllers, steering activations toward adaptive semantic feature setpoints (LFS).
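
The control step described above can be sketched with a textbook finite-horizon, discrete-time LQR, where the "time" index is the layer index. This is an illustration under stated assumptions, not the paper's implementation: the state is taken to be the deviation `e_l` of the activation from the setpoint, the steering input is assumed to enter additively (`B = I`), the per-layer matrices `A_l` are random stand-ins for layer-wise Jacobians, and the cost weights `Q`, `R`, `Q_f` are arbitrary. The function names `lqr_backward_pass` and `rollout_cost` are hypothetical.

```python
import numpy as np


def lqr_backward_pass(A_list, B, Q, R, Q_f):
    """Backward Riccati recursion for e_{l+1} = A_l e_l + B u_l.

    Minimizes sum_l (e_l' Q e_l + u_l' R u_l) + e_L' Q_f e_L.
    Returns the feedback gains K_l and cost-to-go matrices P_l.
    """
    n_layers = len(A_list)
    P_list = [None] * (n_layers + 1)
    K_list = [None] * n_layers
    P_list[n_layers] = Q_f
    for l in range(n_layers - 1, -1, -1):
        A, P_next = A_list[l], P_list[l + 1]
        # K_l = (R + B' P B)^{-1} B' P A
        K = np.linalg.solve(R + B.T @ P_next @ B, B.T @ P_next @ A)
        # P_l = Q + A' P (A - B K_l)
        P = Q + A.T @ P_next @ (A - B @ K)
        P_list[l] = 0.5 * (P + P.T)  # symmetrize against round-off
        K_list[l] = K
    return K_list, P_list


def rollout_cost(A_list, B, Q, R, Q_f, e0, K_list=None):
    """Propagate the error through the layers; u_l = -K_l e_l if gains are given."""
    e, cost = e0.copy(), 0.0
    for l, A in enumerate(A_list):
        u = -K_list[l] @ e if K_list is not None else np.zeros(B.shape[1])
        cost += e @ Q @ e + u @ R @ u
        e = A @ e + B @ u
    return e, cost + e @ Q_f @ e


# Illustrative setup: random near-identity "Jacobians" (hypothetical values).
rng = np.random.default_rng(0)
d, n_layers = 8, 6
A_list = [np.eye(d) + 0.1 * rng.standard_normal((d, d)) for _ in range(n_layers)]
B = np.eye(d)
Q, R, Q_f = np.eye(d), 0.5 * np.eye(d), 10.0 * np.eye(d)
e0 = rng.standard_normal(d)  # initial deviation from the setpoint

K_list, P_list = lqr_backward_pass(A_list, B, Q, R, Q_f)
e_open, cost_open = rollout_cost(A_list, B, Q, R, Q_f, e0)
e_closed, cost_closed = rollout_cost(A_list, B, Q, R, Q_f, e0, K_list)

print("unsteered final deviation norm:", np.linalg.norm(e_open))
print("steered final deviation norm:  ", np.linalg.norm(e_closed))
print("unsteered cost:", cost_open, "steered cost:", cost_closed)
```

The ratio between `R` and `Q` sets the trade-off: a larger `R` penalizes large interventions on the activations, while a larger `Q` or `Q_f` pushes harder toward the setpoint.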

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.