How's it going? Reinforcement learning in language models recruits a functional welfare axis
Summary
Reinforcement learning (RL) significantly shapes a language model's internal representations by recruiting a pre-existing "functional welfare axis," a direction that estimates how well the system is doing relative to its goals. Researchers trained several language models in a novel, semantically neutral maze environment, then extracted concept vectors for rewarded and punished trajectories. The punishment vector consistently represented negative welfare: it promoted failure tokens, aligned with negative emotions, and induced negative self-reports and pathological behaviors. The reward vector acted as its mirror image, and the two were nearly antiparallel. These effects proved robust across various controls, including RL algorithm and model family. Crucially, the welfare axis was already effective in models before maze training and appeared in pretrain-only models, indicating that post-training recruits the axis rather than creating it.
Key takeaway
AI Scientists and Research Scientists investigating language model interpretability or alignment should recognize that reinforcement learning leverages pre-existing "functional welfare axes" within models. This shifts the focus from solely building new behavioral structures to identifying these intrinsic representations and understanding how training recruits them. Alignment strategies should therefore consider how post-training dynamics interact with these foundational welfare-like estimates, which could lead to more robust and predictable model behaviors.
Key insights
Reinforcement learning recruits a pre-existing functional welfare axis in language models, influencing behavior.
Principles
- RL recruits, rather than creates, welfare-like representations.
- Minimal reward signals can broadly affect model behavior.
- A functional welfare axis already exists in LMs before post-training.
Method
Train language models in a semantically neutral maze, extract concept vectors for rewarded/punished trajectories, then evaluate these vectors in unrelated settings.
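The vector-extraction step can be illustrated with a minimal sketch. The summary does not say how the concept vectors were computed, so this example assumes a common difference-of-means approach: average the hidden activations over one class of trajectories and subtract the average over a neutral baseline. The helper names (`concept_vector`, `make_toy_activations`) and the synthetic data are hypothetical and stand in for real model activations.

```python
import math
import random


def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_vector(target_acts, baseline_acts):
    """Difference of means (an assumed extraction method, not the paper's)."""
    target_mean = mean_vector(target_acts)
    baseline_mean = mean_vector(baseline_acts)
    return [t - b for t, b in zip(target_mean, baseline_mean)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def make_toy_activations(seed=0, dim=16, n=200):
    """Synthetic stand-in for activations: one hidden axis plus small noise."""
    rng = random.Random(seed)
    axis = [rng.gauss(0.0, 1.0) for _ in range(dim)]

    def sample(shift):
        return [[shift * a + rng.gauss(0.0, 0.1) for a in axis] for _ in range(n)]

    # Rewarded trajectories sit at +axis, punished at -axis, neutral at zero.
    return sample(1.0), sample(-1.0), sample(0.0)


rewarded, punished, neutral = make_toy_activations()
reward_vec = concept_vector(rewarded, neutral)
punish_vec = concept_vector(punished, neutral)

# A cosine near -1 means the two vectors are nearly antiparallel.
similarity = cosine(reward_vec, punish_vec)
print(f"cosine(reward, punish) = {similarity:.3f}")
```

In this toy setup the antiparallel structure is built into the data, so the result only demonstrates the measurement, not the finding. With a real model, the activation lists would come from a chosen layer while the model processes rewarded, punished, and neutral trajectories.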
In practice
- Analyze concept vectors for welfare-like representations.
- Steer language models using reward/punishment vectors.
- Investigate pre-existing axes in pretrain-only models.
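Steering with a reward or punishment vector can be sketched as adding a scaled copy of the vector to a hidden state and checking how far the state moves along that direction. This is a simplified illustration under assumptions: the function names (`steer`, `project`), the strength parameter `alpha`, and the example numbers are hypothetical, and the summary does not specify which layers or scales were used.

```python
import math


def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def project(hidden, direction):
    """Scalar projection of a hidden state onto a direction."""
    return sum(h * d for h, d in zip(hidden, unit(direction)))


def steer(hidden, direction, alpha):
    """Shift a hidden state by alpha units along the normalized direction."""
    d = unit(direction)
    return [h + alpha * x for h, x in zip(hidden, d)]


# Toy hidden state and a hypothetical punishment vector.
hidden = [0.5, -1.0, 2.0, 0.0]
punish_vec = [1.0, 0.0, -1.0, 2.0]

before = project(hidden, punish_vec)
steered = steer(hidden, punish_vec, alpha=3.0)
after = project(steered, punish_vec)

# The projection onto the steering direction grows by exactly alpha.
print(f"projection before: {before:.3f}, after: {after:.3f}")
```

In a real model, the same addition would typically be applied to a layer's activations during the forward pass, and the effect would be judged from downstream behavior such as token probabilities or self-reports. Applying the same procedure to a pretrain-only checkpoint is one way to probe whether the axis is present before post-training.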
Topics
- Reinforcement Learning
- Language Models
- Model Interpretability
- Alignment
- Internal Representations
- Functional Welfare Axis
Best for: AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.