How's it going? Reinforcement learning in language models recruits a functional welfare axis

· Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Reinforcement learning (RL) shapes a language model's internal representations by recruiting a pre-existing "functional welfare axis," an axis that estimates how well the system is doing relative to its goals. Researchers trained several language models in a novel, semantically neutral maze environment, then extracted concept vectors for rewarded and punished trajectories. The punishment vector consistently represented negative welfare: it promoted failure tokens, aligned with negative emotions, and induced negative self-reports and pathological behaviors. The reward vector acted as its mirror image, and the two were nearly antiparallel. These effects were robust across controls, including RL algorithm and model family. Crucially, the welfare axis was already effective in models before maze training and appeared in pretrain-only models, indicating that post-training recruits it rather than creates it.
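One way to read "promoting failure tokens" is that adding the vector to a hidden state raises the logits of failure-related tokens. The sketch below illustrates that reading on a toy example. The token list, the 3-dimensional unembedding rows, the two direction vectors, and the helper `promoted_tokens` are all hypothetical, and the sketch ignores any final normalization layer; it is not the procedure used in the original article.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def promoted_tokens(direction, unembedding, k=2):
    """Return the k tokens whose logits increase most along `direction`.

    The logit change for a token is the inner product of its unembedding
    row with the direction (a linear approximation, no normalization).
    """
    scores = {tok: dot(row, direction) for tok, row in unembedding.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy unembedding rows: illustrative values only, not taken from any model.
toy_unembedding = {
    "success": [1.0, 0.2, 0.0],
    "correct": [0.8, -0.1, 0.3],
    "failure": [-1.0, 0.1, 0.1],
    "wrong": [-0.7, 0.0, -0.2],
    "maze": [0.0, 1.0, 0.0],
}

# Hypothetical antiparallel directions standing in for the two concept vectors.
reward_direction = [1.0, 0.0, 0.0]
punish_direction = [-1.0, 0.0, 0.0]

top_reward = promoted_tokens(reward_direction, toy_unembedding)
top_punish = promoted_tokens(punish_direction, toy_unembedding)
print("reward direction promotes:", top_reward)
print("punishment direction promotes:", top_punish)
```

Because the two toy directions are exact negatives of each other, the tokens one promotes are the tokens the other suppresses, which is the mirror-image pattern the summary describes.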

Key takeaway

AI scientists and research scientists working on language model interpretability or alignment should recognize that reinforcement learning leverages a pre-existing "functional welfare axis" within models. This shifts the focus from solely building new behavioral structures to identifying the intrinsic representations that training recruits. Alignment strategies should therefore consider how post-training dynamics interact with these welfare-like estimates, which could lead to more robust and predictable model behaviors.

Key insights

Reinforcement learning recruits a pre-existing functional welfare axis in language models, influencing behavior.

Principles

Method

Train language models in a semantically neutral maze, extract concept vectors for rewarded/punished trajectories, then evaluate these vectors in unrelated settings.
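The extraction step can be sketched with a difference-of-means estimator, a common way to build concept vectors. The summary does not say which estimator the researchers used, so this choice is an assumption. The activations below are synthetic stand-ins generated from a hypothetical latent direction `axis`, not data from the study, and every function name is illustrative.

```python
import math
import random


def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_vector(target_acts, baseline_acts):
    """Difference of means: mean(target) - mean(baseline)."""
    target_mean = mean_vector(target_acts)
    baseline_mean = mean_vector(baseline_acts)
    return [t - b for t, b in zip(target_mean, baseline_mean)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Synthetic hidden states (assumption): Gaussian noise shifted along `axis`.
rng = random.Random(0)
dim = 16
axis = [rng.gauss(0, 1) for _ in range(dim)]


def sample(shift, n=200):
    """Draw n noisy activations displaced by `shift` along `axis`."""
    return [[shift * a + rng.gauss(0, 1) for a in axis] for _ in range(n)]


neutral = sample(0.0)    # stand-in for unrewarded trajectories
rewarded = sample(+1.0)  # stand-in for rewarded trajectories
punished = sample(-1.0)  # stand-in for punished trajectories

reward_vec = concept_vector(rewarded, neutral)
punish_vec = concept_vector(punished, neutral)
similarity = cosine(reward_vec, punish_vec)
print(f"cosine(reward, punishment) = {similarity:.3f}")
```

A cosine similarity close to -1 between the two extracted vectors is what "nearly antiparallel" means. In this toy setup the result follows by construction, so it only demonstrates the measurement, not the finding.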

Topics

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.