How's it going? Reinforcement learning in language models recruits a functional welfare axis

2026-05-28 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Reinforcement learning (RL) shapes a language model's internal representations by recruiting a pre-existing "functional welfare axis," a direction that estimates how well the system is doing relative to its goals. Researchers trained several language models in a novel, semantically neutral maze environment, then extracted concept vectors for rewarded and punished trajectories. The punishment vector consistently represented negative welfare: it promoted failure tokens, aligned with negative emotions, and induced negative self-reports and pathological behaviors. The reward vector acted as its mirror image, with the two vectors nearly antiparallel. The effects held across controls, including the choice of RL algorithm and model family. Crucially, the welfare axis was already effective in models before maze training and also appeared in pretrain-only models, indicating that post-training recruits the axis rather than creating it.

Key takeaway

AI scientists and research scientists working on language model interpretability or alignment should recognize that reinforcement learning leverages pre-existing "functional welfare axes" within models. This shifts the focus from solely building new behavioral structures to identifying the intrinsic representations that training recruits. Alignment strategies should therefore account for how post-training dynamics interact with these welfare-like estimates, which may lead to more robust and predictable model behavior.

Key insights

Reinforcement learning recruits a pre-existing functional welfare axis in language models, and that axis influences behavior in settings unrelated to the training task.

Principles

RL recruits, rather than creates, welfare-like representations.
Minimal reward signals can broadly affect model behavior.
A functional welfare axis already exists in LMs before post-training.

Method

Train language models in a semantically neutral maze, extract concept vectors for rewarded/punished trajectories, then evaluate these vectors in unrelated settings.
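
The extraction step can be illustrated with a minimal sketch. This summary does not say how the concept vectors were computed, so the difference-of-means approach below is an assumption, and the activations are synthetic stand-ins rather than real model states. The names concept_vector, cosine, fake_activations, and hidden_axis are hypothetical.

```python
import numpy as np

def concept_vector(concept_acts, baseline_acts):
    """Unit-length difference-of-means direction (assumed extraction method)."""
    direction = concept_acts.mean(axis=0) - baseline_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Synthetic stand-in for residual-stream activations: Gaussian noise plus a
# shift along one hidden direction that plays the role of the welfare axis.
rng = np.random.default_rng(0)
d_model = 64
hidden_axis = rng.normal(size=d_model)
hidden_axis /= np.linalg.norm(hidden_axis)

def fake_activations(shift, n=200):
    """n activation rows, each displaced by `shift` along hidden_axis."""
    return rng.normal(size=(n, d_model)) + shift * hidden_axis

neutral = fake_activations(0.0)
rewarded = fake_activations(+3.0)
punished = fake_activations(-3.0)

reward_vec = concept_vector(rewarded, neutral)
punish_vec = concept_vector(punished, neutral)

# A strongly negative cosine means the two vectors are close to antiparallel.
print("cosine(reward, punish):", cosine(reward_vec, punish_vec))
```

With real models, the activation arrays would come from a chosen layer on rewarded, punished, and neutral trajectories. The cosine check is one simple way to test whether two extracted vectors mirror each other.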

In practice

Analyze concept vectors for welfare-like representations.
Steer language models using reward/punishment vectors.
Investigate pre-existing axes in pretrain-only models.
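
The steering and analysis items above can be sketched as follows. This is an illustration under stated assumptions, not the researchers' procedure: steering is modeled as adding a scaled unit vector to hidden states, and the welfare-like readout as a projection onto that vector. The names steer, axis_score, and welfare_vec are hypothetical, and the hidden states are random placeholders.

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Shift every hidden state by alpha along the unit-normalized direction."""
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

def axis_score(hidden, direction):
    """Project hidden states onto the direction; one score per position."""
    unit = direction / np.linalg.norm(direction)
    return hidden @ unit

rng = np.random.default_rng(1)
seq_len, d_model = 5, 64
hidden = rng.normal(size=(seq_len, d_model))   # placeholder hidden states
welfare_vec = rng.normal(size=d_model)         # placeholder concept vector

alpha = 4.0
pushed_up = steer(hidden, welfare_vec, +alpha)    # toward the reward side
pushed_down = steer(hidden, welfare_vec, -alpha)  # toward the punishment side

# Steering by alpha moves the projection by exactly alpha at every position.
print(axis_score(pushed_up, welfare_vec) - axis_score(hidden, welfare_vec))
print(axis_score(pushed_down, welfare_vec) - axis_score(hidden, welfare_vec))
```

In a real experiment the shift would be applied inside the model's forward pass at a chosen layer, and the same projection could be computed on a pretrain-only checkpoint to look for the axis before any post-training.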

Topics

Reinforcement Learning
Language Models
Model Interpretability
Alignment
Internal Representations
Functional Welfare Axis

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.