The Value Axis: Language Models Encode Whether They're on the Right Track
Summary
The paper "The Value Axis: Language Models Encode Whether They're on the Right Track" investigates whether language models internally track the likelihood of achieving their goals. The researchers constructed a "value" axis for Qwen3-8B using synthetic in-context reinforcement learning data. Activations along this axis separate high from low verbalized confidence, rollouts with backtracking from those without, and correct from corrupted code. Steering the model toward high value causally suppresses self-correction and reduces explanatory verbosity, while steering toward low value induces backtracking and exploration. Direct preference optimization (DPO) can increase the internal value of rewarded behaviors, causing the model to act more confidently. The study also found that Qwen, post-training, assigns low value to politically sensitive chat queries, and that supervised fine-tuning increases internal confidence within its training domain. These results suggest that language models linearly encode an estimate of expected goal success that modulates their confidence.
Key takeaway
For machine learning engineers who fine-tune and steer large language models, this research identifies a "value" axis that tracks expected goal success and confidence. Steering toward higher internal value reduces verbosity and self-correction; steering toward lower value encourages backtracking and exploration. DPO can raise the internal value the model assigns to rewarded behaviors, which makes it act more confidently on them, so the axis is worth monitoring when a fine-tuning run is meant to change how confidently a model behaves.
Key insights
Language models internally track how likely their current trajectory is to reach its goal, and this estimate is linearly encoded along a "value" axis.
Principles
- LMs linearly encode expected goal success.
- Internal value modulates model confidence.
- Steering value influences exploration and self-correction.
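The steering principle can be illustrated with a toy sketch. It assumes steering means adding a scaled copy of a unit-norm direction to a hidden state; the summary does not say which layer, scale, or exact procedure the paper uses, so `steer`, `alpha`, and the example vectors below are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def steer(hidden, axis, alpha):
    """Shift a hidden state by alpha along a direction (assumed additive steering)."""
    return [h + alpha * a for h, a in zip(hidden, axis)]

# Hypothetical unit-norm "value" direction and hidden state (not from the paper).
axis = [0.6, 0.8, 0.0]
hidden = [1.0, -2.0, 0.5]

before = dot(hidden, axis)                    # projection before steering
toward_high = steer(hidden, axis, alpha=3.0)  # push toward high value
toward_low = steer(hidden, axis, alpha=-3.0)  # push toward low value

# With a unit-norm axis, the projection moves by exactly alpha.
after_high = dot(toward_high, axis)
after_low = dot(toward_low, axis)
```

In a real model the same shift would be applied to the activations of a chosen layer during generation; which layer and magnitude work is an empirical question the summary does not settle.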
Method
Construct a "value" axis for Qwen3-8B from synthetic in-context reinforcement learning data, then test whether activations along it separate high from low verbalized confidence, rollouts with and without backtracking, and correct from corrupted code.
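One common way to obtain a linear axis from labeled activations is a difference of class means. The sketch below uses that technique on synthetic vectors as a stand-in; the summary does not specify how the paper fits its axis, so the construction, the function names, and the fake data are all assumptions for illustration.

```python
import math
import random

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def value_axis(high_acts, low_acts):
    """Unit-norm difference of class means (assumed construction, not the paper's)."""
    mu_hi = mean_vector(high_acts)
    mu_lo = mean_vector(low_acts)
    diff = [a - b for a, b in zip(mu_hi, mu_lo)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def project(act, axis):
    """Scalar position of an activation along the axis."""
    return sum(a * b for a, b in zip(act, axis))

# Synthetic stand-in for hidden states: Gaussian noise plus a shift in one
# coordinate that plays the role of "high reward" vs "low reward" contexts.
rng = random.Random(0)
DIM = 16

def fake_activation(shift):
    v = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
    v[0] += shift
    return v

high_acts = [fake_activation(+2.0) for _ in range(200)]
low_acts = [fake_activation(-2.0) for _ in range(200)]

axis = value_axis(high_acts, low_acts)
mean_high = sum(project(a, axis) for a in high_acts) / len(high_acts)
mean_low = sum(project(a, axis) for a in low_acts) / len(low_acts)
```

Projecting held-out activations (for example, from rollouts with and without backtracking) onto `axis` and comparing the two groups is the kind of separation test the method describes.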
In practice
- DPO can increase the internal value of rewarded behaviors, making the model act more confidently.
- Post-training, Qwen assigns low value to politically sensitive chat queries.
- Supervised fine-tuning increases internal confidence within its training domain.
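For readers unfamiliar with DPO, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. It shows the general objective only; the value of `beta` and the example log-probabilities are illustrative assumptions, and the summary does not report the paper's training setup.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))

# Illustrative numbers only.
loss_at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)    # policy equals reference
loss_after_shift = dpo_loss(-4.0, -8.0, -5.0, -7.0)  # policy prefers chosen more
```

When the policy matches the reference the loss equals log 2, and it falls as the policy raises the chosen response's likelihood relative to the rejected one.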
Topics
- Language Models
- Internal Representations
- Model Confidence
- Reinforcement Learning
- Direct Preference Optimization
- Qwen3-8B
- Interpretability
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.