The Value Axis: Language Models Encode Whether They're on the Right Track

2026-06-15 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

The paper "The Value Axis: Language Models Encode Whether They're on the Right Track" investigates whether language models internally track how likely they are to achieve their goals. The authors construct a "value" axis for Qwen3-8B from synthetic, in-context reinforcement learning data. Activations along this axis distinguish high from low verbalized confidence, rollouts with backtracking from those without, and correct from corrupted code. Steering the model toward high value causally suppresses self-correction and reduces explanatory verbosity, while steering toward low value induces backtracking and exploration. Direct preference optimization (DPO) can increase the internal value of rewarded behaviors, causing the model to act more confidently. The study also finds that Qwen assigns low value to politically sensitive chat queries post-training, and that supervised fine-tuning increases internal confidence within its training domain. Together, these results suggest that language models linearly encode an estimate of expected goal success that modulates their confidence.

Key takeaway

For machine learning engineers who fine-tune and steer large language models, this research identifies a "value" axis that correlates with goal success and confidence. Steering toward higher internal value may reduce verbosity and self-correction, while steering toward lower value may encourage backtracking and exploration. Because DPO can raise the internal value of rewarded behaviors, consider it where you want the model to act more confidently on specific behaviors.

Key insights

Language models internally track their trajectory's likelihood of goal success via a linearly encoded "value" axis.

Principles

LMs linearly encode expected goal success.
Internal value modulates model confidence.
Steering value influences exploration and self-correction.
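As a rough illustration of what steering along such an axis means, the sketch below adds a scaled unit direction to a batch of hidden states and checks that the projection onto the axis shifts by exactly the steering strength. The function name steer, the strength alpha, and the toy vectors are illustrative assumptions; this summary does not specify which layer the paper steers, how strong the intervention is, or how it is wired into Qwen3-8B.

```python
import numpy as np

def steer(hidden, axis, alpha):
    """Shift hidden states by alpha along a direction (hypothetical helper).

    hidden: (n, d) array of activations.
    axis:   (d,) direction; normalized here so alpha is in projection units.
    alpha:  positive pushes toward "high value", negative toward "low value".
    """
    axis = np.asarray(axis, dtype=float)
    unit = axis / np.linalg.norm(axis)
    return np.asarray(hidden, dtype=float) + alpha * unit

# Toy activations and a toy direction (not taken from any real model).
hidden = np.array([[0.5, -1.0, 2.0],
                   [1.5, 0.0, -0.5]])
axis = np.array([3.0, 0.0, 4.0])
unit = axis / np.linalg.norm(axis)

up = steer(hidden, axis, alpha=2.0)     # toward higher value
down = steer(hidden, axis, alpha=-2.0)  # toward lower value

# The projection onto the axis moves by alpha for every row.
shift_up = up @ unit - hidden @ unit
shift_down = down @ unit - hidden @ unit
```

In a real model the same addition would typically be applied to the residual stream at a chosen layer during generation, for example through a forward hook; that wiring is left out here because it depends on the framework and model.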

Method

Construct a "value" axis using synthetic in-context reinforcement learning data, then analyze its correlation with verbalized confidence, backtracking, and code correctness.
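One common way to obtain a linear axis of this kind is a difference of class means over activations, sketched below on synthetic data. This is an assumption for illustration, not the paper's confirmed procedure: the summary does not say how the axis is fitted, and the names fit_value_axis and value_score, the median split, and the random "activations" are all hypothetical stand-ins for hidden states collected from in-context reinforcement learning episodes.

```python
import numpy as np

def fit_value_axis(acts, returns):
    """Unit-norm difference of mean activations, high-return minus low-return.

    acts:    (n, d) array of activations, one row per example.
    returns: (n,) array of scalar outcomes (e.g. reward) for each example.
    """
    acts = np.asarray(acts, dtype=float)
    returns = np.asarray(returns, dtype=float)
    high = returns > np.median(returns)  # split examples at the median return
    direction = acts[high].mean(axis=0) - acts[~high].mean(axis=0)
    return direction / np.linalg.norm(direction)

def value_score(acts, axis):
    """Project activations onto the axis; larger means 'higher value'."""
    return np.asarray(acts, dtype=float) @ axis

# Synthetic stand-in: returns are encoded along one hidden dimension plus noise.
rng = np.random.default_rng(0)
n, d = 200, 16
true_dir = np.zeros(d)
true_dir[0] = 1.0
returns = rng.uniform(0.0, 1.0, size=n)
acts = 0.1 * rng.normal(size=(n, d)) + np.outer(returns, true_dir)

axis = fit_value_axis(acts, returns)
scores = value_score(acts, axis)

# Correlation between projection and outcome, analogous to the analyses
# against verbalized confidence, backtracking, and code correctness.
corr = np.corrcoef(scores, returns)[0, 1]
```

The same value_score projection could then be compared across groups of held-out rollouts, such as those with and without backtracking, to test whether the axis separates them.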

In practice

DPO can increase internal value for desired behaviors.
Qwen assigns low value to politically sensitive chat queries post-training.
Supervised fine-tuning raises internal confidence within its training domain.
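For readers who want a reminder of what DPO optimizes, here is a minimal sketch of the standard DPO loss computed from sequence log-probabilities. The log-probability values and beta=0.1 are made-up illustrations; the paper's actual preference data and hyperparameters are not described in this summary.

```python
import numpy as np

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss over a batch of preference pairs.

    Each argument is an array of summed log-probabilities of a full response
    under the trained policy or the frozen reference model.
    """
    chosen_gain = np.asarray(policy_chosen, float) - np.asarray(ref_chosen, float)
    rejected_gain = np.asarray(policy_rejected, float) - np.asarray(ref_rejected, float)
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), computed stably as log(1 + exp(-margin)).
    return float(np.mean(np.logaddexp(0.0, -margin)))

# Policy identical to the reference: margin is 0, so the loss is log(2).
baseline = dpo_loss([-12.0], [-15.0], [-12.0], [-15.0])

# Policy that raised the chosen response and lowered the rejected one.
improved = dpo_loss([-10.0], [-16.0], [-12.0], [-15.0])
```

The loss falls as the policy prefers the rewarded response more strongly than the reference does; the paper's claim is that this kind of training also raises the internal value assigned to the rewarded behavior.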

Topics

Language Models
Internal Representations
Model Confidence
Reinforcement Learning
Direct Preference Optimization
Qwen3-8B
Interpretability

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.