One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models
Summary
Reward models (RMs), which are crucial for aligning language models (LMs) with human preferences, still suffer from persistent biases such as length, sycophancy, and overconfidence, even in state-of-the-art models. The research also identifies novel issues, including a bias towards model-specific "styles" and a sensitivity to answer order (position bias). To address low-complexity biases arising from spurious correlations, the authors propose a "mechanistic reward shaping" intervention that uses linear activation probes. This data-efficient, model-internal method reduces the targeted biases (length, uncertainty, position) without degrading overall reward quality and generalizes out-of-distribution, as validated on RewardBench-2. However, high-complexity biases such as sycophancy and model-style sensitivity remain resistant to simple linear interventions, indicating a need for more sophisticated solutions.
Key takeaway
Reward models (RMs) for LLM alignment exhibit persistent biases (length, sycophancy, overconfidence) as well as newly identified ones (model-specific styles, answer order). A mechanistic reward shaping approach, using linear activation probes constructed via difference-of-means, mitigates low-complexity biases such as length, uncertainty, and position across five RMs, including state-of-the-art models, without degrading RewardBench-2 accuracy. This data-efficient, model-internal intervention offers a practical post-hoc way to improve RM robustness, though complex biases such as sycophancy remain challenging.
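To make the difference-of-means idea concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation: the activations are synthetic, the function names `difference_of_means_direction` and `project_out` are hypothetical, and removing the probe direction by orthogonal projection is an assumption, since the summary does not say how the probe is applied to the reward model.

```python
import numpy as np


def difference_of_means_direction(biased_acts, neutral_acts):
    """Unit vector pointing from the neutral-class mean to the biased-class mean."""
    direction = biased_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)


def project_out(acts, direction):
    """Remove the component of each activation row along a unit `direction`.

    Assumption: projection is one plausible way to use the probe; the
    summarized paper may apply it differently.
    """
    return acts - np.outer(acts @ direction, direction)


# Synthetic stand-ins for hidden activations (not real reward-model data).
rng = np.random.default_rng(0)
hidden_dim = 16
true_bias = np.zeros(hidden_dim)
true_bias[0] = 1.0  # the toy "bias feature" lives on axis 0

neutral = rng.normal(size=(200, hidden_dim))
biased = rng.normal(size=(200, hidden_dim)) + 3.0 * true_bias

probe = difference_of_means_direction(biased, neutral)
cleaned = project_out(biased, probe)

# Largest remaining component along the probe direction after projection.
residual = np.abs(cleaned @ probe).max()
```

In this toy setup the probe needs only the two class means, which is one way to read the "data-efficient" claim: no gradient training is involved, just an average over a small set of contrasting examples.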
Topics
- Reward Models
- Language Model Biases
- Reinforcement Learning from Human Feedback
- Linear Probes
- Mechanistic Interpretability
Best for: Research Scientist, AI Researcher, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.