One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models

Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

Reward models (RMs), which are central to aligning language models (LMs) with human preferences, still suffer from persistent biases toward length, sycophancy, and overconfidence, even in state-of-the-art models. The research also identifies two new issues: a bias toward model-specific "styles" and a sensitivity to answer order (position bias). To address low-complexity biases that arise from spurious correlations, the authors propose "mechanistic reward shaping", an intervention built on linear activation probes. This data-efficient, model-internal method reduces the targeted biases (length, uncertainty, position) without degrading overall reward quality, and it generalizes out of distribution, as validated on RewardBench-2. High-complexity biases such as sycophancy and model-style sensitivity, however, resist simple linear interventions, which indicates a need for more sophisticated solutions.

Key takeaway

Reward models (RMs) for LLM alignment exhibit persistent biases (length, sycophancy, overconfidence) as well as newly identified ones (model-specific styles, answer order). Mechanistic reward shaping, which uses linear activation probes constructed via difference-of-means, mitigates low-complexity biases such as length, uncertainty, and position across five RMs, including state-of-the-art models, without degrading RewardBench-2 accuracy. This data-efficient, model-internal intervention offers a practical post-hoc way to improve RM robustness, though complex biases such as sycophancy remain challenging.
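
For readers unfamiliar with difference-of-means probes, the sketch below illustrates the general idea on synthetic data. It is not the authors' implementation: the summary does not say how the probe direction is applied to the reward, so removing the direction by orthogonal projection is an assumption, as are the function names `difference_of_means_direction` and `project_out`, the 16-dimensional activations, and the synthetic "biased" and "neutral" groups.

```python
import numpy as np


def difference_of_means_direction(acts_biased, acts_neutral):
    """Unit vector pointing from the neutral-group mean to the biased-group mean."""
    direction = acts_biased.mean(axis=0) - acts_neutral.mean(axis=0)
    norm = np.linalg.norm(direction)
    if norm == 0:
        raise ValueError("group means coincide; no direction to probe")
    return direction / norm


def project_out(acts, direction):
    """Remove each activation's component along a unit-norm direction.

    Assumption: projection is one plausible way to use the probe; the
    summarized paper may shape the reward differently.
    """
    return acts - np.outer(acts @ direction, direction)


# Synthetic stand-ins for hidden activations (hypothetical data, not RM outputs).
rng = np.random.default_rng(0)
dim = 16
true_axis = np.zeros(dim)
true_axis[0] = 1.0  # the planted "bias" axis

acts_neutral = rng.normal(size=(200, dim))
acts_biased = rng.normal(size=(200, dim)) + 3.0 * true_axis

# Estimate the probe direction from the two groups.
direction = difference_of_means_direction(acts_biased, acts_neutral)

# Apply the same projection to both groups.
shaped_biased = project_out(acts_biased, direction)
shaped_neutral = project_out(acts_neutral, direction)

# Gap between group means along the probe direction, before and after.
gap_before = float((acts_biased.mean(axis=0) - acts_neutral.mean(axis=0)) @ direction)
gap_after = float((shaped_biased.mean(axis=0) - shaped_neutral.mean(axis=0)) @ direction)
```

In this toy setting the two groups differ along one planted axis, and after projection their means no longer differ along the estimated direction. Only a small number of labeled examples per group is needed to estimate the direction, which is consistent with the data-efficiency the summary attributes to the method. A single linear direction also illustrates why such an intervention would plausibly struggle with biases that are not captured by one axis.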

Best for: Research Scientist, AI Researcher, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.