Teaching to the Test: Why Reward Models Learn the Dataset, Not the Values
Summary
A June 12, 2026 study by NUS, VinUniversity, and NTU, titled "When In-Distribution Gains Fail," found that weak-to-strong reward models do not reliably generalize. The models learn dataset quirks rather than general preferences, a failure the study calls "representation drift." To address it, the authors introduced "Representation Anchoring," which penalizes feature drift from the pretrained model during training and adds no inference cost. The study used RAIL, Anthropic Harmless, and PKU-SafeRLHF as harmlessness benchmarks, and RAIL proved a challenging transfer target. Only the anchored model achieved positive transfer gains in the difficult settings. For example, a Llama-3.1-8B student trained on Anthropic Harmless transferred positively to RAIL.
Key takeaway
For machine learning engineers evaluating reward models, in-distribution performance alone is not enough. Always evaluate your reward models on at least one independent, held-out dataset in the same category, and use metrics such as Net Transfer Score that penalize source-domain regression. A values-grounded dataset such as RAIL gives you an independent axis for checking whether the model has learned the intended values rather than stylistic patterns. Together, these checks help confirm that a model generalizes, which in-distribution scores cannot do on their own.
Key insights
Reward models learn dataset specifics rather than general values, which hinders out-of-distribution transfer unless their representations are anchored to the pretrained model.
Principles
- In-distribution gains do not predict out-of-distribution (OOD) transfer.
- Representation drift causes poor generalization.
- Values-grounded datasets test true alignment.
Method
Representation Anchoring penalizes feature drift from a frozen pretrained model during training. This preserves general preference features, improving out-of-distribution transfer without added inference cost.
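The idea of penalizing drift can be sketched as a task loss plus a weighted distance to the frozen model's features. This is a minimal illustration, not the study's implementation: the pairwise (Bradley-Terry style) task loss, the mean-squared-error form of the penalty, and the weight `lam` are all assumptions made here, and every function name is hypothetical.

```python
import math
from typing import Sequence


def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Preference loss -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    margin = r_chosen - r_rejected
    # softplus(-margin) written to avoid overflow for large |margin|
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def drift_penalty(features: Sequence[float], anchor: Sequence[float]) -> float:
    """Mean squared distance between the trained model's features and the
    features of the frozen pretrained model for the same input (assumed form)."""
    if len(features) != len(anchor):
        raise ValueError("feature vectors must have the same length")
    return sum((f - a) ** 2 for f, a in zip(features, anchor)) / len(features)


def anchored_loss(
    r_chosen: float,
    r_rejected: float,
    features: Sequence[float],
    anchor: Sequence[float],
    lam: float = 0.1,  # hypothetical penalty weight, not taken from the study
) -> float:
    """Task loss plus a weighted penalty on drift away from the frozen model."""
    return pairwise_loss(r_chosen, r_rejected) + lam * drift_penalty(features, anchor)
```

Because the frozen model is consulted only while computing the training loss, a scheme of this shape leaves the deployed reward model unchanged, which is consistent with the claim of no added inference cost.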
In practice
- Evaluate reward models on held-out datasets.
- Use metrics penalizing source-domain regression.
- Incorporate values-grounded sets like RAIL.
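The checklist above might be wired up roughly as follows. This is a sketch under stated assumptions: `pairwise_accuracy` assumes each dataset is a list of (chosen, rejected) text pairs, and `penalized_transfer` is an illustrative stand-in for a metric that penalizes source-domain regression. It is not the study's Net Transfer Score, whose exact definition is not given here, and all names and the `penalty` weight are hypothetical.

```python
from typing import Callable, Sequence, Tuple

Pair = Tuple[str, str]  # (chosen response, rejected response)


def pairwise_accuracy(score: Callable[[str], float], pairs: Sequence[Pair]) -> float:
    """Fraction of pairs where the reward model scores the chosen response higher."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    correct = sum(1 for chosen, rejected in pairs if score(chosen) > score(rejected))
    return correct / len(pairs)


def penalized_transfer(
    source_before: float,
    source_after: float,
    target_before: float,
    target_after: float,
    penalty: float = 1.0,  # hypothetical weight on source-domain regression
) -> float:
    """Held-out gain minus a penalty for any accuracy lost on the source domain.

    Illustrative only: an assumed formula, not the study's Net Transfer Score.
    """
    target_gain = target_after - target_before
    source_regression = max(0.0, source_before - source_after)
    return target_gain - penalty * source_regression
```

In use, you would compute `pairwise_accuracy` for the model before and after training, once on the source dataset and once on a held-out set such as RAIL, then pass the four numbers to `penalized_transfer`. A model that gains on the held-out set only by giving up source-domain accuracy scores no better than one that did not move at all.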
Topics
- Weak-to-Strong Generalization
- Reward Models
- Representation Anchoring
- RAIL Dataset
- Out-of-Distribution Transfer
- AI Alignment
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards AI - Medium.