Teaching to the Test: Why Reward Models Learn the Dataset, Not the Values

2026-06-21 · Source: Towards AI - Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, AI Safety & Alignment · Depth: Expert, medium

Summary

A June 12, 2026 study by NUS, VinUniversity, and NTU, titled "When In-Distribution Gains Fail," found that weak-to-strong reward models do not reliably generalize. Models learn dataset quirks, not general preferences; this is called "representation drift." To address this, authors introduced "Representation Anchoring." This method penalizes feature drift from the pretrained model during training, adding no inference cost. The study used RAIL, Anthropic Harmless, and PKU-SafeRLHF as harmlessness benchmarks. RAIL proved a challenging target. Only the anchored model achieved positive transfer gains in difficult settings. For example, a Llama-3.1-8B student trained on Anthropic Harmless transferred positively to RAIL.

Key takeaway

For Machine Learning Engineers evaluating reward models, relying solely on in-distribution performance is insufficient. You should always evaluate your reward models on at least one independent, held-out dataset in the same category. Use metrics like Net Transfer Score that penalize source-domain regression. Incorporating a values-grounded dataset, such as RAIL, provides a robust, independent axis to verify true alignment beyond stylistic patterns. This approach ensures your models generalize effectively and align with intended values.

Key insights

Reward models learn dataset specifics, not general values, hindering out-of-distribution transfer unless representation drift is anchored.

Principles

In-distribution gains do not predict OOD transfer.
Representation drift causes poor generalization.
Values-grounded datasets test true alignment.

Method

Representation Anchoring penalizes feature drift from a frozen pretrained model during training. This preserves general preference features, improving out-of-distribution transfer without added inference cost.

In practice

Evaluate reward models on held-out datasets.
Use metrics penalizing source-domain regression.
Incorporate values-grounded sets like RAIL.

Topics

Weak-to-Strong Generalization
Reward Models
Representation Anchoring
RAIL Dataset
Out-of-Distribution Transfer
AI Alignment

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Ethicist

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards AI - Medium.