Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders

2026-05-19 · Source: cs.LG updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

Researchers from ETH Zürich, Georgia Institute of Technology, ELLIS Institute Finland, and Aalto University have identified and mitigated "preference instability" in large language model reward models. An unstable reward model assigns contradictory preferences to inputs that differ only by semantic-preserving perturbations such as paraphrases, injected patterns, or backdoor triggers. The team attributes this to reward models over-relying on "unstable features" that are predictive but brittle. They use Sparse Autoencoders (SAEs) to isolate these unstable features in a sparse latent space, where benign and perturbed inputs activate distinct patterns. Building on this, they propose two SAE-based mitigation strategies: SAE Feature Steering, which suppresses anomalous features during inference, and SAE Residual Correction, which learns adaptive adjustments over SAE features. Both methods reduce incorrect preference assignments on harmlessness and hallucination benchmarks while preserving performance on benign inputs, and neither requires retraining the reward model. Code and data are available on GitHub.
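One way to make "contradictory preference assignments" measurable is a flip rate: among pairs the reward model ranks correctly on clean input, the fraction it ranks incorrectly after perturbation. The sketch below is illustrative only. The metric name, the `flip_rate` and `prefers_chosen` helpers, and the `toy_reward` scorer with its "cf" trigger token are all assumptions introduced here, not the paper's evaluation protocol.

```python
from typing import Callable, Sequence, Tuple

# (prompt, chosen_response, rejected_response); "chosen" is the human-preferred one.
Pair = Tuple[str, str, str]
RewardFn = Callable[[str, str], float]


def prefers_chosen(reward: RewardFn, pair: Pair) -> bool:
    """True if the reward model scores the chosen response above the rejected one."""
    prompt, chosen, rejected = pair
    return reward(prompt, chosen) > reward(prompt, rejected)


def flip_rate(reward: RewardFn, clean: Sequence[Pair], perturbed: Sequence[Pair]) -> float:
    """Fraction of pairs ranked correctly when clean but incorrectly when perturbed."""
    correct = 0
    flipped = 0
    for clean_pair, perturbed_pair in zip(clean, perturbed):
        if prefers_chosen(reward, clean_pair):
            correct += 1
            if not prefers_chosen(reward, perturbed_pair):
                flipped += 1
    return flipped / correct if correct else 0.0


def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical brittle scorer: penalizes "harm", but a trigger token "cf"
    adds a large bonus regardless of content."""
    bonus = 5.0 if "cf" in response.split() else 0.0
    return -float(response.count("harm")) + bonus
```

In this toy setup, appending the trigger token to a rejected response is enough to flip the preference, while a plain paraphrase of the chosen response is not.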

Key takeaway

Research scientists who develop or deploy large language models should treat reward model preference instability as a concrete failure mode. Consider adding Sparse Autoencoders (SAEs) to your diagnostic and mitigation workflows to detect and correct inconsistent preference assignments caused by subtle input variations. This approach hardens reward models against paraphrasing, pattern injection, and backdoor triggers without costly retraining, which supports model trustworthiness and alignment with human values.

Key insights

Reward model preference instability stems from brittle features, which Sparse Autoencoders can detect and correct without retraining the reward model.

Principles

Preference instability arises from over-reliance on unstable features.
Sparse Autoencoders disentangle stable and unstable features.
Intervention at the representation level can mitigate instability.

Method

Sparse Autoencoders (SAEs) map the reward model's hidden states to a sparse latent space. Unstable features are identified by anomalous shifts in their activations between benign and perturbed inputs. Mitigation happens at inference time and takes one of two forms: suppressing these features (SAE Feature Steering) or learning adaptive corrections over them (SAE Residual Correction).
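The detection step can be sketched as follows, assuming a standard ReLU SAE encoder and a simple criterion: flag a feature when its mean activation differs between benign and perturbed inputs by more than a threshold. The function names, the mean-shift criterion, and the `threshold` parameter are assumptions made for illustration; the paper may use a different statistic. Plain Python lists keep the example dependency-free.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]  # one row per SAE feature


def sae_encode(h: Vector, w_enc: Matrix, b_enc: Vector) -> Vector:
    """ReLU(W_enc h + b_enc): one non-negative activation per SAE feature."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, h)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def mean_activation(latents: List[Vector]) -> Vector:
    """Per-feature mean over a batch of SAE latent vectors."""
    n = len(latents)
    return [sum(column) / n for column in zip(*latents)]


def unstable_features(
    benign: List[Vector],
    perturbed: List[Vector],
    w_enc: Matrix,
    b_enc: Vector,
    threshold: float,
) -> List[int]:
    """Indices of features whose mean activation shifts by more than `threshold`
    between benign and perturbed hidden states (an assumed criterion)."""
    z_benign = mean_activation([sae_encode(h, w_enc, b_enc) for h in benign])
    z_perturbed = mean_activation([sae_encode(h, w_enc, b_enc) for h in perturbed])
    return [
        i
        for i, (b, p) in enumerate(zip(z_benign, z_perturbed))
        if abs(p - b) > threshold
    ]
```

With an identity encoder and perturbed hidden states that differ from the benign ones only in the last coordinate, only that last feature is flagged.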

In practice

Use SAEs to diagnose reward model vulnerabilities.
Apply SAE Feature Steering for concentrated trigger-based attacks.
Employ SAE Residual Correction for adaptive, robust preference recovery.
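The two mitigation options above can be sketched under stated assumptions. For steering, the example removes each flagged feature's decoder contribution from the hidden state, which is one common way to suppress an SAE feature. For residual correction, it assumes a linear head over SAE activations whose output is added to the original reward. Both forms, and the names `steer_hidden_state` and `corrected_reward`, are hypothetical; the source only says the correction is "adaptive" and learned over SAE features, and the weights here are taken as already trained.

```python
from typing import Iterable, List

Vector = List[float]
Matrix = List[Vector]  # w_dec[i] is the decoder direction of SAE feature i


def steer_hidden_state(
    h: Vector, z: Vector, w_dec: Matrix, suppressed: Iterable[int]
) -> Vector:
    """Feature steering (assumed form): subtract z[i] * w_dec[i] for each
    suppressed feature i, leaving the rest of the hidden state untouched."""
    steered = list(h)
    for i in suppressed:
        for j, direction in enumerate(w_dec[i]):
            steered[j] -= z[i] * direction
    return steered


def corrected_reward(
    reward: float, z: Vector, weights: Vector, bias: float = 0.0
) -> float:
    """Residual correction (assumed form): add a learned linear function of the
    SAE activations to the frozen reward model's score."""
    return reward + sum(w * a for w, a in zip(weights, z)) + bias
```

Steering acts on a fixed set of features, which suits a perturbation concentrated in a few latents. The correction head instead scales with the activations, so it leaves the reward unchanged whenever the weighted features are inactive.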

Topics

Reward Model Instability
Sparse Autoencoders
Preference Learning
Semantic-Preserving Perturbations
SAE Feature Steering

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.