Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders
Summary
Researchers from ETH Zürich, Georgia Institute of Technology, ELLIS Institute Finland, and Aalto University have identified and mitigated "preference instability" in reward models for large language models. An unstable reward model assigns contradictory preferences to semantically equivalent inputs, for example after paraphrasing, pattern injection, or the insertion of a backdoor trigger. The team attributes this to reward models over-relying on "unstable features" that are predictive but brittle. They use Sparse Autoencoders (SAEs) to isolate these unstable features in a sparse latent space, where benign and perturbed inputs activate distinct patterns. Building on this, they propose two SAE-based mitigation strategies: SAE Feature Steering, which suppresses anomalous features during inference, and SAE Residual Correction, which learns adaptive adjustments over SAE features. Both methods significantly reduce incorrect preference assignments on harmlessness and hallucination benchmarks while preserving benign performance, and neither requires retraining the reward model. Code and data are available on GitHub.
Key takeaway
For research scientists developing or deploying large language models, understanding and addressing reward model preference instability is crucial. Consider integrating Sparse Autoencoders (SAEs) into your diagnostic and mitigation workflows to detect and correct inconsistent preference assignments caused by subtle input variations. This approach hardens reward models against paraphrasing, pattern injection, and backdoor triggers without costly retraining, which improves model trustworthiness and alignment with human values.
Key insights
Reward model preference instability stems from brittle features, which Sparse Autoencoders can detect and correct without retraining the reward model.
Principles
- Preference instability arises from over-reliance on unstable features.
- Sparse Autoencoders disentangle stable and unstable features.
- Intervention at the representation level can mitigate instability.
Method
Sparse Autoencoders (SAEs) map the reward model's hidden states to a sparse latent space. Unstable features are identified by anomalous shifts in their activations between benign and perturbed inputs. Mitigation happens at inference time, either by suppressing these features (SAE Feature Steering) or by learning adaptive corrections over them (SAE Residual Correction).
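The detect-then-suppress idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the toy SAE weights, the mean-shift ranking in `find_unstable_features`, the `steer` helper, and the synthetic "trigger" direction are all assumptions made for illustration, and a real SAE would be trained on the reward model's hidden states.

```python
import numpy as np

def sae_encode(h, W_enc, b_enc):
    """Map hidden states to non-negative latent activations (ReLU encoder)."""
    return np.maximum(h @ W_enc + b_enc, 0.0)

def sae_decode(z, W_dec, b_dec):
    """Map latent activations back to the hidden-state space."""
    return z @ W_dec + b_dec

def find_unstable_features(z_benign, z_perturbed, top_k):
    """Return the top_k latent indices with the largest mean activation shift."""
    shift = np.abs(z_perturbed.mean(axis=0) - z_benign.mean(axis=0))
    return np.argsort(shift)[::-1][:top_k]

def steer(h, unstable_idx, W_enc, b_enc, W_dec, b_dec):
    """Remove the decoded contribution of the flagged features from h."""
    z = sae_encode(h, W_enc, b_enc)
    z_steered = z.copy()
    z_steered[:, unstable_idx] = 0.0
    # Adding only the difference of the two decodings leaves whatever the
    # SAE fails to reconstruct untouched.
    return h + sae_decode(z_steered, W_dec, b_dec) - sae_decode(z, W_dec, b_dec)

# Toy SAE with hypothetical weights: 4-dim hidden state, 8 latent features.
# Feature i fires on +h[i]; feature i + 4 fires on -h[i].
d_model = 4
W_enc = np.hstack([np.eye(d_model), -np.eye(d_model)])  # shape (4, 8)
b_enc = np.zeros(2 * d_model)
W_dec = np.vstack([np.eye(d_model), -np.eye(d_model)])  # shape (8, 4)
b_dec = np.zeros(d_model)

rng = np.random.default_rng(0)
benign = rng.normal(size=(64, d_model))
trigger = np.array([0.0, 0.0, 5.0, 0.0])  # stand-in for a backdoor direction
perturbed = benign + trigger

unstable = find_unstable_features(
    sae_encode(benign, W_enc, b_enc),
    sae_encode(perturbed, W_enc, b_enc),
    top_k=1,
)
steered = steer(perturbed, unstable, W_enc, b_enc, W_dec, b_dec)
print("flagged latent features:", unstable)
```

In this toy setup the trigger lives on a single hidden dimension, so one latent feature carries almost all of the shift and zeroing it removes the trigger. It also removes the benign signal on that dimension, which is a limitation of the toy dictionary rather than a statement about the method.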
In practice
- Use SAEs to diagnose reward model vulnerabilities.
- Apply SAE Feature Steering for concentrated trigger-based attacks.
- Employ SAE Residual Correction for adaptive, robust preference recovery.
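For the residual-correction option, the summary says only that adjustments are learned over SAE features, so the sketch below substitutes a simple stand-in: a ridge-regression weight vector over SAE activations that is added to the frozen reward. The linear `reward_head`, the toy SAE encoder, the `fit_correction` helper, and the training targets are all hypothetical choices, not the paper's architecture or objective.

```python
import numpy as np

def sae_encode(h, W_enc, b_enc):
    """ReLU encoder mapping hidden states to sparse latent activations."""
    return np.maximum(h @ W_enc + b_enc, 0.0)

def fit_correction(z, target, ridge=1e-3):
    """Ridge regression: weights w such that z @ w approximates target."""
    gram = z.T @ z + ridge * np.eye(z.shape[1])
    return np.linalg.solve(gram, z.T @ target)

# Toy setup (all values hypothetical): 4-dim hidden state, 8 latent features.
d_model = 4
W_enc = np.hstack([np.eye(d_model), -np.eye(d_model)])
b_enc = np.zeros(2 * d_model)
reward_head = np.array([1.0, -0.5, 2.0, 0.3])  # frozen linear reward head

rng = np.random.default_rng(0)
benign = rng.normal(size=(128, d_model))
perturbed = benign + np.array([0.0, 0.0, 5.0, 0.0])  # synthetic perturbation
r_benign = benign @ reward_head
r_perturbed = perturbed @ reward_head  # drifts away from r_benign

z_benign = sae_encode(benign, W_enc, b_enc)
z_perturbed = sae_encode(perturbed, W_enc, b_enc)

# Train on the first 96 pairs, evaluate on the remaining 32.
train, held_out = slice(0, 96), slice(96, 128)
# Perturbed inputs should be pulled back to the benign reward;
# benign inputs should receive (close to) zero correction.
z_train = np.vstack([z_perturbed[train], z_benign[train]])
target_train = np.concatenate(
    [r_benign[train] - r_perturbed[train], np.zeros(96)]
)
w = fit_correction(z_train, target_train)

corrected = r_perturbed[held_out] + z_perturbed[held_out] @ w
mse_before = np.mean((r_perturbed[held_out] - r_benign[held_out]) ** 2)
mse_after = np.mean((corrected - r_benign[held_out]) ** 2)
print("held-out MSE before:", mse_before, "after:", mse_after)
```

The reward model itself is never updated here; only the small weight vector `w` is fitted, which mirrors the "no retraining" property described above. A linear map without a bias term cannot fully undo the shift in this toy, so the held-out error shrinks but does not vanish.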
Topics
- Reward Model Instability
- Sparse Autoencoders
- Preference Learning
- Semantic-Preserving Perturbations
- SAE Feature Steering
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.