reward-lens: A Mechanistic Interpretability Library for Reward Models
Summary
reward-lens is an open-source Python library designed to extend mechanistic interpretability tools, previously built for generative Large Language Models (LLMs), to reward models. It addresses a structural gap where existing tools terminate in a vocabulary unembedding, which reward models replace with a scalar regression head. The library unifies interpretability primitives around the reward head's weight vector, $w_{r}$, as the central axis for decomposition. It includes a Reward Lens, component attribution, contrastive activation patching, a reward-hacking probe suite, and TopK sparse-autoencoder feature attribution. Additionally, reward-lens integrates five extensions based on recent alignment theory: a distortion index, divergence-aware patching, a misalignment cascade detector, a reward-term conflict analyser, and concept-vector analysis. The framework was validated on two production reward models, Skywork-Reward-Llama-3.1-8B-v0.2 and ArmoRM-Llama3-8B-v0.1, across approximately 695 RewardBench preference pairs per model. A key empirical finding is that linear attribution does not reliably predict causal importance, with Spearman correlations of -0.256 on Skywork and -0.027 on ArmoRM.
Key takeaway
Research scientists developing or evaluating RLHF systems can use reward-lens to gain mechanistic insight into reward model behavior. The library lets you compare observational attribution with causal patching effects; the reported Spearman correlations between the two are weak to negative, so attribution scores should not be read as causal importance. Its theory-grounded extensions help identify reward hacking vulnerabilities and misalignment risks before they are amplified in downstream policies.
Key insights
reward-lens adapts mechanistic interpretability to reward models by centering analysis on the reward head's weight vector $w_{r}$.
Principles
- Reward models are the most safety-critical component in RLHF pipelines.
- Observational decomposition and causal intervention should be distinct and comparable.
- The reward head's weight vector $w_{r}$ is the natural axis for reward model interpretability.
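One way to make observational and causal scores comparable is a rank correlation over components, which is the kind of statistic the summary reports. The sketch below computes a Spearman correlation in plain Python; the two score lists are made-up illustrative numbers, not results from the paper, and the function names are hypothetical rather than part of the reward-lens API.

```python
# Spearman rank correlation between two per-component score lists.
# Illustrative only: the data are invented and the names are not from reward-lens.

def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical per-component scores (absolute attribution vs. patching effect).
attribution = [0.9, 0.1, 0.5, 0.3, 0.7]
patch_effect = [0.2, 0.4, 0.3, 0.8, 0.1]
rho = spearman(attribution, patch_effect)
print(f"Spearman rho = {rho:.3f}")
```

A value near zero or below zero, as in this invented example, means that ranking components by attribution tells you little about which ones matter most under intervention.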
Method
For the Reward Lens, the library projects intermediate residual streams onto $w_{r}$; for attribution, it decomposes each component's contribution along $w_{r}$; and for contrastive patching, it swaps activations between contrasting inputs, all while detecting out-of-distribution effects.
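The projection and decomposition steps can be illustrated on a toy residual stream. The sketch below is a simplified assumption, not the reward-lens API: it treats the reward as a plain dot product of the final residual stream with $w_{r}$ and ignores any final normalization layer, which a real model would need to account for. All names and numbers are hypothetical.

```python
# Toy sketch: read a reward off partial residual streams by projecting onto w_r.
# Assumes reward = w_r . residual + bias with no final normalization (a simplification).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def reward_lens(component_outputs, w_r, bias=0.0):
    """Return (lens readout after each component, per-component attribution).

    component_outputs: vectors written into the residual stream, in order
    (for example embedding, attention block, MLP block).
    """
    residual = [0.0] * len(w_r)
    lens, attribution = [], []
    for out in component_outputs:
        residual = [r + o for r, o in zip(residual, out)]
        lens.append(dot(residual, w_r) + bias)  # reward implied by the stream so far
        attribution.append(dot(out, w_r))       # this component's direct linear share
    return lens, attribution

# Hypothetical reward direction and component outputs.
w_r = [1.0, -2.0, 0.5]
outputs = [
    [0.5, 0.0, 1.0],
    [1.0, 0.5, 0.0],
    [-0.5, -1.0, 2.0],
]
lens, attribution = reward_lens(outputs, w_r)
print("lens readout per depth:", lens)
print("direct attribution per component:", attribution)
```

Because the residual stream is a sum of component outputs, the attributions add up to the final lens readout (minus the bias) under this linear assumption. That additivity is what makes $w_{r}$ a convenient axis for decomposition.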
In practice
- Use Reward Lens to trace preference formation depth.
- Employ activation patching to assess causal importance of components.
- Scan for reward hacking biases using the Hacking Detector.
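To see why patching can disagree with attribution, consider a minimal sketch of contrastive activation patching on a toy three-component model. Everything here is an assumption for illustration: the model, the inputs, and the function names are hypothetical and do not come from reward-lens. Each component's output from a "chosen" run is swapped into a "rejected" run, downstream components are recomputed, and the change in reward is recorded.

```python
# Toy contrastive activation patching. Not the reward-lens API; all names are illustrative.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def relu(v):
    return [max(0.0, a) for a in v]

def run(x, w_r, patch=None):
    """Forward pass; `patch` maps a component index to a replacement output."""
    patch = patch or {}
    c0 = patch.get(0, list(x))                          # embedding
    c1 = patch.get(1, relu([a - 0.5 for a in c0]))      # reads c0
    stream = add(c0, c1)
    c2 = patch.get(2, relu([2.0 * a for a in stream]))  # reads c0 + c1
    reward = dot(add(stream, c2), w_r)
    return reward, [c0, c1, c2]

w_r = [1.0, -1.0]
chosen, rejected = [1.0, 0.0], [0.0, 1.0]

r_chosen, acts_chosen = run(chosen, w_r)
r_rejected, acts_rejected = run(rejected, w_r)

# Observational: difference in each component's direct projection onto w_r.
attribution = [dot(a, w_r) - dot(b, w_r)
               for a, b in zip(acts_chosen, acts_rejected)]

# Causal: patch one chosen activation into the rejected run, recompute downstream.
patch_effect = [run(rejected, w_r, patch={i: acts_chosen[i]})[0] - r_rejected
                for i in range(3)]

print("attribution difference:", attribution)
print("patching effect:       ", patch_effect)
```

In this toy, the last component has the largest direct attribution, but patching the embedding moves the reward most because it also changes everything computed downstream of it. Direct attribution cannot see such indirect paths, which is one reason the two rankings can diverge.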
Topics
- reward-lens Library
- Reward Models
- Mechanistic Interpretability
- RLHF
- Activation Patching
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.