reward-lens: A Mechanistic Interpretability Library for Reward Models

Source: cs.LG updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, AI Alignment, Mechanistic Interpretability) · Depth: Expert, extended

Summary

reward-lens is an open-source Python library that extends mechanistic interpretability tools, originally built for generative Large Language Models (LLMs), to reward models. It addresses a structural gap: existing tools terminate in a vocabulary unembedding, which reward models replace with a scalar regression head. The library unifies its interpretability primitives around the reward head's weight vector, $w_{r}$, as the central axis for decomposition. It includes a Reward Lens, component attribution, contrastive activation patching, a reward-hacking probe suite, and TopK sparse-autoencoder feature attribution. reward-lens also integrates five extensions based on recent alignment theory: a distortion index, divergence-aware patching, a misalignment cascade detector, a reward-term conflict analyser, and concept-vector analysis. The framework was validated on two production reward models, Skywork-Reward-Llama-3.1-8B-v0.2 and ArmoRM-Llama3-8B-v0.1, across approximately 695 RewardBench preference pairs per model. A key empirical finding is that linear attribution does not reliably predict causal importance: the Spearman correlation between attribution and patching effects is -0.256 on Skywork and -0.027 on ArmoRM.

Key takeaway

If you develop or evaluate RLHF systems, integrate reward-lens into your workflow to gain mechanistic insight into reward model behavior. The library lets you compare observational attribution against causal patching effects, which the reported results show to be weakly negatively correlated or essentially uncorrelated (Spearman -0.256 on Skywork, -0.027 on ArmoRM), so attribution alone can misrepresent component importance. Use its theory-grounded extensions to identify potential reward hacking vulnerabilities and misalignment risks before downstream policies amplify them.
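The comparison between attribution and patching comes down to a rank correlation over per-component scores. The following is a minimal sketch of that check in plain NumPy, not the library's API. The helper names (`average_ranks`, `spearman`) and the score values are hypothetical and chosen only for illustration; they are not figures from the paper.

```python
import numpy as np

def average_ranks(x):
    """Rank values from 1..n, giving tied values their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    # Average the ranks within each group of tied values.
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(a, b):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    return float(np.corrcoef(ra, rb)[0, 1])

# Hypothetical per-component scores (illustrative numbers only).
attribution = [0.90, 0.10, 0.40, 0.70, 0.20]   # observational: contribution along w_r
patch_effect = [0.05, 0.30, 0.60, 0.10, 0.50]  # causal: reward change when patched

rho = spearman(attribution, patch_effect)
print(f"Spearman rho = {rho:.3f}")
```

A value near zero or below zero, as reported for both models, means that ranking components by attribution tells you little about which ones change the reward when intervened on.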

Key insights

reward-lens adapts mechanistic interpretability for reward models by centering analysis on the reward head's weight vector $w_{r}$.
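One way to see why $w_{r}$ can serve as a common axis is the following identity, written under simplifying assumptions that are not taken from the paper: the head reads the final residual stream $h_{L}$ directly (any final normalization layer is ignored), and $h_{L}$ is the sum of the input embedding $h_{0}$ and the additive outputs $c_{l}$ of $L$ components. The symbols $h_{l}$, $c_{l}$, and $b$ are introduced here for illustration.

```latex
\begin{align}
r(x) &= w_{r}^{\top} h_{L}(x) + b, \\
h_{L}(x) &= h_{0}(x) + \sum_{l=1}^{L} c_{l}(x), \\
r(x) &= w_{r}^{\top} h_{0}(x) + \sum_{l=1}^{L} w_{r}^{\top} c_{l}(x) + b .
\end{align}
```

Under these assumptions each term $w_{r}^{\top} c_{l}(x)$ is a component's direct contribution to the scalar reward, and $w_{r}^{\top} h_{l}(x) + b$ is the reward an intermediate layer would already imply.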

Principles

Method

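For the Reward Lens, the library projects intermediate residual streams onto $w_{r}$. For attribution, it decomposes the reward into per-component contributions along $w_{r}$. For contrastive patching, it swaps activations between runs and measures the resulting change in reward, all while detecting out-of-distribution effects.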
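The three operations can be illustrated on a toy residual network with a linear scalar head. This is a sketch under stated assumptions, not the reward-lens API: the model, the `forward` function, and all variable names are hypothetical, there is no final normalization layer, and out-of-distribution detection is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers = 16, 4

# Toy stand-in for a reward model: residual blocks plus a linear scalar head.
weights = [rng.normal(scale=0.5, size=(d_model, d_model)) for _ in range(n_layers)]
w_r = rng.normal(size=d_model)  # reward head weight vector
b_r = 0.1                       # reward head bias

def forward(x, patch=None):
    """Run the toy model; `patch` is (layer_index, replacement_output) or None.

    Returns the scalar reward, the residual stream before/after each layer,
    and each layer's additive output.
    """
    h = x.copy()
    streams, outputs = [h.copy()], []
    for i, W in enumerate(weights):
        out = np.tanh(W @ h)
        if patch is not None and patch[0] == i:
            out = patch[1]  # overwrite this layer's output with the donor's
        h = h + out
        outputs.append(out)
        streams.append(h.copy())
    return float(w_r @ h + b_r), streams, outputs

x_chosen = rng.normal(size=d_model)    # stands in for a preferred response
x_rejected = rng.normal(size=d_model)  # stands in for a rejected response
r_chosen, streams_c, outs_c = forward(x_chosen)
r_rejected, _, outs_r = forward(x_rejected)

# Reward Lens: the reward implied by each intermediate residual stream.
lens = [float(w_r @ h + b_r) for h in streams_c]

# Component attribution: each layer's direct contribution along w_r.
attribution = [float(w_r @ out) for out in outs_c]
embed_term = float(w_r @ x_chosen)

# Contrastive patching: swap in the rejected run's output at one layer,
# rerun everything downstream, and record the change in reward.
patch_effect = [
    forward(x_chosen, patch=(i, outs_r[i]))[0] - r_chosen
    for i in range(n_layers)
]

for i in range(n_layers):
    print(f"layer {i}: attribution={attribution[i]:+.3f}  patch_effect={patch_effect[i]:+.3f}")
```

In this toy setup the attributions, the embedding term, and the bias sum exactly to the reward, whereas a patch at an early layer also changes every later layer's output. That downstream dependence is one reason an observational score and a causal effect can disagree.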

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.