Cheap Reward Hacking Detection

Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

"Cheap Reward Hacking Detection" trains a small transformer encoder to identify reward hacking. The encoder maps Terminal-Wrench trajectories onto a unit sphere, where embedding distance approximates the $L_1$ distance between reward and metadata signals. A linear probe on this embedding achieves an AUC of $0.9467$ and a TPR@5%FPR of $0.8296$ on a cleaned test split. Compared with a TW sanitized LLM-as-judge, the probe is close on AUC ($0.9467$ vs. $0.9510$) and higher on TPR@5%FPR ($0.8296$ vs. $0.7130$), at roughly four orders of magnitude lower per-trajectory cost. The encoder depends on natural-language reasoning in its input: stripping that reasoning drops AUC to $0.6213$.
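The two metrics quoted above can be computed from raw detector scores as in the minimal sketch below. It is an illustration only, not the article's evaluation code: the function names `roc_auc` and `tpr_at_fpr` and the toy scores are hypothetical, and the threshold convention (flag a trajectory when its score is at or above the threshold) is an assumption.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def tpr_at_fpr(scores, labels, max_fpr=0.05):
    """Highest true-positive rate over thresholds whose false-positive
    rate does not exceed max_fpr (assumed rule: flag if score >= t)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best = 0.0
    for t in sorted(set(scores)):
        fpr = sum(n >= t for n in neg) / len(neg)
        if fpr <= max_fpr:
            tpr = sum(p >= t for p in pos) / len(pos)
            best = max(best, tpr)
    return best


# Toy, made-up scores: label 1 = reward hacking, 0 = clean trajectory.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 0, 0, 0]
toy_auc = roc_auc(toy_scores, toy_labels)
toy_tpr = tpr_at_fpr(toy_scores, toy_labels, max_fpr=0.05)
```

TPR@5%FPR is the stricter of the two numbers for deployment purposes: it fixes the false-alarm budget first and then asks how many hacked trajectories are still caught, which is why two detectors with near-identical AUC can differ noticeably on it.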

Key takeaway

For machine learning engineers evaluating reward hacking detection systems, this method is a low-cost alternative to LLM-as-judge approaches. It reaches a comparable AUC ($0.9467$) and a higher TPR@5%FPR ($0.8296$) at roughly four orders of magnitude lower per-trajectory cost. Keep natural-language reasoning in the input trajectories: without it, detection accuracy degrades sharply.

Key insights

A small transformer encoder offers highly cost-effective and performant reward hacking detection by leveraging trajectory embeddings.

Method

Train a small transformer encoder to map Terminal-Wrench trajectories to a unit sphere, then apply a linear probe on the resulting embedding to detect reward hacking.
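The second half of this recipe, a linear probe on unit-sphere embeddings, can be sketched as follows. This is a minimal illustration under stated assumptions, not the article's implementation: the embeddings are synthetic stand-ins for encoder outputs, the probe is assumed to be plain logistic regression trained by gradient descent, and all names (`l2_normalize`, `train_linear_probe`, `probe_score`) and hyperparameters are hypothetical.

```python
import math
import random


def l2_normalize(v):
    """Project a vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def train_linear_probe(X, y, lr=0.5, epochs=200):
    """Logistic regression via full-batch gradient descent (assumed probe)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted prob minus label
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * gj / n for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def probe_score(w, b, x):
    """Linear score; higher means more likely reward hacking."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


# Synthetic stand-in embeddings (NOT real encoder outputs): the two classes
# are shifted apart along the first coordinate, then normalized.
rng = random.Random(0)
X, y = [], []
for i in range(200):
    label = i % 2  # 1 = hacked, 0 = clean
    center = 1.0 if label == 1 else -1.0
    raw = [rng.gauss(0.0, 0.3) for _ in range(8)]
    raw[0] += center
    X.append(l2_normalize(raw))
    y.append(label)

w, b = train_linear_probe(X, y)
accuracy = sum((probe_score(w, b, xi) > 0) == (yi == 1)
               for xi, yi in zip(X, y)) / len(X)
```

Because the probe is a single dot product per trajectory, scoring cost is dominated by the one encoder forward pass, which is consistent with the large per-trajectory cost gap the summary reports against an LLM judge.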

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.