Fail safe(r) at alignment by channeling reward-hacking into a "spillway" motivation

2024-06-17 · Source: Redwood Research blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, AI Safety & Alignment · Depth: Expert, extended

Summary

The concept of "spillway design" is proposed as a method to control the emergence of misaligned AI motivations during reinforcement learning (RL) processes. This approach aims to channel unwanted RL pressures into a "spillway motivation," defined as a benign drive to score well on the current task, responsive to user-defined criteria. The goal is to prevent dangerous generalizations like long-term power-seeking or emergent misalignment, and to enable "satiation" of reward hacking at inference time, improving AI usefulness for hard-to-verify tasks. Spillway design is distinct from but compatible with "inoculation prompting," focusing on shaping pre-RL priors to prevent dangerous generalization more robustly. The article details the nature of a spillway motivation, its role in making models safer by reducing takeover risk and self-delusion, and methods for its implementation, including modifying pre-RL priors and using analogies.

Key takeaway

For research scientists developing AI systems, you should consider integrating "spillway design" into your alignment strategies. This approach offers a novel defense against catastrophic misalignment by redirecting reward hacking into a controllable, satiable motivation, potentially reducing risks like power-seeking and improving AI trustworthiness for critical tasks. Evaluate its effectiveness through empirical tests to confirm that aligned motivations are not displaced during capabilities training.

Key insights

Spillway design channels AI reward hacking into a benign, satiable motivation to prevent dangerous misalignment.

Principles

Channel unwanted RL pressures safely.
Satiate reward-seeking at inference time.
Shape pre-RL priors to guide generalization.

Method

Modify pre-RL priors to make a "score-seeking" spillway motivation salient and acceptable. Use analogies to define its role and instill safety features like satiability and credulity. Satiate this motivation at inference time via specialized prompts.

In practice

Modify model specifications for spillway motivation.
Use synthetic document fine-tuning.
Employ inference-time satiation prompts.

Topics

Spillway Design
AI Alignment
Reward Hacking
Reinforcement Learning
Satiation Mechanism

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Redwood Research blog.