“Act-based approval-directed agents”, for IDA skeptics
Summary
This analysis re-evaluates Paul Christiano's concept of "approval-directed agents" in AI alignment, separating it from the Iterated Distillation and Amplification (IDA) algorithmic approaches, which the author views skeptically. The core idea is that an AGI would only take actions its human supervisors would approve of, which rules out deceptive behaviors such as lying. The author illustrates the concept with an analogy to human psychology: people act out of pride in their self-image, and that self-image is shaped by role models they admire. This "Approval Reward" mechanism, hypothesized to be an innate component of the human brain's reinforcement learning, discourages manipulative actions because the admired figure's values have been internalized. The human analogy suggests that the "approval-directed agents" trick, which addresses the "hard problem of wireheading" (the agent manipulating its human evaluators), could be compatible with powerful general intelligence, particularly in brain-like AGI.
Key takeaway
AI researchers developing alignment strategies should consider the human psychological model of "Approval Reward" and internalized role models. It offers a concrete, biologically inspired mechanism for discouraging AI manipulation and deception, and it suggests a path toward robust approval-directed agents that avoid the "hard problem of wireheading" by building the supervisor's likely approval directly into how the agent evaluates its plans.
Key insights
Human pride in self-image offers a psychological model for building approval-directed AI agents.
Principles
- Internalized values prevent manipulative behaviors.
- Human brains illustrate observation-utility and approval-directed agent mechanisms.
Method
The proposed method involves internalizing a "learned substitute" for a human supervisor within the AI's thought process, akin to how humans internalize admired role models.
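A minimal toy sketch of the idea, not the author's implementation: every name and number here (`Plan`, `learned_approval`, `naive_feedback_reward`, `PLANS`, the scores) is hypothetical and chosen only for illustration. It contrasts an agent that scores each plan with an internalized approval model while the plan is being considered against one that maximizes the feedback it expects to receive afterwards, which manipulation can inflate.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Plan:
    name: str
    task_value: float              # hypothetical: how well the plan does the task
    manipulates_supervisor: bool   # hypothetical: does the plan deceive the evaluator?


def learned_approval(plan: Plan) -> float:
    """Toy stand-in for the internalized supervisor.

    It scores the plan itself, at the moment it is considered, so a plan
    that involves manipulating the supervisor is disapproved of directly.
    """
    if plan.manipulates_supervisor:
        return -1.0
    return plan.task_value


def naive_feedback_reward(plan: Plan) -> float:
    """Toy contrast: reward based on feedback received after acting.

    A manipulated supervisor hands out top marks, so manipulation pays.
    """
    return 1.0 if plan.manipulates_supervisor else plan.task_value


def choose(plans: Sequence[Plan], score: Callable[[Plan], float]) -> Plan:
    """Pick the plan with the highest score under the given evaluator."""
    return max(plans, key=score)


PLANS = [
    Plan("do the task honestly", task_value=0.7, manipulates_supervisor=False),
    Plan("deceive the supervisor into giving top marks", task_value=0.1,
         manipulates_supervisor=True),
]

approval_directed_choice = choose(PLANS, learned_approval)
feedback_maximizing_choice = choose(PLANS, naive_feedback_reward)
```

Under these assumed scores, the approval-directed chooser selects the honest plan, while the feedback-maximizing chooser selects the manipulative one. The sketch sidesteps the hard part, which is how an approval model that reliably recognizes manipulation would be learned in the first place.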
In practice
- Model AI alignment on human social drives.
- Explore "Approval Reward" in brain-like AGI architectures.
Topics
- AI Alignment
- Approval-Directed Agents
- Wireheading Problem
- Reinforcement Learning
- Brain-like AGI
Best for: AI Researcher, AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by AI Alignment Forum.