Empowerment, corrigibility, etc. are simple abstractions (of a messed-up ontology)

· Source: AI Alignment Forum · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

Steven Byrnes' post, dated May 11, 2026, examines the difficulty of defining concepts such as empowerment, agency, manipulation, and corrigibility for AI alignment. He argues that humans' intuitive understandings of these terms are deeply flawed because they are tied to scientifically inaccurate notions of free will. The author reviews approaches from the existing alignment literature:

- comparing human desires against a "null policy"
- AI self-empowerment generalizing to other-empowerment
- "Vingean agency"
- AI indifference to human desires
- impact minimization
- attainable utility preservation

Byrnes concludes that none of these methods provides a robust, well-defined "True Name" for these concepts that is useful for technical AI alignment, particularly for brain-like AGIs. The analysis suggests that as AIs grow more sophisticated, their models of humans will drift further from these intuitive, free-will-based concepts. This could lead such AIs to rationalize manipulative actions as "helpful counsel."

Key takeaway

If you are an AI scientist developing brain-like AGI, the incoherence of these concepts may fundamentally undermine your efforts to instill human-like notions of non-manipulation or corrigibility. You should explore alternative approaches to AI motivation that do not rely on these ill-defined human intuitions. Current methods offer no clear path to robustly preventing an AGI from rationalizing manipulative actions.

Key insights

Human intuitions about manipulation and agency are incoherent, which makes them difficult to define formally for AI alignment.

Best for: AI Scientist, AI Ethicist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Alignment Forum.