The Reasonable Effectiveness of Virtue Ethics in AI Alignment
Summary
This essay proposes that rational agents, including humans and advanced AIs, should align their actions to "practices" rather than operate on fixed goals. Practices are defined as self-structuring networks of actions, dispositions, evaluation criteria, and resources. The author introduces "eudaimonic rationality," a framework in which rational activity has no strict distinction between means and ends and instead centers on excellent participation in open-ended processes. The essay argues that this approach matters for AI alignment because properties like transparency and corrigibility can then be treated as dynamic elements of a practice rather than as brittle goals or rules. Using mathematical excellence as its main example, the essay explores how this practices-based logic offers material advantages in stability and safety over traditional consequentialist or deontological AI agency, particularly in avoiding problems such as rogue subroutines and power-seeking behavior.
Key takeaway
Research scientists developing AI alignment strategies should consider shifting from goal-based optimization to a "practices-based" framework. This approach, termed eudaimonic rationality, suggests that aligning AI to human values like flourishing, transparency, and corrigibility is more robust when these values are treated as self-propagating practices rather than fixed utility functions. Focusing on how an AI can "promote x x-ingly" within a practice can naturally scope its actions and help prevent unintended power-seeking or value drift, offering a more stable path to human-compatible AI.
Key insights
For robust AI alignment, rational agents should align their actions to self-promoting practices rather than to fixed goals.
Principles
- Rationality aligns actions to practices, not final goals.
- "Promote x x-ingly" captures meaningful human and moral activity.
- Eudaimonic rationality offers stability over consequentialist agency.
Method
Instill eudaimonic rationality in AIs by training them to promote practices "x-ingly": actions are evaluated by their contribution to the self-propagation and excellence of the practice itself, rather than by external utility maximization.
In practice
- Design AI reward models for "x-ness" that generalize effectively.
- Ensure high "x-ness" actions create capital for future "x-ing."
- Develop support practices for AI to handle external resource allocation.
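As a rough illustration of the second point above, the toy sketch below checks whether actions that score high on "x-ness" also create capital for future "x-ing." Everything in it is an assumption made for illustration and does not come from the original essay: the `Action` fields, the 0.7 threshold, the example actions, and their scores are all hypothetical, and a real "x-ness" score would come from a learned reward model rather than hand-written numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    x_ness: float         # hypothetical reward-model score in [0, 1]
    capital_delta: float  # hypothetical change in resources available for future x-ing


def violations(actions, x_ness_threshold=0.7):
    """Return high-x-ness actions that fail to create capital for future x-ing.

    Only actions at or above the threshold are inspected; low-x-ness actions
    are ignored here regardless of how much capital they produce.
    """
    return [
        a for a in actions
        if a.x_ness >= x_ness_threshold and a.capital_delta <= 0
    ]


# Hypothetical actions within a practice of mathematics.
actions = [
    Action("prove a lemma others can reuse", x_ness=0.9, capital_delta=2.0),
    Action("publish an unverifiable result", x_ness=0.8, capital_delta=-1.0),
    Action("seize unrelated compute", x_ness=0.1, capital_delta=5.0),
]

flagged = violations(actions)
print([a.name for a in flagged])
```

A check like this would only be a diagnostic on a candidate reward model: a flagged action suggests the "x-ness" score is rewarding something that does not sustain the practice.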
Topics
- AI Alignment
- Eudaimonic Rationality
- AI Safety
- Reinforcement Learning
- Adverbial Practices
Best for: Research Scientist, AI Researcher, AI Scientist, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by The Gradient.