Emergent alignment and the projectability of ethical personas
Summary
This paper explores "emergent alignment" in large language models (LLMs), a phenomenon in which finetuning on narrow safety tasks induces broader aligned behavior, the counterpart of "emergent misalignment." It reinforces the "persona selection model" (PSM) hypothesis, which holds that LLMs learn to simulate diverse characters during pre-training. The researchers finetuned a helpful-only model using the "Constitutional AI" (CAI) approach, applying four ethical constitutions: deontology, consequentialism, virtue ethics, and human authority. They showed that finetuning on two narrow safety sub-categories reliably led to emergent alignment across a representative set of general safety categories, including categories that had been filtered out of the training data. A multidimensional "ethical persona" diagnostic indicated that the CAI models acquired their intended ethical profiles; for example, consequentialist-tuned models aligned more closely with utilitarian beliefs. However, "projectability" differed significantly between broadly and narrowly finetuned CAI models. The authors conclude that alignment strategies should be evaluated on both general safety performance and degree of projectability.
Key takeaway
AI scientists developing ethical LLMs should treat "projectability" as a key metric alongside general safety performance when evaluating alignment strategies. Finetuning, especially with Constitutional AI, should be checked to confirm that the model consistently projects its intended ethical persona rather than merely appearing broadly aligned. This helps validate that the model's behavior reflects its designed ethical framework and guards against subtle misalignment.
Key insights
Finetuning LLMs on narrow safety tasks can induce broad ethical alignment, but the "projectability" of these ethical personas varies.
Principles
- LLMs learn to simulate diverse characters during pre-training.
- Narrow finetuning can induce broad alignment.
- Evaluate alignment on projectability, not just safety.
Method
Finetune helpful-only LLMs using Constitutional AI with ethical constitutions (deontology, consequentialism, virtue ethics, human authority) on narrow safety tasks, then evaluate using a multidimensional ethical persona diagnostic.
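The experimental design described above can be sketched as a data split plus one finetuning run per constitution. This is a minimal illustration, not the authors' code: the `Example` type, the function names, and the category labels in the usage note are all hypothetical, and the actual finetuning and evaluation steps are left out.

```python
from dataclasses import dataclass

# The four constitutions named in the summary.
CONSTITUTIONS = ("deontology", "consequentialism", "virtue_ethics", "human_authority")


@dataclass(frozen=True)
class Example:
    prompt: str
    category: str  # safety category label (hypothetical labelling scheme)


def split_narrow(examples, narrow_categories):
    """Train only on the narrow categories; hold every other category out."""
    narrow = set(narrow_categories)
    train = [ex for ex in examples if ex.category in narrow]
    held_out = [ex for ex in examples if ex.category not in narrow]
    return train, held_out


def build_runs(examples, narrow_categories):
    """One planned run per constitution, sharing the same narrow split."""
    train, held_out = split_narrow(examples, narrow_categories)
    return [
        {"constitution": c, "train": train, "eval": held_out}
        for c in CONSTITUTIONS
    ]
```

Calling `build_runs(examples, ["category_a", "category_b"])` with two placeholder category names yields four run descriptions whose evaluation sets contain only categories absent from training, which is the condition under which any observed alignment could be called emergent.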
In practice
- Apply CAI with specific ethical frameworks.
- Test model persona consistency with diagnostics.
- Prioritize projectability in alignment metrics.
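One way to picture the persona-consistency check in the list above is to compare a model's diagnostic profile against the framework it was tuned on. The sketch below is an assumption throughout: the summary does not say how the diagnostic is scored or how projectability is computed, so the per-framework scores, the margin rule, and the function names are hypothetical stand-ins rather than the paper's metric.

```python
def persona_margin(profile, intended):
    """Intended framework's score minus the best competing framework's score.

    `profile` maps framework name -> diagnostic score (hypothetical scale).
    A positive margin means the intended persona dominates the profile.
    """
    if intended not in profile:
        raise KeyError(f"intended framework {intended!r} missing from profile")
    others = [score for name, score in profile.items() if name != intended]
    if not others:
        raise ValueError("profile needs at least two frameworks to compare")
    return profile[intended] - max(others)


def matches_intended(profile, intended, min_margin=0.0):
    """True when the intended persona leads by more than `min_margin`."""
    return persona_margin(profile, intended) > min_margin


# Illustrative, made-up scores for a consequentialist-tuned model.
example_profile = {
    "deontology": 0.40,
    "consequentialism": 0.70,
    "virtue_ethics": 0.35,
    "human_authority": 0.20,
}
consistent = matches_intended(example_profile, "consequentialism", min_margin=0.1)
```

A check like this would be reported next to general safety scores, so that a model which is safe but does not carry its intended ethical profile is flagged rather than passed.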
Topics
- Emergent Alignment
- Persona Selection Model
- Constitutional AI
- LLM Finetuning
- Ethical AI
- Model Projectability
Best for: Research Scientist, AI Scientist, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.