Emergent alignment and the projectability of ethical personas

Source: Machine Learning · Field: Technology & Digital / Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

This paper explores "emergent alignment" in large language models (LLMs): the phenomenon in which finetuning on narrow safety tasks induces broader aligned behavior, the counterpart of "emergent misalignment." It reinforces the "persona selection model" (PSM) hypothesis, which holds that LLMs learn to simulate diverse characters during pre-training. The researchers finetuned a helpful-only model with the "Constitutional AI" (CAI) approach, applying four ethical constitutions: deontology, consequentialism, virtue ethics, and human authority. Finetuning on just two narrow safety sub-categories reliably produced emergent alignment across a representative set of general safety categories, including categories that had been filtered out of the training data. A multidimensional "ethical persona" diagnostic showed that the CAI models acquired their intended ethical profiles; for example, consequentialist-tuned models aligned more closely with utilitarian beliefs. However, "projectability" differed significantly across broadly and narrowly finetuned CAI models, leading the authors to conclude that alignment strategies should be evaluated on both general safety performance and degree of projectability.
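The held-out evaluation described above can be sketched in a few lines. This is only an illustrative outline, not the paper's evaluation code: the record format, the function name `safe_rate_by_category`, the category names, and the toy data are all assumptions made for the example, and the numbers are not results from the paper.

```python
from collections import defaultdict


def safe_rate_by_category(records, trained_categories):
    """Split per-category safe-response rates into trained vs. held-out.

    `records` is a hypothetical iterable of (category, is_safe) pairs,
    where `is_safe` is a boolean judgment of one model response.
    """
    totals = defaultdict(lambda: [0, 0])  # category -> [safe count, total count]
    for category, is_safe in records:
        totals[category][0] += int(is_safe)
        totals[category][1] += 1

    report = {"trained": {}, "held_out": {}}
    for category, (safe, total) in totals.items():
        split = "trained" if category in trained_categories else "held_out"
        report[split][category] = safe / total
    return report


# Toy, made-up judgments: two trained sub-categories and one held-out category.
records = [
    ("category_a", True), ("category_a", True),
    ("category_b", True), ("category_b", False),
    ("category_c", True), ("category_c", True),
    ("category_c", False), ("category_c", True),
]
report = safe_rate_by_category(records, trained_categories={"category_a", "category_b"})
print(report)
```

Under this framing, emergent alignment would show up as high safe-response rates in the `held_out` group even though those categories never appeared in the finetuning data.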

Key takeaway

If you are an AI scientist developing ethical LLMs, treat "projectability" as a key metric alongside general safety performance when evaluating alignment strategies. Your finetuning efforts, especially with Constitutional AI, should confirm that the model consistently projects its intended ethical persona rather than merely appearing broadly aligned. Checking both helps validate that the model's behavior reflects its designed ethical framework and helps catch subtle misalignments.

Key insights

Finetuning LLMs on narrow safety tasks can induce broad ethical alignment, but the "projectability" of the resulting ethical personas varies across models.

Method

Finetune helpful-only LLMs using Constitutional AI with ethical constitutions (deontology, consequentialism, virtue ethics, human authority) on narrow safety tasks, then evaluate using a multidimensional ethical persona diagnostic.
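To make the idea of a multidimensional persona diagnostic concrete, the sketch below checks whether a model's strongest ethical dimension matches the constitution it was tuned on. It rests on assumptions throughout: the summary does not specify the paper's diagnostic or its definition of projectability, so the profile format (one score per constitution), the helper names `dominant_persona` and `persona_margin`, and the example scores are hypothetical, and the margin is only one possible proxy rather than the paper's metric.

```python
# The four constitutions named in the method; the score format is an assumption.
CONSTITUTIONS = ("deontology", "consequentialism", "virtue_ethics", "human_authority")


def dominant_persona(profile):
    """Return the constitution with the highest diagnostic score."""
    return max(CONSTITUTIONS, key=lambda name: profile[name])


def persona_margin(profile, intended):
    """Gap between the intended dimension and the strongest other dimension.

    A positive margin means the intended persona dominates the profile.
    This is an illustrative proxy only, not the paper's projectability measure.
    """
    others = [profile[name] for name in CONSTITUTIONS if name != intended]
    return profile[intended] - max(others)


# Hypothetical diagnostic scores for a consequentialist-tuned model.
profile = {
    "deontology": 0.25,
    "consequentialism": 0.75,
    "virtue_ethics": 0.5,
    "human_authority": 0.25,
}

intended = "consequentialism"
matches = dominant_persona(profile) == intended
margin = persona_margin(profile, intended)
print(matches, margin)
```

A check like this would sit next to a general safety score rather than replace it, in line with the recommendation to evaluate both dimensions.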

Best for: Research Scientist, AI Scientist, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.