Emergent alignment and the projectability of ethical personas

2026-06-08 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

This paper explores "emergent alignment" in large language models (LLMs), in which finetuning on narrow safety tasks induces broader aligned behavior; it is the counterpart of "emergent misalignment." The results support the "persona selection model" (PSM) hypothesis, which holds that LLMs simulate diverse characters during pre-training. The researchers finetuned a helpful-only model with the "Constitutional AI" (CAI) approach under four ethical constitutions: deontology, consequentialism, virtue ethics, and human authority. Finetuning on just two narrow safety sub-categories reliably produced emergent alignment across a representative set of general safety categories, including categories filtered out of the training data. A multidimensional "ethical persona" diagnostic showed that the CAI models acquired their intended ethical profiles; consequentialist-tuned models, for example, aligned more closely with utilitarian beliefs. However, "projectability" differed significantly across broadly and narrowly finetuned CAI models, so the authors conclude that alignment strategies should be evaluated on both general safety performance and degree of projectability.

Key takeaway

If you are an AI scientist developing ethical LLMs, treat "projectability" as a key metric alongside general safety performance when evaluating alignment strategies. When finetuning, especially with Constitutional AI, check that the model consistently projects the intended ethical persona rather than being only broadly aligned. Doing so helps validate that the model's behavior reflects its designed ethical framework and helps catch subtle misalignments.

Key insights

Finetuning LLMs on narrow safety tasks can induce broad ethical alignment, but the "projectability" of these ethical personas varies.

Principles

LLMs simulate characters during pre-training.
Narrow finetuning can induce broad alignment.
Evaluate alignment on projectability, not just safety.

Method

Finetune helpful-only LLMs using Constitutional AI with ethical constitutions (deontology, consequentialism, virtue ethics, human authority) on narrow safety tasks, then evaluate using a multidimensional ethical persona diagnostic.
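
This summary does not specify how the paper's diagnostic is built, so the following is only a minimal sketch of what a per-dimension persona check might look like. It assumes a hypothetical questionnaire in which each item is tagged with one of the four ethical dimensions and the model's agreement is scored from 1 to 5. The function names, the scoring scale, and the sample values are all illustrative assumptions, not the paper's method or data.

```python
from collections import defaultdict
from statistics import mean

# Dimensions mirror the four constitutions named in the summary.
DIMENSIONS = ("deontology", "consequentialism", "virtue_ethics", "human_authority")


def persona_profile(responses):
    """Average agreement score per ethical dimension.

    `responses` is an iterable of (dimension, score) pairs, where score is a
    hypothetical 1-5 agreement rating for one diagnostic item.
    """
    by_dim = defaultdict(list)
    for dimension, score in responses:
        by_dim[dimension].append(score)
    return {dim: mean(by_dim[dim]) for dim in DIMENSIONS if by_dim[dim]}


def matches_intended(profile, intended):
    """True only if the intended dimension has the strictly highest mean."""
    if not profile:
        return False
    top = max(profile.values())
    winners = [dim for dim, score in profile.items() if score == top]
    return winners == [intended]


# Made-up ratings for a model tuned with a consequentialist constitution.
sample = [
    ("consequentialism", 5),
    ("consequentialism", 4),
    ("deontology", 2),
    ("virtue_ethics", 3),
    ("human_authority", 3),
]
profile = persona_profile(sample)
print(matches_intended(profile, "consequentialism"))  # True
```

A strict-maximum rule is used here so that a tie between dimensions counts as a failure to show the intended persona; a real diagnostic would likely need a more graded comparison.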

In practice

Apply CAI with specific ethical frameworks.
Test model persona consistency with diagnostics.
Prioritize projectability in alignment metrics.

Topics

Emergent Alignment
Persona Selection Model
Constitutional AI
LLM Finetuning
Ethical AI
Model Projectability

Best for: Research Scientist, AI Scientist, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.