Evil Spectra: How Optimisers can Amplify or Suppress Emergent Misalignment

Source: Artificial Intelligence · Field: Technology & Digital, Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Emergent misalignment (EM) is a recently discovered phenomenon in LLMs in which fine-tuning on a narrow misaligned task leads to broadly misaligned behaviour. A systematic characterisation across Qwen3 models, optimisers, datasets, and batch sizes finds that optimiser choice has the largest effect, producing a 7x spread in misalignment rate. Surprisingly, model size has negligible impact, both within the Qwen3 family and across 12 models (1B-235B) from three families trained with Adam. Analysis on Qwen3-8B indicates that final log training loss strongly predicts alignment, but that optimiser choice becomes more important than training loss after significant training. The adaptive optimiser Muon, which best preserves alignment, implicitly regularises towards a more uniform distribution of LoRA adapter singular values. This insight motivates a mitigation strategy: adding a loss term that incentivises a flatter singular value spectrum substantially recovers alignment for EM-prone adaptive optimisers such as Adam and Lion, at negligible cost in training loss.

Key takeaway

For machine learning engineers fine-tuning LLMs, optimiser choice is a critical factor in the severity of emergent misalignment, outweighing model scale. If you use an EM-prone adaptive optimiser such as Adam or Lion, consider adding a spectral regularisation term to the loss that encourages a flatter singular value spectrum in the LoRA adapters. This can substantially recover alignment with minimal impact on training loss, helping you maintain model safety and reliability.

Key insights

Optimiser choice affects the severity of emergent misalignment in LLMs more than model scale does.

Principles

Method

Add a loss term during training that incentivises a flatter singular value spectrum in the LoRA adapters, which mitigates emergent misalignment.
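
The summary does not give the exact form of the loss term, so the following is only an illustrative sketch of one way such a penalty could look, not the authors' formulation. It assumes a LoRA update of the form B @ A and measures flatness with a participation ratio of the squared singular values, which equals the rank r when all singular values are equal and 1 when a single one dominates. The function name, the weight lam, and the placeholder task_loss are hypothetical. NumPy is used to keep the example self-contained; in a real training loop the same expression would be written in an autodiff framework so that gradients flow to A and B.

```python
import numpy as np


def spectral_flatness_penalty(A, B, eps=1e-12):
    """Hypothetical penalty: ~0 when the r singular values of B @ A are equal,
    approaching 1 - 1/r when one singular value dominates.

    A: (r, d_in) LoRA down-projection; B: (d_out, r) LoRA up-projection.
    """
    r = A.shape[0]
    # The nonzero squared singular values of B @ A are the eigenvalues of this
    # small r x r matrix, so no SVD of the full d_out x d_in update is needed.
    M = (B.T @ B) @ (A @ A.T)
    sum_sq = np.trace(M)          # sum of sigma_i^2
    sum_fourth = np.trace(M @ M)  # sum of sigma_i^4
    # eps avoids 0/0 when B is still at a zero initialisation.
    participation = sum_sq ** 2 / (sum_fourth + eps)  # between 1 and r
    return 1.0 - participation / r


rng = np.random.default_rng(0)
r, d_in, d_out = 4, 16, 16

# Flat spectrum: orthonormal factors give r equal singular values.
Q_in, _ = np.linalg.qr(rng.normal(size=(d_in, r)))
Q_out, _ = np.linalg.qr(rng.normal(size=(d_out, r)))
A_flat, B_flat = Q_in.T, Q_out

# Peaked spectrum: stretch one direction by a factor of 10.
B_peaked = Q_out * np.array([10.0, 1.0, 1.0, 1.0])

flat = spectral_flatness_penalty(A_flat, B_flat)
peaked = spectral_flatness_penalty(A_flat, B_peaked)

lam = 0.1        # hypothetical regularisation weight
task_loss = 2.0  # placeholder for the fine-tuning loss
total_loss = task_loss + lam * peaked
```

Working on the r x r product keeps the cost of the penalty small relative to a full SVD of each adapted weight matrix. An entropy over normalised singular values would be another reasonable flatness measure under the same assumptions.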

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.