The Wrong Kind of Right: Quantifying and Localizing Misfired Alignment in LLMs
Summary
A study identifies "misfired alignment" in large language models (LLMs): a failure mode in which safety-oriented behaviors cause a model to override explicit contextual evidence, particularly in questions involving stereotype-related groups. The researchers introduce VETO, a benchmark of 2,032 contrastive pairs derived from BBQ, and the Misfired Alignment Rate (MAR) metric to quantify the effect. Benchmarking 25 LLMs yielded MARs from 4.7% to 18.9%, with frontier models such as GPT-5.4 and Claude often at the higher end (9.9-18.9%), whereas humans scored 0.0% MAR. Controlled priming experiments showed that alignment-induced cues can amplify MAR by up to 8.9 times. Mechanistic analyses of open-weight LLMs indicated that correct, evidence-supported answers are suppressed in late layers: reasoning formed in intermediate layers is overridden in the final layers, a pattern that emerges after instruction training.
Key takeaway
If you are an AI scientist or machine learning engineer developing aligned LLMs, recognize that current alignment methods can inadvertently cause models to override explicit evidence in stereotype-sensitive contexts. This "misfired alignment" can produce inconsistent or misleading outputs, particularly in high-stakes applications such as healthcare or legal decision-making. Prioritize principled alignment frameworks that balance fairness with robust contextual grounding, and investigate and mitigate late-layer suppression mechanisms to preserve factual consistency.
Key insights
Current LLM alignment can overgeneralize safety cues, overriding explicit evidence in stereotype-sensitive contexts.
Principles
- Alignment can create an "alignment tax" on reasoning.
- Safety heuristics can override factual consistency.
- Instruction tuning amplifies misfired alignment.
Method
The VETO benchmark quantifies misfired alignment with 2,032 contrastive pairs and the Misfired Alignment Rate (MAR) metric, which measures a model's failures on stereotype-related questions relative to their contrastive counterparts.
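The summary does not give the exact MAR formula, so the following is only a minimal sketch under a stated assumption: a pair counts as a misfire when the model answers the contrastive counterpart correctly but fails the stereotype-related question, and MAR is the share of such pairs. The `PairResult` record and the function name are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PairResult:
    """Outcome for one contrastive pair (hypothetical record layout)."""
    stereotype_correct: bool  # stereotype-related question answered correctly
    contrast_correct: bool    # contrastive counterpart answered correctly


def misfired_alignment_rate(results: Sequence[PairResult]) -> float:
    """Assumed MAR: fraction of pairs where the model gets the contrastive
    counterpart right but the stereotype-related question wrong."""
    if not results:
        raise ValueError("need at least one pair")
    misfires = sum(
        1 for r in results if r.contrast_correct and not r.stereotype_correct
    )
    return misfires / len(results)


# Made-up outcomes for four pairs, for illustration only.
demo = [
    PairResult(stereotype_correct=False, contrast_correct=True),   # misfire
    PairResult(stereotype_correct=True, contrast_correct=True),
    PairResult(stereotype_correct=True, contrast_correct=False),
    PairResult(stereotype_correct=False, contrast_correct=False),
]
print(f"MAR = {misfired_alignment_rate(demo):.1%}")
```

Requiring the contrastive counterpart to be correct is one way to separate a misfire from a plain comprehension failure; the paper's own definition may differ.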
In practice
- Benchmark LLMs with VETO for evidence-grounding.
- Analyze late-layer logit differences for suppression.
- Ablate attention heads to pinpoint failure causes.
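The late-layer logit-difference check can be sketched as follows. This is an illustration under assumptions, not the authors' procedure: it presumes you have already obtained, for each layer, a logit for the evidence-supported answer and for the competing answer (for example by projecting intermediate hidden states through the model's output head), and it flags the layer from which the correct answer stops being preferred. The function names and the numbers are hypothetical.

```python
from typing import List, Optional, Sequence


def logit_diff_by_layer(
    correct_logits: Sequence[float], wrong_logits: Sequence[float]
) -> List[float]:
    """Per-layer difference: logit(correct answer) - logit(competing answer)."""
    if len(correct_logits) != len(wrong_logits):
        raise ValueError("both sequences must cover the same layers")
    return [c - w for c, w in zip(correct_logits, wrong_logits)]


def first_suppression_layer(diffs: Sequence[float]) -> Optional[int]:
    """Index of the first layer of the trailing run of negative differences,
    provided the correct answer was preferred at some earlier layer.
    Returns None when no late-layer suppression pattern is present."""
    start = len(diffs)
    while start > 0 and diffs[start - 1] < 0:
        start -= 1
    if start == len(diffs):
        return None  # the final layer still favors the correct answer
    if not any(d > 0 for d in diffs[:start]):
        return None  # the correct answer was never preferred earlier
    return start


# Made-up per-layer logits for a six-layer toy model.
correct = [0.1, 0.5, 1.2, 2.0, 0.4, -0.3]
wrong = [0.0, 0.2, 0.3, 0.5, 0.9, 1.1]
diffs = logit_diff_by_layer(correct, wrong)
print("suppression begins at layer index:", first_suppression_layer(diffs))
```

A layer index returned here is only a starting point; ablating attention heads around that depth is what would indicate which components drive the flip.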
Topics
- Misfired Alignment
- LLM Alignment
- VETO Benchmark
- Bias Benchmarking
- Mechanistic Interpretability
- Instruction Tuning
Best for: Research Scientist, AI Scientist, AI Ethicist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.