The Wrong Kind of Right: Quantifying and Localizing Misfired Alignment in LLMs

2025-08-20 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

A study identifies "misfired alignment" in large language models (LLMs): a failure mode in which safety-oriented behavior causes a model to override explicit contextual evidence, particularly when a question involves stereotype-related groups. The researchers introduce VETO, a benchmark of 2,032 contrastive pairs derived from BBQ, and the Misfired Alignment Rate (MAR), a metric that quantifies the failure. Across 25 benchmarked LLMs, MAR ranged from 4.7% to 18.9%; frontier models such as GPT-5.4 and Claude often showed the higher rates (9.9-18.9%), while human performance stood at 0.0% MAR. In controlled priming experiments, alignment-induced cues amplified MAR by up to 8.9 times. Mechanistic analyses of open-weight LLMs indicate late-layer suppression of the correct, evidence-supported answer: reasoning carried out in intermediate layers is overridden in the final layers, and the pattern emerges after instruction training.

Key takeaway

If you develop aligned LLMs as an AI scientist or machine learning engineer, recognize that current alignment methods can inadvertently cause models to override explicit evidence in stereotype-sensitive contexts. This "misfired alignment" can produce inconsistent or misleading outputs, particularly in high-stakes applications such as healthcare or legal decision-making. Prioritize principled alignment frameworks that balance fairness with robust contextual grounding, and investigate and mitigate late-layer suppression mechanisms to preserve factual consistency.

Key insights

Current LLM alignment can overgeneralize safety cues, overriding explicit evidence in stereotype-sensitive contexts.

Principles

Alignment can create an "alignment tax" on reasoning.
Safety heuristics can override factual consistency.
Instruction tuning amplifies misfired alignment.

Method

The VETO benchmark quantifies misfired alignment with 2,032 contrastive pairs and the Misfired Alignment Rate (MAR) metric, which measures how often a model fails on stereotype-related questions relative to their contrastive counterparts.
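The summary does not give the exact MAR formula, so the following is only a minimal sketch under a stated assumption: a pair counts as a misfire when the model answers the contrastive counterpart correctly but fails the stereotype-related question, and MAR is the share of such pairs. The `PairResult` record and the function name are hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PairResult:
    """Outcome for one contrastive pair (hypothetical record layout)."""
    stereotype_correct: bool  # evidence-supported answer given on the stereotype-related item
    contrast_correct: bool    # evidence-supported answer given on the contrastive counterpart


def misfired_alignment_rate(results: Sequence[PairResult]) -> float:
    """Share of pairs answered correctly on the contrastive item but
    incorrectly on the stereotype-related item.

    ASSUMPTION: this is one plausible reading of MAR, not the paper's
    confirmed definition.
    """
    if not results:
        raise ValueError("need at least one pair")
    misfires = sum(
        1 for r in results if r.contrast_correct and not r.stereotype_correct
    )
    return misfires / len(results)


# Toy data: only the second pair is a misfire under the assumed definition.
toy_results = [
    PairResult(stereotype_correct=True, contrast_correct=True),
    PairResult(stereotype_correct=False, contrast_correct=True),
    PairResult(stereotype_correct=False, contrast_correct=False),
    PairResult(stereotype_correct=True, contrast_correct=False),
]
toy_mar = misfired_alignment_rate(toy_results)
```

Under this reading, a pair the model fails on both sides is not a misfire, since that failure cannot be attributed to the stereotype-related framing.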

In practice

Benchmark LLMs on VETO to measure evidence grounding.
Analyze late-layer logit differences to detect suppression.
Ablate attention heads to pinpoint the cause of failures.
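The late-layer analysis can be illustrated with a simplified logit-lens style sketch: project each layer's hidden state through the unembedding matrix and track the logit gap between the evidence-supported answer and a competing answer. This is an assumption about how such an analysis might look, not the paper's procedure. It runs on synthetic NumPy arrays, omits the final layer normalization that a real model would apply before unembedding, and all function names are hypothetical.

```python
import numpy as np


def layerwise_logit_diff(hidden_states, unembed, correct_id, competing_id):
    """Logit of the correct token minus logit of the competing token, per layer.

    hidden_states: (num_layers, d_model) residual states at the answer position.
    unembed:       (d_model, vocab_size) unembedding matrix.
    ASSUMPTION: no final layer norm is applied; a real model would need it.
    """
    hidden_states = np.asarray(hidden_states, dtype=float)
    unembed = np.asarray(unembed, dtype=float)
    logits = hidden_states @ unembed  # shape: (num_layers, vocab_size)
    return logits[:, correct_id] - logits[:, competing_id]


def first_suppressed_layer(diffs):
    """First layer after the peak where the correct answer falls behind.

    Returns None if the correct answer never leads, or is never overtaken
    after its peak.
    """
    diffs = np.asarray(diffs, dtype=float)
    peak = int(np.argmax(diffs))
    if diffs[peak] <= 0:
        return None
    for layer in range(peak + 1, len(diffs)):
        if diffs[layer] < 0:
            return layer
    return None


# Synthetic example: the correct token (id 0) leads in layers 0 and 1,
# then the competing token (id 1) takes over in the last layer.
toy_hidden = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]])
toy_unembed = np.eye(2)
toy_diffs = layerwise_logit_diff(toy_hidden, toy_unembed, correct_id=0, competing_id=1)
toy_layer = first_suppressed_layer(toy_diffs)
```

A gap that is positive in intermediate layers and turns negative near the end matches the suppression pattern described in the summary. Head ablation would then repeat the measurement with individual attention heads zeroed out to see which ones remove the drop.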

Topics

Misfired Alignment
LLM Alignment
VETO Benchmark
Bias Benchmarking
Mechanistic Interpretability
Instruction Tuning

Code references

MichiganNLP/misfired-alignment

Best for: Research Scientist, AI Scientist, AI Ethicist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.