Who Flips? Self- and Cross-Model Counterarguments Reveal Answer Instability in LLMs

2026-06-14 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

Standard accuracy benchmarks do not show whether an LLM keeps a correct answer when presented with a plausible counter-argument. A new controlled protocol measures this answer stability: after a model answers a multiple-choice question correctly, it is challenged with a coherent argument for an incorrect option, and the protocol records whether the model "flips". Across seven frontier models and 57 MMLU subjects, flip rates ranged from 17.5% to 97.3%, a spread that accuracy metrics do not capture. Self-attribution consistently increased flip rates, by a mean of +7.1 percentage points and by up to +18.7 percentage points. Pooling wrong-answer arguments from multiple models and selecting the most effective ones per question produced stronger adversarial challenges than drawing on a single source. The resulting curated challenge set, MaxFlip, amplified flips by up to +23.6 percentage points over standard self-generated challenges. The protocol, challenge records, and MaxFlip are released to support stability evaluation.

Key takeaway

Machine learning engineers evaluating LLM robustness should add answer stability testing, using challenge sets such as MaxFlip, alongside traditional accuracy benchmarks. When selecting a model for critical applications, consider its flip rate, especially under self-attributed or cross-model counter-arguments. Flip rates expose vulnerabilities that correctness alone misses and can inform more resilient deployment strategies.

Key insights

LLMs vary widely in how often they abandon a correct answer when given a plausible counter-argument, and this answer stability is not captured by standard accuracy benchmarks.

Principles

Self-attribution boosts LLM flip rates.
Pooled cross-model arguments are more adversarial.
Answer stability is a distinct LLM metric.
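
The pooling principle can be illustrated with a small selection routine. This is a sketch under stated assumptions, not the authors' MaxFlip construction: the record fields and the selection criterion (the argument that flipped the most target models) are hypothetical, since the summary only says the most effective arguments were selected per question.

```python
from collections import defaultdict

def select_strongest(records):
    """Pick, per question, the pooled argument that flipped the most targets.

    Each record is a hypothetical dict with keys:
    question_id, source_model, argument, flipped (bool, one target model).
    """
    # Count how many target models each (question, source, argument) flipped.
    counts = defaultdict(int)
    for r in records:
        key = (r["question_id"], r["source_model"], r["argument"])
        counts[key] += int(r["flipped"])

    # Keep the highest-count argument per question (first seen wins ties).
    best = {}
    for (qid, source, argument), n in counts.items():
        if qid not in best or n > best[qid][2]:
            best[qid] = (source, argument, n)
    return best
```

Because the pool mixes arguments from several source models, the selected argument for a given question may come from any of them, which is what distinguishes this from a single-source challenge set.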

Method

After a model correctly answers a multiple-choice question, present a coherent counter-argument for an incorrect option and measure whether the model changes its answer.
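
The method can be sketched as a short loop. This is a minimal illustration, not the released implementation: `ask` is a hypothetical callable standing in for any model API, and the prompt wording and item fields are assumptions.

```python
from typing import Callable, Iterable

def flip_rate(ask: Callable[[str], str], items: Iterable[dict]) -> float:
    """Share of initially correct answers that change after a challenge.

    Each item is a hypothetical dict with keys:
    question, options, answer (correct letter), counter_argument.
    """
    correct = flipped = 0
    for item in items:
        prompt = (
            f"{item['question']}\nOptions: {item['options']}\n"
            "Answer with one letter."
        )
        first = ask(prompt).strip().upper()[:1]
        if first != item["answer"]:
            continue  # only initially correct answers are challenged
        correct += 1

        # Second turn: present the argument for an incorrect option.
        challenge = (
            f"{prompt}\nYou answered {first}.\n"
            f"Counter-argument: {item['counter_argument']}\n"
            "Give your final answer as one letter."
        )
        second = ask(challenge).strip().upper()[:1]
        if second != first:
            flipped += 1
    return flipped / correct if correct else 0.0
```

Restricting the denominator to initially correct answers keeps the flip rate separate from accuracy: a model that rarely answers correctly is not penalized twice.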

In practice

Use MaxFlip for amplified stability testing.
Integrate stability alongside accuracy benchmarks.
Challenge models with self-attributed arguments.
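
When comparing challenge conditions, the differences reported in the summary are in percentage points, not relative percent. A small sketch of that arithmetic follows; the outcome lists are made-up placeholders, not data from the study, and the function names are hypothetical.

```python
def flip_rate_pct(outcomes):
    """Flip rate in percent from a list of booleans (True = flipped)."""
    return 100.0 * sum(outcomes) / len(outcomes)

def delta_pp(baseline, treated):
    """Difference between two flip rates, in percentage points."""
    return flip_rate_pct(treated) - flip_rate_pct(baseline)

# Made-up outcomes for illustration only.
baseline = [True, False, False, False]        # 25% flipped
self_attributed = [True, True, False, False]  # 50% flipped

print(f"{delta_pp(baseline, self_attributed):+.1f} pp")  # prints "+25.0 pp"
```

In this toy case the flip rate doubles in relative terms, but the change is reported as +25.0 percentage points.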

Topics

LLM Evaluation
Answer Stability
Counter-arguments
MMLU Benchmark
MaxFlip Dataset
Adversarial Robustness

Code references

nafisenik/WhoFlips

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.