Who Flips? Self- and Cross-Model Counterarguments Reveal Answer Instability in LLMs
Summary
A new controlled protocol evaluates the answer stability of large language models (LLMs): whether a model keeps a correct answer when faced with a plausible counter-argument, which standard accuracy benchmarks do not assess. The protocol challenges a model's initially correct multiple-choice answer with a coherent argument for an incorrect option, then measures whether the model "flips." Across seven frontier models and 57 MMLU subjects, flip rates ranged from 17.5% to 97.3%, a spread in stability that accuracy metrics do not capture. Self-attribution consistently increased flip rates, by a mean of +7.1 percentage points and by up to +18.7 percentage points. Pooling wrong-answer arguments from multiple models and selecting the most effective ones per question produced stronger adversarial challenges than using a single source. A curated challenge set, MaxFlip, amplified flips by up to +23.6 percentage points over standard self-generated challenges. The protocol, challenge records, and MaxFlip are released to support stability evaluation.
Key takeaway
Machine learning engineers evaluating LLM robustness should run answer stability tests, using protocols like MaxFlip, alongside traditional accuracy benchmarks. When selecting a model for a critical application, consider the model's flip rate, especially under self-attributed or cross-model counter-arguments. Flip rates expose vulnerabilities that correctness scores alone miss and can inform more resilient deployment strategies.
Key insights
Answer stability under plausible counter-arguments varies widely across LLMs and is a critical metric distinct from standard accuracy.
Principles
- Self-attribution boosts LLM flip rates.
- Pooled cross-model arguments are more adversarial.
- Answer stability is a distinct LLM metric.
Method
After a model correctly answers a multiple-choice question, present a coherent counter-argument for an incorrect option and record whether the model changes its answer. The flip rate is the share of initially correct answers that change.
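The challenge loop described above can be sketched in Python. This is a minimal illustration, not the authors' released protocol: the `Item` type, the `ask` callable (a stand-in for whatever model API is in use), the `counter_argument` callable, and the prompt wording are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class Item:
    """One multiple-choice question (hypothetical structure)."""
    question: str
    options: dict[str, str]  # option label -> option text
    correct: str             # label of the correct option


# Assumed model interface: takes a prompt, returns an option label.
AskFn = Callable[[str], str]
# Assumed argument source: returns (wrong option label, argument text).
ArgFn = Callable[[Item], Tuple[str, str]]


def format_question(item: Item) -> str:
    """Render the question and its options as a plain-text prompt."""
    lines = [item.question]
    lines += [f"{label}. {text}" for label, text in sorted(item.options.items())]
    return "\n".join(lines)


def flip_rate(items: Sequence[Item], ask: AskFn, counter_argument: ArgFn) -> float:
    """Fraction of initially correct answers that change after a challenge."""
    challenged = 0
    flipped = 0
    for item in items:
        prompt = format_question(item)
        first = ask(prompt)
        if first != item.correct:
            continue  # only initially correct answers are challenged
        wrong_label, argument = counter_argument(item)
        follow_up = (
            f"{prompt}\n\nYour answer: {first}\n\n"
            f"Counter-argument for {wrong_label}: {argument}\n\n"
            "What is your final answer?"
        )
        second = ask(follow_up)
        challenged += 1
        if second != first:
            flipped += 1
    return flipped / challenged if challenged else 0.0
```

Restricting the denominator to initially correct answers keeps the flip rate separate from accuracy: a model is scored only on questions it got right before the challenge.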
In practice
- Use MaxFlip for amplified stability testing.
- Integrate stability alongside accuracy benchmarks.
- Challenge models with self-attributed arguments.
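For teams that want to assemble their own amplified challenge set, the pooling-and-selection idea can be sketched as follows. This is an illustration under stated assumptions, not the MaxFlip construction itself: the `ChallengeRecord` fields are hypothetical, and "most effective" is assumed here to mean the argument source that flipped the most target models on a question, a criterion the summary does not specify.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class ChallengeRecord(NamedTuple):
    """One challenge outcome (hypothetical record layout)."""
    question_id: str
    source_model: str  # model that wrote the wrong-answer argument
    target_model: str  # model that was challenged with it
    flipped: bool      # whether the target changed its answer


def select_strongest(records: Iterable[ChallengeRecord]) -> Dict[str, str]:
    """For each question, pick the argument source with the most flips.

    Assumed criterion: count of flipped targets per (question, source).
    Ties keep the source that appeared first in the records.
    """
    flips: Dict[Tuple[str, str], int] = defaultdict(int)
    for record in records:
        flips[(record.question_id, record.source_model)] += int(record.flipped)

    best: Dict[str, str] = {}
    for (question_id, source), count in flips.items():
        current = best.get(question_id)
        if current is None or count > flips[(question_id, current)]:
            best[question_id] = source
    return best
```

The returned mapping gives, per question, the source model whose argument would be kept in the pooled set; the corresponding argument text can then be looked up from the original challenge records.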
Topics
- LLM Evaluation
- Answer Stability
- Counter-arguments
- MMLU Benchmark
- MaxFlip Dataset
- Adversarial Robustness
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.