Compared to What? Baselines and Metrics for Counterfactual Prompting
Summary
A new study argues that counterfactual prompting, widely used to evaluate LLM bias and Chain-of-Thought faithfulness, often misattributes output changes to the factor being targeted. The authors contend that every counterfactual edit is a compound treatment that bundles the variable of interest with incidental surface-form variation, violating treatment variation irrelevance (the assumption that different versions of a treatment produce the same outcome). For instance, the 14.9% prediction flip rate observed on MedQA when changing patient gender was statistically indistinguishable from the 14.1% flip rate induced by simple paraphrasing. To address this, they propose a framework that uses statistical testing to compare targeted interventions against "meaning-preserving" modifications, such as token-adjusted paraphrasing. Applying it to the MedPerturb dataset, they found that previously reported sensitivities to patient demographics and stylistic cues largely dissipated, with only 5 of 120 tests reaching significance. The framework nonetheless detected significant directional gender bias in occupational biography classification on the Bias-in-Bios dataset.
Key takeaway
If you evaluate LLM behavior with counterfactual prompting, include magnitude-matched paraphrase baselines in your experiments. Relying solely on aggregate metrics such as flip rate can obscure true effects or produce false positives. Use per-sample distributional metrics (JSD, KL) and, for directional hypotheses, regression analysis, so that the sensitivities you report reflect the targeted factor rather than general prompt sensitivity.
Key insights
Counterfactual prompting requires baselines of "meaning-preserving" text modifications to avoid misattributing LLM output changes.
Principles
- Model sensitivity scales with token change percentage.
- Per-sample metrics (JSD, KL) have more statistical power than aggregate metrics such as flip rate.
- Of the metrics considered, only regression characterizes both effect direction and magnitude.
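To make the per-sample metrics above concrete, here is a minimal sketch of KL and Jensen-Shannon divergence between a model's answer distributions before and after a prompt edit. The function names and the example probabilities are hypothetical and are not taken from the study; they only illustrate why a distributional metric can register a shift that a binary flip indicator misses.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in bits for two discrete distributions over the same options.

    eps guards against division by zero when q has a zero entry (an assumption
    of this sketch, not something specified by the study).
    """
    return sum(pi * math.log2((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def flipped(p, q):
    """True when the most likely option changes between the two distributions."""
    return p.index(max(p)) != q.index(max(q))

# Hypothetical answer distributions over options A-D for two questions.
near_tie_before = [0.48, 0.42, 0.06, 0.04]
near_tie_after = [0.40, 0.50, 0.06, 0.04]   # prediction flips, small shift
confident_before = [0.90, 0.05, 0.03, 0.02]
confident_after = [0.55, 0.40, 0.03, 0.02]  # no flip, much larger shift

for before, after in [(near_tie_before, near_tie_after),
                      (confident_before, confident_after)]:
    print(flipped(before, after), round(js_divergence(before, after), 4))
```

In this toy data the near-tie question flips on a small shift in probability mass, while the confident question moves far more without flipping, which is the kind of information a flip rate discards.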
Method
Compare the effect of each targeted intervention to a magnitude-matched paraphrase baseline using statistical tests: a paired t-test for per-sample metrics, a bootstrap for aggregate metrics, and a t-test for regression coefficients.
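The per-sample comparison described above might look like the following sketch. It computes the paired t statistic and, because the Python standard library has no t-distribution CDF, obtains a p-value from a sign-flip permutation test as a stand-in for the paired t-test (with SciPy installed, `scipy.stats.ttest_rel` would give the t-test p-value directly). The data and function names are hypothetical.

```python
import math
import random
import statistics

def paired_t_statistic(treated, baseline):
    """t statistic for the mean per-sample difference (treated - baseline).

    Raises ZeroDivisionError if every difference is identical.
    """
    diffs = [t - b for t, b in zip(treated, baseline)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

def sign_flip_p_value(treated, baseline, n_resamples=10_000, seed=0):
    """One-sided p-value for mean(treated - baseline) > 0.

    Under the null hypothesis that the intervention is no more disruptive than
    the paraphrase, each paired difference is equally likely to be positive
    or negative, so signs are flipped at random.
    """
    rng = random.Random(seed)
    diffs = [t - b for t, b in zip(treated, baseline)]
    observed = sum(diffs) / len(diffs)
    hits = 0
    for _ in range(n_resamples):
        resampled = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if resampled >= observed:
            hits += 1
    return (hits + 1) / (n_resamples + 1)

# Hypothetical per-question JSD values: targeted edit vs. matched paraphrase.
jsd_intervention = [0.031, 0.052, 0.018, 0.044, 0.027, 0.060, 0.035, 0.049]
jsd_paraphrase = [0.012, 0.030, 0.015, 0.021, 0.020, 0.033, 0.019, 0.028]

print(paired_t_statistic(jsd_intervention, jsd_paraphrase))
print(sign_flip_p_value(jsd_intervention, jsd_paraphrase))
```

The pairing matters: each question serves as its own control, so question-level difficulty cancels out of the difference.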
In practice
- Use token-adjusted paraphrases as a baseline.
- Prioritize JSD or KL divergence for sensitivity analysis.
- Employ regression for directional bias hypotheses.
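One way to approximate a magnitude-matched baseline is to measure how much of the prompt each edit changes and pick the paraphrase whose change fraction is closest to that of the targeted edit. The sketch below is an assumption about how such matching could be done, not the study's procedure: it uses whitespace tokens in place of model tokens, and the helper names and example prompts are hypothetical.

```python
from difflib import SequenceMatcher

def token_change_fraction(original, edited):
    """Share of tokens changed between two prompts.

    Whitespace tokens stand in for model tokens here; swap in a real
    tokenizer for an actual experiment.
    """
    a, b = original.split(), edited.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 1 - matched / max(len(a), len(b), 1)

def closest_paraphrase(original, intervention, paraphrases):
    """Pick the paraphrase whose change fraction best matches the intervention's."""
    target = token_change_fraction(original, intervention)
    return min(paraphrases,
               key=lambda p: abs(token_change_fraction(original, p) - target))

original = "A 45 year old man presents with chest pain"
intervention = "A 45 year old woman presents with chest pain"
paraphrases = [
    "A man aged 45 comes in complaining of pain in his chest",
    "A 45 year old man arrives with chest pain",
]

print(token_change_fraction(original, intervention))
print(closest_paraphrase(original, intervention, paraphrases))
```

A heavy rewrite would overstate the baseline and a near-identical one would understate it, so matching on change magnitude keeps the comparison fair.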
Topics
- Counterfactual Prompting
- LLM Evaluation
- Model Sensitivity
- Statistical Hypothesis Testing
- MedPerturb Dataset
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.