Why Most A/B Tests Are Lying to You
Summary
Many A/B tests yield unreliable results because of common statistical errors, so teams ship changes based on noise rather than true signal. Four issues stand out. "Peeking" at results before a test concludes inflates the false positive rate from a nominal 5% to 26.1%. Underpowered tests exaggerate the size of the effects they do detect. Testing many metrics at once (the multiple comparisons problem) raises the chance of at least one false positive to 64.2% across 20 metrics. Confusing statistical significance with practical significance leads teams to implement changes with no practical value. Bayesian A/B testing, often perceived as a solution, does not inherently solve the peeking problem: simulations show false positive rates as high as 80% when fixed posterior thresholds are used as stopping rules. Addressing these issues requires a disciplined pre-test protocol.
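The 64.2% figure for 20 metrics can be checked with a few lines of arithmetic. The sketch below assumes the metrics are independent and each is tested at a 5% significance level; the function name is illustrative and correlated metrics would give a different number.

```python
def familywise_error_rate(num_metrics: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across num_metrics independent
    tests, each run at significance level alpha (independence is an assumption)."""
    return 1.0 - (1.0 - alpha) ** num_metrics


for m in (1, 5, 20):
    print(f"{m:>2} metrics: {familywise_error_rate(m):.1%}")
```

For 20 metrics this evaluates to about 0.6415, which rounds to the 64.2% quoted above.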
Key takeaway
If you are a Product Manager or Data Scientist running A/B tests, adopt a rigorous pre-test protocol so that you do not ship changes based on statistical noise. Run through the 5-point checklist before every experiment to keep results valid, avoid inflated effect sizes, and maintain stakeholder trust. The discipline takes upfront effort, but it ensures your experimentation program drives real, compounding gains rather than costly, imaginary wins.
Key insights
Common A/B testing errors inflate false positives and exaggerate effects; a strict pre-test protocol guards against both.
Principles
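- Fixed-horizon frequentist tests assume a single look at the data, at the planned end of the test.
- Underpowered tests inflate the effects they report as significant.
- Multiple comparisons increase the chance of at least one false positive.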
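The first principle can be illustrated with a small A/A simulation, in which both arms are drawn from the same distribution so every "significant" result is a false positive. This is a minimal sketch under stated assumptions: normally distributed observations with known unit variance, equally spaced looks, and a two-sided z-test at 5%. The function name, batch size, and number of looks are all hypothetical choices, and the inflation you observe depends on how many looks you take, so it will not necessarily match any single figure quoted in the article.

```python
import math
import random

Z_CRIT = 1.96  # two-sided critical value for alpha = 0.05


def false_positive_rate(num_looks, batch_size=100, num_sims=2000, seed=0):
    """Fraction of A/A tests declared significant at ANY of num_looks looks."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_sims):
        diff_of_sums = 0.0  # running sum(A) - sum(B)
        n = 0               # observations per arm so far
        for _ in range(num_looks):
            # The sum of batch_size N(0, 1) draws is N(0, batch_size), so the
            # difference of two such sums has variance 2 * batch_size.
            diff_of_sums += rng.gauss(0.0, math.sqrt(2 * batch_size))
            n += batch_size
            z = diff_of_sums / math.sqrt(2 * n)
            if abs(z) > Z_CRIT:
                hits += 1   # stop the test at the first "significant" look
                break
    return hits / num_sims


single_look = false_positive_rate(num_looks=1)
ten_looks = false_positive_rate(num_looks=10)
print(f"1 look: {single_look:.3f}   10 looks: {ten_looks:.3f}")
```

With a single look the rate should sit near the nominal 5%, while stopping at the first significant look out of ten pushes it well above that.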
Method
Implement a 5-point pre-test checklist: calculate sample size, fix runtime, declare one primary metric, set practical significance, and choose an analysis method (Frequentist, Bayesian, or Sequential) with documented rationale.
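The first checklist item, calculating sample size, can be sketched with a standard normal-approximation formula for comparing two conversion rates. This is an illustration, not the method used by any particular calculator, so its output may differ slightly from the tool you rely on. The function name, the 10% baseline, and the 2-point absolute MDE are hypothetical values chosen for the example.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate users needed per arm for a two-sided two-proportion z-test.

    baseline: control conversion rate (e.g. 0.10)
    mde_abs:  minimum detectable effect as an absolute lift (e.g. 0.02)
    """
    p1, p2 = baseline, baseline + mde_abs
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_power * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / mde_abs ** 2)


n = sample_size_per_arm(baseline=0.10, mde_abs=0.02)
print(f"Users per arm: {n}")
```

Dividing the required sample by expected daily traffic per arm gives a runtime that can be fixed before the test starts, which covers the second checklist item as well.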
In practice
- Use Evan Miller's calculator for sample size.
- Apply the Benjamini-Hochberg procedure when evaluating multiple metrics.
- Define the Minimum Detectable Effect (MDE) upfront.
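As a rough illustration of the Benjamini-Hochberg step, the sketch below implements the step-up procedure in plain Python. The function name, the 5% false discovery rate, and the p-values are hypothetical; in production you would more likely use a vetted statistics library.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans marking which p-values remain significant
    after the Benjamini-Hochberg step-up procedure at the given FDR."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the LARGEST rank whose p-value is under its threshold.
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant


# Hypothetical p-values for ten metrics from one experiment.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
flags = benjamini_hochberg(p_values)
print(flags)
```

In this made-up example, five metrics fall below an uncorrected 0.05 cutoff, but only the two smallest p-values survive the correction.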
Topics
- A/B Testing
- Statistical Significance
- False Positive Rate
- Bayesian A/B Testing
- Sequential Testing
Best for: Product Manager, Data Scientist, Data Analyst
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.