Why Most A/B Tests Are Lying to You

2026-03-11 · Source: Towards Data Science · Field: Technology & Digital — Data Science & Analytics · Depth: Intermediate, long

Summary

Many A/B tests produce unreliable results because of common statistical errors, so teams ship changes based on noise rather than signal. The main problems are: "peeking" at results before a test concludes, which the article reports inflating the false positive rate from 5% to 26.1%; running underpowered tests, whose significant results exaggerate real effects; the multiple comparisons problem, where testing 20 metrics at the 5% level raises the chance of at least one false positive to 64.2%; and confusing statistical significance with practical significance, which leads teams to implement changes too small to matter. Bayesian A/B testing, often seen as a fix, does not inherently solve the peeking problem: simulations show false positive rates as high as 80% when a fixed posterior threshold is used as a stopping rule. Avoiding these errors requires a disciplined pre-test protocol.

Key takeaway

Product managers and data scientists who run A/B tests need a rigorous pre-test protocol to avoid shipping changes based on statistical noise. Run the 5-point checklist before every experiment to get valid results, avoid inflated effect sizes, and maintain stakeholder trust. The discipline takes upfront effort, but it ensures your experimentation program delivers real, compounding gains rather than costly, imaginary wins.

Key insights

Common A/B testing errors inflate false positives and exaggerate effects; strict pre-test protocols prevent them.

Principles

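Fixed-horizon frequentist tests assume a single look at the data; every extra look raises the false positive rate.
Underpowered tests exaggerate the effects they detect, because only unusually large estimates reach significance.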
Multiple comparisons increase false positives.
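
The first and third principles can be checked numerically. The sketch below is a minimal illustration, not the article's own simulation: it assumes independent metrics for the multiple comparisons arithmetic, and it models an A/A test as a z-test checked after each of several equally sized batches. The function names and the choice of 10 looks are hypothetical, and the code does not attempt to reproduce the article's 26.1% figure, which depends on how often results are checked.

```python
import random
from statistics import NormalDist


def family_wise_error(n_metrics, alpha=0.05):
    """Chance of at least one false positive across independent metrics."""
    return 1 - (1 - alpha) ** n_metrics


def peeking_false_positive_rate(n_looks, n_sims=20_000, alpha=0.05, seed=0):
    """Share of simulated A/A tests flagged significant at any interim look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    hits = 0
    for _ in range(n_sims):
        cumulative = 0.0
        for k in range(1, n_looks + 1):
            # Standardized difference contributed by one new batch (no true effect).
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / k ** 0.5  # z-statistic on all data collected so far
            if abs(z) > z_crit:
                hits += 1  # the test would have been stopped and shipped here
                break
    return hits / n_sims


fwer_20 = family_wise_error(20)
single_look = peeking_false_positive_rate(n_looks=1)
ten_looks = peeking_false_positive_rate(n_looks=10)

print(f"20 metrics, at least one false positive: {fwer_20:.3f}")
print(f"False positive rate, 1 look:   {single_look:.3f}")
print(f"False positive rate, 10 looks: {ten_looks:.3f}")
```

With a single look the simulated rate should sit near the nominal 5%; stopping at the first significant look pushes it well above that.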

Method

Implement a 5-point pre-test checklist: calculate sample size, fix runtime, declare one primary metric, set practical significance, and choose an analysis method (Frequentist, Bayesian, or Sequential) with documented rationale.
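
The first two checklist items (sample size and runtime) can be sketched with the standard normal-approximation formula for comparing two proportions. This is an assumption about method, not the article's calculation, and results may differ slightly from online calculators that use a different variance term. The baseline rate, minimum detectable effect, and daily traffic below are hypothetical inputs.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate users per arm for a two-sided two-proportion z-test."""
    p1, p2 = baseline, baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_power = NormalDist().inv_cdf(power)          # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)


# Hypothetical inputs, not taken from the article.
n_per_arm = sample_size_per_arm(baseline=0.10, mde_abs=0.01)
daily_users = 2_000  # assumed total traffic across both arms
runtime_days = math.ceil(2 * n_per_arm / daily_users)

print(f"Users per arm: {n_per_arm}")
print(f"Planned runtime: {runtime_days} days")
```

Because the required sample grows with the inverse square of the effect size, halving the minimum detectable effect roughly quadruples the sample, which is why the effect threshold needs to be set before the test rather than after.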

In practice

Use Evan Miller's calculator for sample size.
Apply Benjamini-Hochberg for multiple metrics.
Define Minimum Detectable Effect (MDE) upfront.
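
For the multiple-metrics item, the Benjamini-Hochberg step-up procedure can be written in a few lines. The sketch below is a plain-Python illustration; the p-values are hypothetical and the function name is not from the article.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return True for each hypothesis rejected at false discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p

    # Find the largest rank k whose p-value is at or below (k / m) * q.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_rank = rank

    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values for five secondary metrics.
p_values = [0.001, 0.008, 0.039, 0.041, 0.27]
flags = benjamini_hochberg(p_values, q=0.05)
naive = [p < 0.05 for p in p_values]

print("Naive p < 0.05:     ", naive)
print("Benjamini-Hochberg: ", flags)
```

In this example a naive 0.05 cutoff flags four metrics, while the corrected procedure keeps only the two strongest.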

Topics

A/B Testing
Statistical Significance
False Positive Rate
Bayesian A/B Testing
Sequential Testing

Best for: Product Manager, Data Scientist, Data Analyst

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.