I want to start with a specific story. A product manager I worked with ran an A/B test on a checkout button — different colour, slightly different copy. After two days, variant B was showing a 12% lift in conversion. They stopped the test and shipped variant B. Three weeks later, when someone ran a proper retrospective analysis, the difference had evaporated entirely. The 12% lift was noise. Two days of data, no sample size calculation, peeking at results daily — every mistake in the book.
This is not an unusual story. In my experience, the majority of A/B tests run by product teams are statistically invalid in some meaningful way. The tests are not fraudulent — everyone genuinely believes they are running real experiments. The problem is that the statistical requirements for a valid experiment are not intuitive, and when teams learn "just check if p < 0.05", they are learning the minimum possible amount of statistics, not enough to avoid the most common mistakes.
Here is what I think every developer working on a product team needs to understand.
The Sample Size Calculation You Are Probably Skipping
The most important thing you can do before starting an A/B test is calculate how many users you need. Not after seeing early results, not retrospectively — before the experiment begins. This is called a power calculation, and skipping it is the most common reason tests produce unreliable results.
```python
import math

from scipy import stats


def required_sample_size(
    baseline_rate: float,   # e.g. 0.05 for 5% conversion
    mde: float,             # minimum detectable effect, e.g. 0.01 for +1pp
    alpha: float = 0.05,    # significance threshold (false positive rate)
    power: float = 0.80,    # probability of detecting a real effect
) -> int:
    """Per-group sample size for a two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate + mde
    p_bar = (p1 + p2) / 2
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    n = (
        (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar)) +
         z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    ) / (p2 - p1) ** 2
    return math.ceil(n)


# Typical scenario: 5% baseline, want to detect a 1 percentage point improvement
print(required_sample_size(0.05, 0.01))  # → 8158 per group
```
More than 8,000 users per group to reliably detect a one percentage point improvement from a 5% baseline. If your site sees 500 visitors per day and you split traffic 50/50, that is roughly 33 days to reach the required sample size. Two days of data gives you nothing.
The number surprises most people because it depends so strongly on the effect size you are trying to detect: required sample size scales roughly with 1/MDE². Halving the MDE to 0.5 percentage points requires roughly four times as many users; doubling it to 2 percentage points requires roughly a quarter as many. This is why specifying the Minimum Detectable Effect (MDE) before the test is not bureaucracy — it is the only way to know how long to run.
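To make the scaling concrete, here is a small check. It uses the same two-proportion formula as `required_sample_size` above, restated compactly so the snippet runs on its own (the helper name `n_per_group` is my own):

```python
import math

from scipy import stats


def n_per_group(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    # Same two-proportion z-test formula as required_sample_size above.
    p_bar = (p1 + p2) / 2
    z_a = stats.norm.ppf(1 - alpha / 2)
    z_b = stats.norm.ppf(power)
    n = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
         + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / (p2 - p1) ** 2
    return math.ceil(n)


# Halving the MDE roughly quadruples the sample size; doubling it roughly quarters it.
for mde in (0.005, 0.01, 0.02):
    print(f"MDE {mde:.3f}: {n_per_group(0.05, 0.05 + mde):>6} users per group")
```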
The Peeking Problem (and Why It Invalidates Your Results)
The most seductive mistake in A/B testing is looking at the results before your sample size is reached and stopping early if things look significant. This practice — called peeking — inflates your false positive rate dramatically.
Here is the intuition: if you flip a fair coin repeatedly and stop the moment you are five heads ahead of tails, you will often stop with what looks like evidence of a biased coin, even though the coin is fair. The p < 0.05 threshold is only valid as a stopping criterion if you look exactly once, at a predetermined sample size. Every additional look increases the probability of seeing a false positive.
Empirically: checking your test at 10 predetermined points during the run inflates the actual false positive rate from 5% to roughly 19%. Checking daily can make it 50% or higher.
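You can verify the inflation with a quick Monte Carlo simulation. This is a sketch under my own illustrative parameters (10,000 users per arm, 10 evenly spaced looks, a true conversion rate of 5% in both arms): it runs A/A tests where no real difference exists and counts how often peeking "finds" one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)


def aa_test_with_peeking(n_per_arm: int = 10_000, looks: int = 10, p: float = 0.05) -> bool:
    """Return True if any interim look 'detects' a difference (a false positive)."""
    a = rng.random(n_per_arm) < p  # both arms drawn from the SAME distribution
    b = rng.random(n_per_arm) < p
    for i in range(1, looks + 1):
        n = i * n_per_arm // looks
        pa, pb = a[:n].mean(), b[:n].mean()
        pooled = (a[:n].sum() + b[:n].sum()) / (2 * n)
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        if se > 0 and abs(pa - pb) / se > 1.96:
            return True  # would have stopped early and declared a winner
    return False


runs = 2_000
false_positives = sum(aa_test_with_peeking() for _ in range(runs))
print(f"False positive rate with 10 peeks: {false_positives / runs:.1%}")
```

With the nominal 5% threshold applied at every look, the simulated false positive rate lands near the ~19% figure quoted above rather than 5%.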
The practical solutions:
- Fix the sample size in advance and look only once. Simple and statistically valid. Hard to enforce in teams under pressure to ship.
- Use sequential testing methods (SPRT, mSPRT). These are designed for continuous monitoring and adjust the significance threshold to account for multiple looks. Tools like Statsig, Optimizely, and LaunchDarkly implement these.
- Use Bayesian A/B testing. Frames the experiment as belief updating rather than hypothesis rejection, and handles early stopping more naturally — though interpretation requires care.
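To give a flavour of the Bayesian framing, here is a minimal sketch (the visitor and conversion counts are made up for illustration). With a Beta(1, 1) prior on each variant's conversion rate, the posterior after observing binomial data is again a Beta distribution, and sampling from the two posteriors gives the probability that B beats A:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical observed data
conversions_a, visitors_a = 480, 10_000   # 4.8% observed
conversions_b, visitors_b = 530, 10_000   # 5.3% observed

# Beta(1, 1) prior + binomial likelihood -> Beta posterior (conjugacy)
post_a = rng.beta(1 + conversions_a, 1 + visitors_a - conversions_a, size=100_000)
post_b = rng.beta(1 + conversions_b, 1 + visitors_b - conversions_b, size=100_000)

p_b_beats_a = (post_b > post_a).mean()
expected_lift = (post_b - post_a).mean()
print(f"P(B > A) = {p_b_beats_a:.1%}, expected lift = {expected_lift * 100:.2f}pp")
```

The interpretation caveat from the list above applies: "P(B > A) = 95%" is a statement about your posterior belief, not a frequentist error-rate guarantee, and stopping rules still affect the operating characteristics.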
Statistical Significance vs. Practical Significance
A result can be statistically significant and completely unimportant. With enough users, you can detect a 0.01% improvement in conversion rate at p < 0.001. Whether that improvement is worth the engineering cost of shipping the change is a separate question that statistics cannot answer for you.
I now insist on reporting results with confidence intervals rather than just p-values. "Conversion increased from 4.8% to 5.3% (95% CI: +0.2pp to +0.8pp, p = 0.004)" is actionable. "Result was significant" is not. The confidence interval tells you the range of plausible true effects — which tells you whether even the low end of the range is worth shipping for.
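A Wald-style confidence interval for the difference in conversion rates is straightforward to compute. A sketch, with hypothetical counts chosen to roughly match the 4.8% → 5.3% example above:

```python
import math

from scipy import stats


def diff_ci(conv_a: int, n_a: int, conv_b: int, n_b: int, confidence: float = 0.95):
    """Wald confidence interval and two-sided p-value for p_b - p_a."""
    pa, pb = conv_a / n_a, conv_b / n_b
    se = math.sqrt(pa * (1 - pa) / n_a + pb * (1 - pb) / n_b)
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    diff = pb - pa
    p_value = 2 * (1 - stats.norm.cdf(abs(diff) / se))
    return diff - z * se, diff + z * se, p_value


lo, hi, p = diff_ci(1440, 30_000, 1590, 30_000)  # 4.8% -> 5.3%, hypothetical counts
print(f"lift: 95% CI [{lo * 100:+.2f}pp, {hi * 100:+.2f}pp], p = {p:.4f}")
```

Reporting the interval forces the "is even the low end worth shipping?" conversation that a bare p-value never does.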
The Multiple Comparisons Problem
If you test 20 independent hypotheses at alpha = 0.05, you expect one false positive by chance alone — even if none of your hypotheses are actually true. This is the multiple comparisons problem. It is directly relevant to product teams who measure 10 or 20 metrics in every A/B test and then report on whichever ones look good.
The fix: pre-register exactly one primary metric before the test begins. That is the metric that determines whether you ship. Secondary metrics are for insight and hypothesis generation, not for decision-making. If you need to test multiple hypotheses, apply a Bonferroni correction (divide your alpha by the number of tests) or use a proper multiple testing framework.
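The Bonferroni correction itself is one line. A sketch with made-up p-values for five secondary metrics from a single test:

```python
# Hypothetical p-values for five secondary metrics in one A/B test
p_values = {
    "revenue_per_user": 0.012,
    "add_to_cart_rate": 0.048,
    "page_load_time": 0.300,
    "support_tickets": 0.700,
    "return_rate": 0.900,
}

alpha = 0.05
corrected_alpha = alpha / len(p_values)  # Bonferroni: divide alpha by number of tests

for metric, p in p_values.items():
    naive = "sig" if p < alpha else "ns"
    corrected = "sig" if p < corrected_alpha else "ns"
    print(f"{metric:>18}: p={p:.3f}  naive={naive:>3}  corrected={corrected:>3}")
```

Note how `add_to_cart_rate` at p = 0.048 is "significant" under the naive threshold but not under the corrected one: exactly the kind of result that gets reported as a win when teams shop through 20 metrics.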
What a Valid Test Actually Looks Like
Before starting any test, I now write a short test plan that answers:
- What is the primary metric? (One only.)
- What is the current baseline rate for that metric?
- What is the minimum detectable effect we care about?
- What is the required sample size per group?
- How long will it take at current traffic levels?
- Are there confounders to watch for? (Day-of-week effects, holidays, deployments.)
Running the test then becomes mechanical: wait for the sample size, look once, decide. The rigour is front-loaded into the design phase, where it belongs.
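The plan can be captured as a small structure so the arithmetic happens once, up front. This is a sketch of my own devising — the `TestPlan` class and its field names are not from any particular tool, and the sample size uses the rule-of-thumb approximation n ≈ 16·p̄(1−p̄)/MDE² (sometimes called Lehr's rule), which is close to the exact calculation at 80% power and α = 0.05:

```python
import math
from dataclasses import dataclass


@dataclass
class TestPlan:
    primary_metric: str
    baseline_rate: float   # current rate for the primary metric
    mde: float             # minimum detectable effect (absolute)
    daily_visitors: int    # total traffic entering the experiment
    n_groups: int = 2

    @property
    def n_per_group(self) -> int:
        # Rule-of-thumb sample size for alpha=0.05, power=0.80
        p_bar = self.baseline_rate + self.mde / 2
        return math.ceil(16 * p_bar * (1 - p_bar) / self.mde ** 2)

    @property
    def days_to_run(self) -> int:
        per_group_per_day = self.daily_visitors / self.n_groups
        return math.ceil(self.n_per_group / per_group_per_day)


plan = TestPlan("checkout_conversion", baseline_rate=0.05, mde=0.01, daily_visitors=500)
print(f"{plan.n_per_group} users per group, ~{plan.days_to_run} days at current traffic")
```

If `days_to_run` comes out at six months, you find out at the planning stage, when you can still widen the MDE or pick a higher-traffic surface, rather than three weeks into a doomed test.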
Further Reading
- How Not To Run an A/B Test — Evan Miller's classic piece on peeking
- Statsig Stats Engine documentation — a good explanation of sequential testing in practice
- Khan Academy: Significance Tests — for a thorough grounding in the underlying theory