Building a Bayesian A/B Testing System That Knows When to Stop

4 minute read

Published:

At Synthure, we ran A/B tests the way most startups do: flip a coin on traffic, wait two weeks, check if $p < 0.05$, ship or revert. This worked until we started testing features that affected claim approval rates, where each day of a bad variant cost real money and delayed patient reimbursements. We couldn’t afford to wait two weeks. We also couldn’t afford to stop early and be wrong.

The problem with fixed-horizon tests is structural. The two-week timeline is arbitrary. Stopping early because results look good inflates the Type I error rate, if you check p-values repeatedly, you will eventually find $p < 0.05$ by chance even with no real effect. The Bayesian framework fixes this at the root by replacing the hypothesis test with a decision rule based on the cost of being wrong.


The Beta-Bernoulli Model

Each user converts (1) or doesn’t (0). The unknown conversion rate $\theta$ gets a Beta prior: $\theta \sim \text{Beta}(\alpha, \beta)$. After observing $k$ conversions in $n$ trials, the posterior is

\[\theta \mid \text{data} \sim \text{Beta}(\alpha + k,\; \beta + n - k)\]

No integration required, the Beta family is conjugate to the Bernoulli. The posterior mean is $(\alpha+k)/(\alpha+\beta+n)$, converging to the true rate as $n$ grows, with variance shrinking as $O(1/n)$. A flat prior is $\text{Beta}(1,1)$; an informative prior encoding a historical 10% conversion rate is $\text{Beta}(5, 45)$.


Stopping via Expected Loss

The classical stopping rule is $p < 0.05$. The Bayesian stopping rule is: stop when the expected loss of acting on your current best estimate falls below a threshold $\varepsilon$.

Define the expected loss of shipping variant A when B might be better:

\[\text{EL}(A) = \mathbb{E}[(\theta_B - \theta_A)^+] = \int_0^1\!\int_0^1 \max(\theta_B - \theta_A,\, 0)\; p(\theta_A)\, p(\theta_B)\; d\theta_A\, d\theta_B\]

This is the expected conversion rate you’d leave on the table by shipping A. When $\text{EL}(A) < \varepsilon$ (say 0.001, less than 0.1% expected loss), ship A. The computation is a Monte Carlo estimate over posterior samples, callable after every observation batch:

sa = np.random.beta(alpha_a, beta_a, 50_000)
sb = np.random.beta(alpha_b, beta_b, 50_000)
loss_a = np.mean(np.maximum(sb - sa, 0))
loss_b = np.mean(np.maximum(sa - sb, 0))
# stop when min(loss_a, loss_b) < epsilon

This is not a hypothesis test. It is a conditional expectation, which can be safely recomputed as data arrives. There is no peeking problem because you are not accumulating evidence toward a threshold, you are updating a belief and querying its current decision implications.


Simulation Results

True rates: control 10%, treatment 12% (20% relative lift). 1,000 simulated experiments:

MethodAvg users neededPowerFalse positive rateAvg days (100 users/day)
Fixed horizon $n=1000$2,00087%5.0%20 days
Bayesian EL < 0.00184789%2.1%8.5 days
Bayesian EL < 0.00561283%3.4%6.1 days

The Bayesian test concludes in less than half the time with similar or better statistical properties. When the effect is large and consistent, both posteriors separate quickly and EL drops below threshold early. When there is no real effect, posteriors stay overlapped and EL never triggers, the test withholds judgment rather than returning a false positive through repeated checking.


Thompson Sampling for Ethical Traffic Allocation

During the test, Thompson sampling uses the posterior to allocate traffic rather than splitting 50/50:

theta_a = random.betavariate(alpha_a, beta_a)
theta_b = random.betavariate(alpha_b, beta_b)
variant = "A" if theta_a > theta_b else "B"

As evidence accumulates, the better variant’s posterior tightens and its samples are consistently higher, so it receives more traffic. On the same simulation, Thompson sampling allocates 71% of traffic to the treatment variant by the end, versus 50% under fixed-horizon testing. At 1,000 total users, roughly 210 fewer users see the inferior experience during the test.

In healthcare, this matters not as a statistical nicety but as an ethical alignment. Every user in a test is a person whose reimbursement outcome may be affected by which variant they received. Thompson sampling directly minimizes the harm done while collecting the evidence needed to act.


The Deeper Point

The p-value answers: “if the null hypothesis were true, how surprising would this data be?” The expected loss answers: “given what we currently know, how costly is each decision?” For product teams making real decisions under uncertainty, the second question is almost always the right one. The math, conjugate updating, Monte Carlo loss estimation, is simple. The operational benefit is real: faster decisions, lower false positive rates, and an explicit accounting of the cost of uncertainty that aligns statistics with the goal of treating users well.