The definitive tool for product managers, data scientists, and growth engineers.
Frequentist (fixed-horizon): Industry standard for final decisions. Requires a fixed sample size determined beforehand.
Bayesian: Best for startups moving fast. Provides an intuitive "probability of B > A".
Sequential: Best when you can't wait. Uses alpha-spending to allow for "safe peeking".
| Confidence | Alpha (Risk) | When to Use | Example Scenario |
|---|---|---|---|
| 90% | 10% | Low-risk UI tweaks | Changing button color, microcopy |
| 95% | 5% | Most decisions | New features, layout changes, onboarding flow |
| 99% | 1% | High-stakes / Critical | Pricing, checkout logic, security features |
A/B testing (also known as split testing) is a randomized experimentation process in which two or more versions of a variable (a web page, page element, etc.) are shown to different segments of website visitors at the same time to determine which version has the greatest impact on business metrics.
The two primary schools of statistical thought in A/B testing are Frequentist and Bayesian. Frequentist statistics (Z-tests, P-values) is the traditional approach. It tests whether the observed difference is "significant" based on the null hypothesis that there is no difference. It requires a fixed sample size to be decided upfront.
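As a sketch of the frequentist approach, here is a minimal two-proportion z-test in pure Python (the function name and the example conversion counts are illustrative, not part of the calculator):

```python
from math import sqrt, erf

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled proportion under the null hypothesis of "no difference"
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value via the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical example: 20% vs 26% conversion on 1,000 visitors each
z, p = two_proportion_z_test(200, 1000, 260, 1000)
```

Note that the p-value is only valid if the sample size was fixed before the test started, as described above.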
Bayesian statistics, on the other hand, provides the probability that a version is better than another. It is more intuitive for decision-making but can be computationally more intensive. It allows for "continuous monitoring" without as high a risk of the "peeking problem" found in frequentist methods.
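The Bayesian "probability that B beats A" can be estimated by Monte Carlo sampling from each variant's posterior. A minimal sketch, assuming uniform Beta(1, 1) priors and hypothetical conversion counts:

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000, seed=42):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each variant: Beta(successes + 1, failures + 1)
        sample_a = rng.betavariate(conv_a + 1, n_a - conv_a + 1)
        sample_b = rng.betavariate(conv_b + 1, n_b - conv_b + 1)
        wins += sample_b > sample_a
    return wins / draws

p_b_better = prob_b_beats_a(200, 1000, 260, 1000)
```

The result reads directly as "the probability that B's true conversion rate is higher than A's", which is the intuitive statement most stakeholders actually want.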
A p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct. An alpha (α) of 0.05 means you are willing to accept a 5% risk of a Type I error: declaring a winner when there actually isn't one (a false positive).
Power (1 - β) is the probability that the test will correctly reject the null hypothesis when there is an actual effect. Standard power is usually 80%. If your sample size is too small, your test may be "underpowered," meaning you might miss a real improvement because you didn't have enough data to see through the noise.
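Alpha, power, and the effect you want to detect together determine the required sample size. A sketch of the standard two-proportion formula, using `statistics.NormalDist` for the normal quantiles (the function name and example numbers are illustrative):

```python
import math
from statistics import NormalDist

def sample_size_per_group(baseline, mde, alpha=0.05, power=0.80):
    """Visitors needed per variant to detect an absolute lift of `mde`
    from `baseline`, with a two-sided test at the given alpha and power."""
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # e.g. 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde ** 2)

# Hypothetical example: detect a 2-point lift from a 20% baseline
n = sample_size_per_group(0.20, 0.02)
```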
One of the most common mistakes in A/B testing is "peeking" at the results and stopping the test as soon as a p-value drops below 0.05. This dramatically increases your false positive rate because p-values naturally fluctuate. Frequentist tests must run until their pre-calculated sample size is reached. If you need to peek, use Sequential Testing or Bayesian methods.
When testing multiple variations (e.g., A vs B vs C vs D), you increase the risk of a false positive through "multiple comparisons." To fix this, we use the Bonferroni Correction, which divides the alpha by the number of comparisons. Our calculator handles this automatically when you add more than two variants.
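The correction itself is a one-liner; a sketch, assuming each variant is compared against the control:

```python
def bonferroni_alpha(alpha, num_variants):
    """Per-comparison alpha after Bonferroni correction: alpha is divided
    by the number of variant-vs-control comparisons."""
    comparisons = num_variants - 1  # each non-control variant vs the control
    return alpha / comparisons

# Testing A vs B vs C vs D: three comparisons against control A
adjusted = bonferroni_alpha(0.05, 4)
```

Each individual comparison must now clear the stricter threshold (here roughly 0.0167) to be declared significant.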
For revenue or average order value (AOV), we use Welch's t-test. This takes into account the mean, standard deviation, and sample size of each group. Since revenue data is often skewed, ensuring a large enough sample size is critical for the Central Limit Theorem to take effect.
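The core of Welch's t-test is its statistic and the Welch-Satterthwaite degrees of freedom. A sketch in pure Python (the tiny sample lists below only illustrate the arithmetic; as noted above, real revenue tests need much larger samples):

```python
from math import sqrt
from statistics import mean, stdev

def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two samples with possibly unequal variances."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Per-group variance of the mean: s^2 / n
    va = stdev(sample_a) ** 2 / n_a
    vb = stdev(sample_b) ** 2 / n_b
    t = (mean(sample_b) - mean(sample_a)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (n_a - 1) + vb ** 2 / (n_b - 1))
    return t, df

# Hypothetical AOV samples (dollars) for control and variant
aov_a = [52.0, 48.0, 50.0, 51.0, 49.0]
aov_b = [55.0, 57.0, 53.0, 56.0, 54.0]
t_stat, dof = welch_t(aov_a, aov_b)
```

The t statistic is then compared against a t distribution with `dof` degrees of freedom to obtain the p-value.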
A 95% confidence interval gives you a range of values that likely contains the true conversion rate difference. If the interval includes 0, the result is not statistically significant. If the entire interval is above 0, you have a winner!
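A sketch of the standard (Wald) interval for the difference in conversion rates, reusing the hypothetical counts from earlier examples:

```python
from math import sqrt
from statistics import NormalDist

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald confidence interval for the conversion-rate difference (B - A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Unpooled standard error of the difference
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

lo, hi = diff_confidence_interval(200, 1000, 260, 1000)
```

Here the whole interval sits above 0, so this hypothetical result would be declared a significant win for B.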
SRM is a critical health check. If you aim for a 50/50 split but get a 40/60 split, your randomization might be broken. This can happen due to bot traffic, redirect issues, or tracking bugs. Always check if the actual visitor split matches the intended split using a Chi-squared test.
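The SRM check described above can be sketched as a one-degree-of-freedom chi-squared goodness-of-fit test (the strict alpha of 0.001 is a common convention for SRM alarms, not a universal rule):

```python
from math import sqrt
from statistics import NormalDist

def srm_check(n_a, n_b, expected_ratio=0.5, alpha=0.001):
    """Chi-squared test (1 df) for sample ratio mismatch between two groups.
    Returns (p_value, mismatch_flag); a True flag means randomization
    looks broken."""
    total = n_a + n_b
    exp_a = total * expected_ratio
    exp_b = total * (1 - expected_ratio)
    chi2 = (n_a - exp_a) ** 2 / exp_a + (n_b - exp_b) ** 2 / exp_b
    # For 1 degree of freedom: P(chi2 > x) = 2 * (1 - Phi(sqrt(x)))
    p_value = 2 * (1 - NormalDist().cdf(sqrt(chi2)))
    return p_value, p_value < alpha

# A 40/60 split on 10,000 visitors when 50/50 was intended
p_srm, is_mismatch = srm_check(4000, 6000)
```

Small wobbles (say 5,012 vs 4,988) pass the check; a 40/60 split on this much traffic fails it decisively.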
MDE is the smallest improvement you care about detecting. Choosing a smaller MDE requires a much larger sample size. Setting a realistic MDE prevents you from wasting weeks chasing tiny, insignificant gains that don't move the needle for the business.
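The cost of a small MDE follows from the sample-size formula: required traffic scales roughly with 1 / MDE². A tiny sketch of that scaling (holding the variance term fixed):

```python
def relative_sample_size(mde_ratio):
    """Approximate traffic multiplier when the MDE is scaled by `mde_ratio`
    (sample size scales with 1 / MDE^2, variance held fixed)."""
    return 1 / mde_ratio ** 2

# Halving the MDE roughly quadruples the required traffic;
# quartering it requires roughly 16x the traffic
half_mde_factor = relative_sample_size(0.5)
quarter_mde_factor = relative_sample_size(0.25)
```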
Typically at least 1-2 full business cycles (usually 7-14 days) to account for day-of-week effects. Even if you reach significance in 2 days, keep running to avoid novelty effects.
In Bayesian testing, a common threshold is when the Probability to Beat Control is >95% AND the Expected Loss (the risk of choosing the variant if it's actually worse) is below a negligible threshold (e.g., <0.1%).
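Expected Loss can be estimated with the same Monte Carlo machinery as the probability to beat control: average how much conversion rate you would give up in the worlds where the variant is actually worse. A sketch, assuming Beta(1, 1) priors and hypothetical counts:

```python
import random

def expected_loss_choosing_b(conv_a, n_a, conv_b, n_b,
                             draws=100_000, seed=7):
    """Monte Carlo expected loss (in absolute conversion-rate terms)
    of shipping variant B, under Beta(1, 1) priors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        sample_a = rng.betavariate(conv_a + 1, n_a - conv_a + 1)
        sample_b = rng.betavariate(conv_b + 1, n_b - conv_b + 1)
        total += max(sample_a - sample_b, 0.0)  # loss only if B is worse
    return total / draws

loss = expected_loss_choosing_b(200, 1000, 260, 1000)
```

For a clear winner like this hypothetical example, the expected loss lands far below the 0.1% threshold, so shipping B is a low-risk call even if the posterior leaves some residual chance that A is better.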
This is often due to "Regression to the Mean" or the "Novelty Effect" (users click because it's new, not better). External factors like marketing campaigns or seasonal holidays can also skew results.
Yes, that's Multivariate Testing (MVT). However, MVT requires significantly more traffic because you are testing the interactions between multiple elements. For most sites, sequential A/B tests are more efficient.
A two-tailed test looks for a difference in either direction (better or worse). A one-tailed test only looks for improvement. Rigorous experimenters generally prefer two-tailed tests because they are more conservative.