A/B Test Calculator 📊

The definitive tool for product managers, data scientists, and growth engineers.

Which test should I use?


Frequentist (Z-test)

Industry standard for final decisions. Requires a fixed sample size determined beforehand.

  • Outputs: P-value, Confidence Intervals
  • Rule: p < 0.05 is typically "significant"
  • Wait for the full sample or risk false positives!

Bayesian Analysis

Best for startups moving fast. Provides an intuitive "probability of B > A".

  • Outputs: P(B>A), Expected Loss
  • Flexible: You can look at the results anytime
  • Answers: "How likely is this variant to be better?"

Sequential Testing

Best when you can't wait. Uses alpha-spending to allow for "safe peeking".

  • Avoids the "peeking problem" of frequentist tests
  • Stops early if a massive win is detected
  • Requires slightly larger total sample if no early stop
| Confidence | Alpha (Risk) | When to Use | Example Scenario |
|---|---|---|---|
| 90% | 10% | Low-risk UI tweaks | Changing button color, microcopy |
| 95% | 5% | Most decisions | New features, layout changes, onboarding flow |
| 99% | 1% | High-stakes / critical | Pricing, checkout logic, security features |

Understanding A/B Test Statistics

A/B testing (also known as split testing) is a randomized experimentation process in which two or more versions of a variable (a web page, a page element, etc.) are shown to different segments of visitors at the same time to determine which version has the greatest impact on business metrics.

Frequentist vs. Bayesian Inference

The two primary schools of statistical thought in A/B testing are Frequentist and Bayesian. Frequentist statistics (Z-tests, P-values) is the traditional approach. It tests whether the observed difference is "significant" based on the null hypothesis that there is no difference. It requires a fixed sample size to be decided upfront.

Bayesian statistics, on the other hand, provides the probability that one version is better than another. It is more intuitive for decision-making but can be computationally more intensive. It allows continuous monitoring with far less exposure to the "peeking problem" that affects frequentist methods.
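The Bayesian "probability that B beats A" can be estimated with a short Monte Carlo simulation over Beta posteriors. Here is a stdlib-only Python sketch; the function name and the flat Beta(1, 1) prior are illustrative assumptions, not necessarily what this calculator uses internally:

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000, seed=42):
    """Estimate P(B > A) by sampling conversion rates from each
    group's Beta posterior (flat Beta(1, 1) prior assumed)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws
```

With 100 conversions out of 1,000 for A and 130 out of 1,000 for B, this returns a P(B > A) of roughly 0.98; with identical data for both groups it hovers around 0.5, as expected.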

The P-Value and Alpha

A p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct. An alpha (Ξ±) of 0.05 means you are willing to accept a 5% risk of a Type I errorβ€”declaring a winner when there actually isn't one (a false positive).
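For conversion-rate tests, the p-value described above comes from a pooled two-proportion z-test. A minimal stdlib-only sketch (the function name is illustrative):

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-tailed p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))      # two-tailed
    return z, p_value
```

For example, 100/1000 vs 150/1000 conversions gives z ≈ 3.38 and p ≈ 0.0007, well below the 0.05 threshold.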

Sample Size and Statistical Power

Power (1 - Ξ²) is the probability that the test will correctly reject the null hypothesis when there is an actual effect. Standard power is usually 80%. If your sample size is too small, your test may be "underpowered," meaning you might miss a real improvement because you didn't have enough data to see through the noise.
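The standard per-group sample size for a two-sided two-proportion test follows from alpha and power; a stdlib-only sketch of the textbook formula (function name is illustrative):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p_base, mde_abs, alpha=0.05, power=0.80):
    """Per-group n for a two-sided two-proportion test.
    mde_abs is the absolute lift you want to detect (e.g. 0.02 = 2 points)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_beta = nd.inv_cdf(power)            # ~0.84 for 80% power
    p_alt = p_base + mde_abs
    variance = p_base * (1 - p_base) + p_alt * (1 - p_alt)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)
```

Detecting a 2-point lift from a 10% baseline at 95%/80% needs roughly 3,800 visitors per group; relaxing the MDE to 5 points cuts that dramatically, which is why the MDE choice (discussed below) dominates test duration.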

The Peeking Problem

One of the most common mistakes in A/B testing is "peeking" at the results and stopping the test as soon as a p-value drops below 0.05. This dramatically increases your false positive rate because p-values naturally fluctuate. Frequentist tests must run until their pre-calculated sample size is reached. If you need to peek, use Sequential Testing or Bayesian methods.
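The inflation from peeking is easy to demonstrate by simulation: run many A/A tests (where there is no real difference), check the p-value at regular intervals, and stop at the first "significant" result. This stdlib-only sketch (parameters and function name are illustrative) yields a false positive rate well above the nominal 5%:

```python
import random
from math import sqrt
from statistics import NormalDist

def peeking_false_positive_rate(n_sims=500, n_max=2000, peek_every=100,
                                p_true=0.10, seed=7):
    """Simulate A/A tests, peeking at the z-test p-value every
    `peek_every` visitors and stopping at the first p < 0.05."""
    rng = random.Random(seed)
    nd = NormalDist()
    false_positives = 0
    for _ in range(n_sims):
        conv_a = conv_b = 0
        for n in range(1, n_max + 1):
            conv_a += rng.random() < p_true
            conv_b += rng.random() < p_true
            if n % peek_every:
                continue                      # only "peek" periodically
            pool = (conv_a + conv_b) / (2 * n)
            if pool in (0.0, 1.0):
                continue
            se = sqrt(pool * (1 - pool) * 2 / n)
            z = ((conv_b - conv_a) / n) / se
            if 2 * (1 - nd.cdf(abs(z))) < 0.05:
                false_positives += 1          # declared a winner that isn't
                break
    return false_positives / n_sims
```

With 20 peeks per test, the simulated false positive rate typically lands in the 10-20% range rather than the 5% a single fixed-horizon test would give.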

Multi-variant Testing (A/B/n)

When testing multiple variations (e.g., A vs B vs C vs D), you increase the risk of a false positive through "multiple comparisons." To fix this, we use the Bonferroni Correction, which divides the alpha by the number of comparisons. Our calculator handles this automatically when you add more than two variants.
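The Bonferroni Correction itself is one line; a small sketch of applying it to a set of variant-vs-control p-values (function name is illustrative):

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Test each variant-vs-control p-value at alpha / (number of comparisons).
    Returns (list of significance flags, the adjusted alpha)."""
    adjusted_alpha = alpha / len(p_values)
    return [p < adjusted_alpha for p in p_values], adjusted_alpha
```

For an A/B/C/D test (three comparisons against control), each p-value must beat 0.05 / 3 ≈ 0.0167, so a raw p of 0.03 no longer counts as significant.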

Standard Error of the Mean (Revenue Testing)

For revenue or average order value (AOV), we use Welch's t-test. This takes into account the mean, standard deviation, and sample size of each group. Since revenue data is often skewed, ensuring a large enough sample size is critical for the Central Limit Theorem to take effect.
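From each group's summary statistics, Welch's test reduces to a t statistic and the Welch-Satterthwaite degrees of freedom. This stdlib-only sketch approximates the two-tailed p-value with the normal CDF, which is reasonable at the large sample sizes typical of revenue tests (an assumption; for small samples use an exact t distribution, e.g. `scipy.stats`):

```python
from math import sqrt
from statistics import NormalDist

def welch_t_test(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic, Welch-Satterthwaite df, and an
    approximate two-tailed p-value (normal approximation)."""
    var_a, var_b = sd_a ** 2 / n_a, sd_b ** 2 / n_b
    t = (mean_b - mean_a) / sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    p_approx = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, df, p_approx
```

For example, AOV $50 (sd $20) vs $52 (sd $22) with 5,000 orders each gives t ≈ 4.76 and a p-value far below 0.001.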

Interpreting Confidence Intervals

A 95% confidence interval gives you a range of values that likely contains the true conversion rate difference. If the interval includes 0, the result is not statistically significant. If the entire interval is above 0, you have a winner!
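The interval for a difference of conversion rates can be computed with a Wald interval, sketched here in stdlib-only Python (the function name is illustrative; other interval constructions exist):

```python
from math import sqrt
from statistics import NormalDist

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald confidence interval for the difference p_b - p_a."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se
```

For 100/1000 vs 150/1000 conversions, the 95% interval is roughly (+2.1, +7.9) percentage points; since the whole interval is above 0, the lift is significant.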

Sample Ratio Mismatch (SRM)

SRM is a critical health check. If you aim for a 50/50 split but get a 40/60 split, your randomization might be broken. This can happen due to bot traffic, redirect issues, or tracking bugs. Always check if the actual visitor split matches the intended split using a Chi-squared test.
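For a two-group split, the Chi-squared goodness-of-fit test has one degree of freedom, so its p-value can be computed from the normal CDF. A stdlib-only sketch (function name and the conventional strict alpha of 0.001 for SRM alerts are assumptions):

```python
from math import sqrt
from statistics import NormalDist

def srm_check(n_a, n_b, expected_ratio=0.5, alpha=0.001):
    """Chi-squared goodness-of-fit test (1 df) for sample ratio mismatch.
    Returns (p-value, True if an SRM is flagged)."""
    total = n_a + n_b
    exp_a = total * expected_ratio
    exp_b = total * (1 - expected_ratio)
    chi2 = (n_a - exp_a) ** 2 / exp_a + (n_b - exp_b) ** 2 / exp_b
    # For 1 df: P(X > chi2) = 2 * (1 - Phi(sqrt(chi2)))
    p_value = 2 * (1 - NormalDist().cdf(sqrt(chi2)))
    return p_value, p_value < alpha
```

A clean 5000/5000 split passes, while an observed 4000/6000 split against an intended 50/50 is flagged immediately, so the test's results should not be trusted.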

Minimum Detectable Effect (MDE)

MDE is the smallest improvement you care about detecting. Choosing a smaller MDE requires a much larger sample size. Setting a realistic MDE prevents you from wasting weeks chasing tiny, insignificant gains that don't move the needle for the business.

Frequently Asked Questions

How long should I run an A/B test?

Typically at least 1-2 full business cycles (usually 7-14 days) to account for day-of-week effects. Even if you reach significance in 2 days, keep running to avoid novelty effects.

What is a "winner" in Bayesian terms?

In Bayesian testing, a common threshold is when the Probability to Beat Control is >95% AND the Expected Loss (the risk of choosing the variant if it's actually worse) is below a negligible threshold (e.g., <0.1%).

Why does my test show significance but then the uplift disappears after launch?

This is often due to "Regression to the Mean" or the "Novelty Effect" (users click because it's new, not better). External factors like marketing campaigns or seasonal holidays can also skew results.

Can I test more than one thing at a time?

Yes, that's Multivariate Testing (MVT). However, MVT requires significantly more traffic because you are testing the interactions between multiple elements. For most sites, sequential A/B tests are more efficient.

What is one-tailed vs two-tailed testing?

A two-tailed test looks for a difference in either direction (better or worse). A one-tailed test only looks for improvement. Most rigorous scientists use two-tailed tests to be conservative.