Free Tool

A/B Test Sample Size & Duration Calculator

Find how many visitors — and how many days — you need to detect a real difference, and understand exactly what the result means. No more ending tests on noise.

Your form’s current completion (or conversion) rate.

Including control.

Total across all variants.

+ Advanced settings

95% confidence and 80% power are the standard defaults — leave them unless you have a specific reason.

Sample size required

visitors per variant

Run real A/B tests on your forms →

Ovoform splits traffic and tells you the winner — with real statistics.

What this means

    Duration vs. detectable improvement

    The longer you run, the smaller an improvement you can reliably detect. Based on your current rate and traffic:

    Run for Smallest detectable improvement
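The trade-off between run length and detectable improvement can be sketched in a few lines. This is an illustrative approximation, not the calculator's actual code: it assumes a two-sided test at 95% confidence and 80% power, an unpooled-variance sample size formula, and traffic split evenly across variants. The names `required_n` and `smallest_detectable_lift` are hypothetical.

```python
from statistics import NormalDist


def required_n(baseline, mde, alpha=0.05, power=0.80):
    """Visitors per variant needed to detect a relative lift of `mde` (not rounded)."""
    p1, p2 = baseline, baseline * (1 + mde)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2


def smallest_detectable_lift(baseline, daily_visitors, days, variants=2):
    """Smallest relative lift detectable after `days`, or None if no lift is.

    `daily_visitors` is assumed to be the total across all variants.
    """
    n_available = daily_visitors * days / variants
    lo, hi = 1e-6, (1 - baseline) / baseline  # hi pushes the target rate to 100%
    if required_n(baseline, hi) > n_available:
        return None
    # required_n falls as the lift grows, so bisection finds the crossover point.
    for _ in range(60):
        mid = (lo + hi) / 2
        if required_n(baseline, mid) > n_available:
            lo = mid
        else:
            hi = mid
    return hi
```

Calling `smallest_detectable_lift(0.30, 200, 7)` and then again with `days=28` shows the detectable lift shrinking as the test runs longer.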

    What a healthy test looks like over time

    Day 1 — Not enough data

    Early numbers swing wildly. Any “winner” here is almost certainly noise. Don’t look at the result yet.

    Day 5 — Gathering evidence

    A trend appears but the sample is still partial and a full weekly cycle isn’t covered. Keep waiting.

    Day 10 — Reliable result

    Sample size reached and at least one full week covered. Now the result is trustworthy enough to act on.

    How this works

    What is a conversion rate?

    The share of visitors who complete the action you care about — finishing a form, signing up, buying. It’s your starting point, or “baseline.”

    What is MDE?

    The Minimum Detectable Effect is the smallest relative lift you want to be sure you can catch. Catching a tiny change takes a lot of traffic; catching a big one takes far less.

    What are confidence and power?

    Confidence (95%) limits false positives: declaring a winner that isn’t one. Power (80%) limits false negatives: missing a real winner. Raising either makes the test stricter, so it needs more visitors.

    How is sample size calculated?

    From a two-proportion power formula using your baseline rate, target rate (baseline × (1 + MDE)), and the z-values for your confidence and power. It returns the visitors-per-variant needed to detect that difference if it’s real.
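A minimal sketch of such a calculation is shown below. The page does not state which variant of the two-proportion formula the calculator uses, so this assumes a two-sided test with unpooled variance. The function name `sample_size_per_variant` is hypothetical, and results may differ slightly from the calculator's.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, mde, confidence=0.95, power=0.80):
    """Visitors per variant needed to detect a relative lift of `mde`.

    baseline: current conversion rate as a fraction (0.30 for 30%)
    mde:      minimum detectable effect as a relative lift (0.20 for 20%)
    """
    p1 = baseline
    p2 = baseline * (1 + mde)  # target rate
    if not (0 < p1 < 1 and 0 < p2 < 1) or p1 == p2:
        raise ValueError("rates must lie strictly between 0 and 1 and differ")

    # z-values for the chosen confidence (two-sided) and power
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)

    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)
```

Under these assumptions, a 30% baseline with a 20% MDE (a target of 36%) needs roughly 960 visitors per variant. Halving the MDE to 10% pushes the requirement to several thousand.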

    Why sample size matters

    Too small, and your test is a coin flip dressed up as data. Committing to the right sample size up front — and waiting for it — is the difference between decisions you can trust and confident guesses.

    Inside Ovoform you don’t do this by hand: every test shows a live readiness gate and a Bayesian “probability to be best,” so you know exactly when a result is real.

    Frequently Asked Questions

    A result is statistically significant when a difference as large as the one you measured would be unlikely to appear through random chance alone if the variants truly performed the same. At 95% confidence, a test of two identical variants will produce a false “winner” only about 5% of the time. It is what separates a genuine improvement from noise.
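As an illustration, significance for two conversion rates is often checked with a two-proportion z-test. The sketch below uses a two-sided test with pooled variance. That choice and the function name `two_proportion_p_value` are assumptions made for the example, not a description of how any particular tool computes its result.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates.

    conv_a, conv_b: number of conversions in each variant
    n_a, n_b:       number of visitors in each variant
    """
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (rate_b - rate_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))
```

A p-value below 0.05 corresponds to significance at 95% confidence. For example, 300 versus 360 conversions out of 1,000 visitors each falls well below that threshold, while 300 versus 310 does not.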

    Confidence level is how sure you want to be before declaring a winner. 95% (the standard default) means you accept a 5% risk of a false positive — calling a winner that is not really better. Higher confidence (99%) is stricter and needs more visitors.

    Power is the chance your test detects a real difference when one truly exists. 80% power (the standard default) means an 80% chance of catching a genuine improvement. Higher power (90%) misses fewer real winners but requires more visitors.

    MDE is the smallest relative improvement you want to be able to detect. A 20% MDE on a 30% conversion rate means detecting a lift to 36%. Smaller improvements are harder to see, so they need far more traffic; larger improvements need much less.

    How many visitors you need depends on your current conversion rate and your MDE. Lower rates and smaller target improvements need more visitors. Enter your numbers in the calculator above for the exact sample size per variant.

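    Stopping a test early because it already looks significant is generally a mistake. “Peeking” (checking results repeatedly and stopping the moment they look significant) substantially inflates false positives, because every extra look is another chance for noise to cross the threshold. Decide on your sample size up front and wait until you reach it (and a full weekly cycle) before deciding.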
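The effect of peeking can be seen in a small simulation of A/A tests, where both variants share the same true rate and every “winner” is by construction a false positive. This is a rough sketch under assumed settings (10 looks, 100 visitors per variant between looks, a pooled two-sided z-test at 95% confidence). The function names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided threshold for 95% confidence


def looks_significant(conv_a, conv_b, n):
    """Pooled two-proportion z-test with n visitors in each variant."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    return se > 0 and abs(conv_a - conv_b) / n / se > Z_CRIT


def simulate_aa_tests(rate=0.30, looks=10, visitors_per_look=100,
                      runs=1000, seed=1):
    """Return (false positive rate with peeking, rate with one final check)."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _run in range(runs):
        conv_a = conv_b = n = 0
        flagged_at_any_look = flagged_at_end = False
        for _look in range(looks):
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_look))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_look))
            n += visitors_per_look
            flagged_at_end = looks_significant(conv_a, conv_b, n)
            flagged_at_any_look = flagged_at_any_look or flagged_at_end
        peeking_hits += flagged_at_any_look  # stopped at the first "win"
        fixed_hits += flagged_at_end         # checked once, at the planned end
    return peeking_hits / runs, fixed_hits / runs
```

Calling `simulate_aa_tests()` returns both rates from the same simulated tests. The single final check should land near the nominal 5%, while stopping at the first significant look produces noticeably more false winners.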

    Run until you reach the required sample size AND at least one full week — ideally two — so weekday and weekend behaviour are both represented. Ending on a single strong day is one of the most common ways teams fool themselves.

    The required sample size is set by your conversion rate and MDE (along with your confidence and power settings), not by your traffic. So a low-traffic site needs exactly the same number of visitors; it simply takes more days to accumulate them. That is why duration, not just sample size, matters.
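Turning a sample size into a run length is simple arithmetic. The sketch below assumes the traffic figure is visitors per day in total across all variants, split evenly, and it enforces the one-week minimum described above. The function name `days_to_run` is hypothetical.

```python
from math import ceil


def days_to_run(n_per_variant, variants, daily_visitors, min_days=7):
    """Days needed to reach the sample size, never less than one full week.

    daily_visitors is assumed to be the total across all variants.
    """
    total_needed = n_per_variant * variants
    days_for_sample = ceil(total_needed / daily_visitors)
    return max(days_for_sample, min_days)
```

With 961 visitors needed per variant and two variants, 200 visitors a day gives a 10-day test, 50 visitors a day gives 39 days, and 2,000 visitors a day still gives 7 days because of the weekly minimum.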

    An underpowered test is unreliable. It can miss real improvements (false negatives) and produce noisy results that swing between “significant” and “not significant.” A winner declared by an underpowered test is more likely to be a chance result than one from a properly sized test, and its measured lift tends to be exaggerated.
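To see how underpowered a test is, the power calculation can be run in reverse: given the visitors you actually have, estimate the chance of detecting the lift you care about. This is a sketch using a normal approximation with unpooled variance and a two-sided 95% threshold. It ignores the negligible chance of a significant result in the wrong direction, and the name `achieved_power` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def achieved_power(baseline, mde, n_per_variant, alpha=0.05):
    """Approximate chance of detecting a relative lift of `mde` with n visitors."""
    p1 = baseline
    p2 = baseline * (1 + mde)
    se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variant)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p2 - p1) / se - z_alpha)
```

For a 30% baseline and a 20% MDE, about 960 visitors per variant gives roughly 80% power under these assumptions. With only 200 visitors per variant the same test has roughly a 25% chance of catching the lift, so most real winners would be missed.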