Sample Size Mistakes That Invalidate Your A/B Tests
Common errors in calculating sample size and how to avoid them for reliable, actionable test results. Don't waste your traffic on invalid tests.
You spend weeks setting up an A/B test. You wait for results. You celebrate a "winner." Then you implement it... and conversions don't improve—or worse, they drop.
What went wrong? Usually, sample size.
Sample size mistakes are the silent killers of A/B testing programs. They lead to false positives, wasted traffic, and misguided decisions. Here are the most common mistakes and how to avoid them.
Mistake #1: Not Calculating Sample Size Before Testing
The mistake: Launching a test without knowing how much traffic you need, then stopping when you "feel" you have enough data.
Why it's wrong: You can't determine if a test is conclusive by gut feeling. You might stop too early (false positives) or run too long (wasted time).
The fix: Always calculate required sample size before launching. You need to know:
- Your baseline conversion rate
- Minimum detectable effect (MDE) you care about
- Desired statistical power (typically 80%)
- Significance level (typically 5%, i.e. a 95% confidence level)
Use our sample size calculator to determine exactly how many visitors you need per variant.
Example Calculation
- Baseline conversion rate: 5%
- Minimum lift you care about: 10% relative (5% → 5.5%)
- Power: 80%
- Significance level: 5% (95% confidence)
- Required sample size: ~31,000 visitors per variant (see the sketch below)
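You can sanity-check this number in a few lines of Python. A minimal sketch, assuming statsmodels (the library choice is an assumption; any standard two-proportion calculator lands in the same ballpark):

```python
# Per-variant sample size for a two-sided, two-proportion z-test.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05                       # current conversion rate
target = baseline * 1.10              # 10% relative lift: 5% -> 5.5%

effect = proportion_effectsize(target, baseline)   # Cohen's h
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,                       # 5% significance level (95% confidence)
    power=0.80,                       # 80% power
    ratio=1.0,                        # equal traffic split
    alternative="two-sided",
)
print(f"~{n_per_variant:,.0f} visitors per variant")   # roughly 31,000
```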
Mistake #2: Setting Unrealistic MDE (Minimum Detectable Effect)
The mistake: Expecting to detect tiny lifts like 1-2% with limited traffic.
Why it's wrong: Detecting small effects requires massive sample sizes. Because required sample size scales with 1/MDE², a test looking for a 1% lift needs roughly 100x more traffic than one looking for a 10% lift.
The reality check:
- Low-traffic site (<10k visitors/month): Look for 20-30% lifts
- Medium-traffic site (10-100k visitors/month): Look for 10-15% lifts
- High-traffic site (>100k visitors/month): Can detect 5-10% lifts
The fix: Be realistic about what you can detect given your traffic. Focus on high-impact changes that could reasonably produce 10%+ lifts.
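Because of that 1/MDE² scaling, halving the lift you want to detect roughly quadruples the traffic you need. A quick sketch of the scaling, assuming the same 5% baseline and statsmodels setup as above:

```python
# How per-variant sample size grows as the minimum detectable effect shrinks.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05                                   # assumed 5% baseline conversion rate
for mde in (0.30, 0.20, 0.10, 0.05, 0.01):        # relative lifts
    effect = proportion_effectsize(baseline * (1 + mde), baseline)
    n = NormalIndPower().solve_power(
        effect_size=effect, alpha=0.05, power=0.80, alternative="two-sided"
    )
    print(f"{mde:>4.0%} relative lift -> ~{n:,.0f} visitors per variant")
# A 1% lift at this baseline needs roughly 100x the traffic of a 10% lift.
```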
Mistake #3: Ignoring Statistical Power
The mistake: Only considering significance level (alpha) without thinking about power (1 - beta).
Why it's wrong: Low power means you'll miss real effects (false negatives). You'll conclude "no difference" when there actually is one.
What power means:
- 80% power: If a real effect exists, you have an 80% chance of detecting it
- 50% power: You're basically flipping a coin—might as well not test
The fix: Always use at least 80% power in your sample size calculations. Never compromise on power to run tests faster.
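Power has a concrete meaning you can check by simulation: assume the lift is real, replay the experiment many times, and count how often the test detects it. A rough sketch with illustrative assumptions (5% baseline, a genuine 10% relative lift, an underpowered test vs. a properly sized one):

```python
# Empirical power: how often a real 10% relative lift is detected at p < 0.05.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(7)
p_control, p_variant = 0.05, 0.055            # the lift genuinely exists
for n in (5_000, 31_000):                     # visitors per variant
    runs = 2_000
    detections = 0
    for _ in range(runs):
        conv_a = rng.binomial(n, p_control)   # control conversions
        conv_b = rng.binomial(n, p_variant)   # variant conversions
        _, p_value = proportions_ztest([conv_b, conv_a], [n, n])
        detections += int(p_value < 0.05)
    print(f"n={n:,} per variant -> lift detected {detections / runs:.0%} of the time")
# Expect roughly 20% detection at 5,000 per variant and about 80% at 31,000.
```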
Mistake #4: Stopping at "Statistical Significance"
The mistake: Monitoring your test and stopping immediately when p < 0.05, regardless of whether you've reached your planned sample size.
Why it's wrong: This is called "peeking" or "p-hacking." Checking results repeatedly and stopping at the first sign of significance dramatically inflates your false positive rate from 5% to 20-30%.
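An A/A simulation makes the inflation concrete: both arms are identical, yet stopping at the first weekly check that shows p < 0.05 declares far more than 5% false winners. The traffic level and number of peeks below are illustrative assumptions:

```python
# Peeking inflates false positives: A/A test (no real difference), checked weekly.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(42)
p = 0.05                        # same true conversion rate in both arms
weekly = 2_000                  # assumed visitors per variant per week
weeks = 10                      # number of interim looks
runs = 2_000
false_winners = 0

for _ in range(runs):
    conversions = np.zeros(2, dtype=int)
    visitors = np.zeros(2, dtype=int)
    for _ in range(weeks):
        conversions += rng.binomial(weekly, p, size=2)
        visitors += weekly
        _, p_value = proportions_ztest(conversions, visitors)
        if p_value < 0.05:      # stop at the first "significant" peek
            false_winners += 1
            break

print(f"False positive rate with weekly peeking: {false_winners / runs:.0%}")
# With a single look at the planned sample size this would be ~5%;
# with ten peeks it typically lands in the 15-20% range.
```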
The fix: Either:
- Fixed-horizon testing: Decide sample size upfront and don't look until you reach it
- Sequential testing: Use proper sequential sampling methodology that accounts for continuous monitoring
Mistake #5: Not Accounting for Weekday/Weekend Variation
The mistake: Stopping a test mid-week, or running it for a number of days that doesn't cover complete weekly cycles.
Why it's wrong: User behavior often differs dramatically between weekdays and weekends. Stopping mid-cycle can bias results.
The fix: Always run tests for full weeks (multiples of 7 days). If your business has monthly patterns, run for full months.
Mistake #6: Testing Too Many Variants at Once
The mistake: Running an A/B/C/D/E test with 5 variants simultaneously.
Why it's wrong: More variants = traffic split more ways = each variant takes longer to reach its required sample. With the same per-variant requirement, a 3-variant test needs ~1.5x the total traffic (and therefore time) of a 2-variant test, and more still if you correct for multiple comparisons.
The math (using the ~31,000-per-variant figure from the earlier example; see the sketch below):
- 2 variants (A/B): ~62,000 visitors total
- 3 variants (A/B/C): ~93,000 visitors total
- 5 variants (A/B/C/D/E): ~155,000 visitors total
The fix: Stick to A/B tests (2 variants) unless you have massive traffic. If you must test multiple variations, use a multi-armed bandit algorithm instead.
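A quick back-of-the-envelope sketch of how total traffic and duration grow with the number of variants, using the ~31,000-per-variant figure from the earlier example and an assumed 20,000 qualified visitors per week:

```python
# Total traffic and duration as the number of variants grows.
n_per_variant = 31_000       # from the earlier example (5% baseline, 10% relative MDE)
weekly_traffic = 20_000      # assumed qualified visitors per week, split across variants

for variants in (2, 3, 5):
    total = variants * n_per_variant
    weeks = total / weekly_traffic
    print(f"{variants} variants: ~{total:,} visitors total, ~{weeks:.1f} weeks")
# Correcting for multiple comparisons (e.g. Bonferroni) pushes these numbers up further.
```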
Mistake #7: Ignoring Traffic Quality Differences
The mistake: Calculating sample size based on total traffic without accounting for traffic quality.
Why it's wrong: Not all traffic is equal. If you're testing a checkout change, only users who reach checkout matter. If 80% of traffic bounces before checkout, you need 5x more total traffic.
The fix: Calculate sample size based on qualified traffic that will actually see your test.
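The adjustment is simple division: qualified sample needed ÷ fraction of raw traffic that reaches the tested step. A tiny sketch, reusing the earlier per-variant figure purely for illustration:

```python
# Translate the required *qualified* sample into required *raw* traffic.
n_qualified_needed = 2 * 31_000    # both variants, from the earlier example
reach_rate = 0.20                  # assumed: only 20% of visitors reach checkout

raw_traffic_needed = n_qualified_needed / reach_rate
print(f"~{raw_traffic_needed:,.0f} raw visitors needed")   # 5x the qualified sample
```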
Real-World Example: A Mistake That Cost $50K
The scenario: E-commerce site testing a new product page design.
- Baseline conversion rate: 3%
- Wanted to detect: 5% lift (3% → 3.15%)
- Actual traffic: 5,000 visitors/week
- Test duration: 2 weeks
- Total sample: 10,000 visitors (5,000 per variant)
What they did: After 2 weeks, they saw variant B performing 8% better with p = 0.04. They declared a winner and implemented B site-wide.
What went wrong: To detect a 5% relative lift with 80% power, they needed roughly 200,000 visitors per variant, about 40x more than they ran. With only 5,000 visitors per variant, the test had single-digit power. Their "significant" result was noise.
The result: After full implementation, conversions dropped 3%. They spent $50K in lost revenue and 3 months rolling back changes.
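Both numbers can be reproduced with the same statsmodels setup used earlier (a sketch assuming a standard two-proportion z-test; exact figures vary slightly by calculator):

```python
# The $50K example: required sample size vs. the power they actually had.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline, target = 0.03, 0.0315          # 3% -> 3.15%, a 5% relative lift
effect = proportion_effectsize(target, baseline)
analysis = NormalIndPower()

required = analysis.solve_power(effect_size=effect, alpha=0.05, power=0.80,
                                alternative="two-sided")
achieved = analysis.power(effect_size=effect, nobs1=5_000, alpha=0.05,
                          alternative="two-sided")
print(f"Required per variant: ~{required:,.0f}")          # roughly 200,000
print(f"Power with 5,000 per variant: {achieved:.0%}")    # single digits
```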
How to Calculate Sample Size Correctly
Follow this process every time (a worked sketch follows the list):
- Define your metric: What are you measuring? (conversion rate, revenue per visitor, etc.)
- Determine baseline: What's your current performance?
- Set MDE: What's the minimum lift that matters to your business?
- Choose power and significance: Typically 80% power, 95% confidence
- Calculate sample size: Use our sample size calculator
- Estimate duration: Total sample size (all variants) ÷ weekly qualified traffic = weeks needed
- Decide if viable: If duration >8 weeks, consider testing something with higher impact
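Putting the steps together, here is a minimal end-to-end sketch; the helper function and its parameter names are illustrative, not part of any specific tool:

```python
# End-to-end: from baseline, MDE, and weekly qualified traffic to test duration.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

def weeks_needed(baseline, mde_relative, weekly_qualified_traffic,
                 alpha=0.05, power=0.80, variants=2):
    """Estimate test duration in weeks (illustrative helper, not a library API)."""
    effect = proportion_effectsize(baseline * (1 + mde_relative), baseline)
    n_per_variant = NormalIndPower().solve_power(
        effect_size=effect, alpha=alpha, power=power, alternative="two-sided"
    )
    return n_per_variant * variants / weekly_qualified_traffic

weeks = weeks_needed(baseline=0.05, mde_relative=0.10,
                     weekly_qualified_traffic=20_000)
print(f"~{weeks:.1f} weeks")   # if this comes out over ~8 weeks, rethink the test
```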
Key Takeaways
- Always calculate sample size before launching tests
- Be realistic about minimum detectable effect given your traffic
- Use 80% power minimum—never compromise on power
- Don't peek at results unless using sequential testing methodology
- Run tests for full weeks/months to account for cyclical patterns
- Stick to A/B (2 variants) unless you have massive traffic
- Calculate based on qualified traffic that actually sees your test
Get Expert Testing Guidance
Wise Uplift designs statistically rigorous testing programs that avoid these pitfalls and deliver reliable results.
Get a Proposal