Sample Size Mistakes That Invalidate Your A/B Tests
Common errors in calculating sample size and how to avoid them for reliable, actionable test results. Don't waste your traffic on invalid tests.
You spend weeks setting up an A/B test. You wait for results. You celebrate a "winner." Then you implement it... and conversions don't improve—or worse, they drop.
What went wrong? Usually, sample size.
Sample size mistakes are the silent killers of A/B testing programs. They lead to false positives, wasted traffic, and misguided decisions. Here are the most common mistakes and how to avoid them.
Mistake #1: Not Calculating Sample Size Before Testing
The mistake: Launching a test without knowing how much traffic you need, then stopping when you "feel" you have enough data.
Why it's wrong: You can't determine if a test is conclusive by gut feeling. You might stop too early (false positives) or run too long (wasted time).
The fix: Always calculate required sample size before launching. You need to know:
- Your baseline conversion rate
- Minimum detectable effect (MDE) you care about
- Desired statistical power (typically 80%)
- Significance level (typically 5%, i.e. a 95% confidence level)
Use our sample size calculator to determine exactly how many visitors you need per variant.
Example Calculation
- Baseline conversion rate: 5%
- Minimum lift you care about: 10% relative (5% → 5.5%)
- Power: 80%
- Significance level: 5% (95% confidence)
- Required sample size: ~31,000 visitors per variant (see the sketch below)
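You can sanity-check this number in a few lines of Python. A minimal sketch, assuming statsmodels (the library choice is an assumption; any standard two-proportion calculator lands in the same ballpark):

```python
# Per-variant sample size for a two-sided, two-proportion z-test.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05                       # current conversion rate
target = baseline * 1.10              # 10% relative lift: 5% -> 5.5%

effect = proportion_effectsize(target, baseline)   # Cohen's h
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,                       # 5% significance level (95% confidence)
    power=0.80,                       # 80% power
    ratio=1.0,                        # equal traffic split
    alternative="two-sided",
)
print(f"~{n_per_variant:,.0f} visitors per variant")   # roughly 31,000
```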
Mistake #2: Setting Unrealistic MDE (Minimum Detectable Effect)
The mistake: Expecting to detect tiny lifts like 1-2% with limited traffic.
Why it's wrong: Detecting small effects requires massive sample sizes. Because required sample size scales with 1/MDE², a test looking for a 1% lift needs roughly 100x more traffic than one looking for a 10% lift.
The reality check:
- Low-traffic site (<10k visitors/month): Look for 20-30% lifts
- Medium-traffic site (10-100k visitors/month): Look for 10-15% lifts
- High-traffic site (>100k visitors/month): Can detect 5-10% lifts
The fix: Be realistic about what you can detect given your traffic. Focus on high-impact changes that could reasonably produce 10%+ lifts.
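Because of that 1/MDE² scaling, halving the lift you want to detect roughly quadruples the traffic you need. A quick sketch of the scaling, assuming the same 5% baseline and statsmodels setup as above:

```python
# How per-variant sample size grows as the minimum detectable effect shrinks.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05                                   # assumed 5% baseline conversion rate
for mde in (0.30, 0.20, 0.10, 0.05, 0.01):        # relative lifts
    effect = proportion_effectsize(baseline * (1 + mde), baseline)
    n = NormalIndPower().solve_power(
        effect_size=effect, alpha=0.05, power=0.80, alternative="two-sided"
    )
    print(f"{mde:>4.0%} relative lift -> ~{n:,.0f} visitors per variant")
# A 1% lift at this baseline needs roughly 100x the traffic of a 10% lift.
```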
Mistake #3: Ignoring Statistical Power
The mistake: Only considering significance level (alpha) without thinking about power (1 - beta).
Why it's wrong: Low power means you'll miss real effects (false negatives). You'll conclude "no difference" when there actually is one.
What power means:
- 80% power: If a real effect exists, you have an 80% chance of detecting it
- 50% power: You're basically flipping a coin—might as well not test
The fix: Always use at least 80% power in your sample size calculations. Never compromise on power to run tests faster.
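Power has a concrete meaning you can check by simulation: assume the lift is real, replay the experiment many times, and count how often the test detects it. A rough sketch with illustrative assumptions (5% baseline, a genuine 10% relative lift, an underpowered test vs. a properly sized one):

```python
# Empirical power: how often a real 10% relative lift is detected at p < 0.05.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(7)
p_control, p_variant = 0.05, 0.055            # the lift genuinely exists
for n in (5_000, 31_000):                     # visitors per variant
    runs = 2_000
    detections = 0
    for _ in range(runs):
        conv_a = rng.binomial(n, p_control)   # control conversions
        conv_b = rng.binomial(n, p_variant)   # variant conversions
        _, p_value = proportions_ztest([conv_b, conv_a], [n, n])
        detections += int(p_value < 0.05)
    print(f"n={n:,} per variant -> lift detected {detections / runs:.0%} of the time")
# Expect roughly 20% detection at 5,000 per variant and about 80% at 31,000.
```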
Mistake #4: Stopping at "Statistical Significance"
The mistake: Monitoring your test and stopping immediately when p < 0.05, regardless of whether you've reached your planned sample size.
Why it's wrong: This is called "peeking" or "p-hacking." Checking results repeatedly and stopping at the first sign of significance dramatically inflates your false positive rate from 5% to 20-30%.
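An A/A simulation makes the inflation concrete: both arms are identical, yet stopping at the first weekly check that shows p < 0.05 declares far more than 5% false winners. The traffic level and number of peeks below are illustrative assumptions:

```python
# Peeking inflates false positives: A/A test (no real difference), checked weekly.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(42)
p = 0.05                        # same true conversion rate in both arms
weekly = 2_000                  # assumed visitors per variant per week
weeks = 10                      # number of interim looks
runs = 2_000
false_winners = 0

for _ in range(runs):
    conversions = np.zeros(2, dtype=int)
    visitors = np.zeros(2, dtype=int)
    for _ in range(weeks):
        conversions += rng.binomial(weekly, p, size=2)
        visitors += weekly
        _, p_value = proportions_ztest(conversions, visitors)
        if p_value < 0.05:      # stop at the first "significant" peek
            false_winners += 1
            break

print(f"False positive rate with weekly peeking: {false_winners / runs:.0%}")
# With a single look at the planned sample size this would be ~5%;
# with ten peeks it typically lands in the 15-20% range.
```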
The fix: Either:
- Fixed-horizon testing: Decide sample size upfront and don't look until you reach it
- Sequential testing: Use proper sequential sampling methodology that accounts for continuous monitoring
Mistake #5: Not Accounting for Weekday/Weekend Variation
The mistake: Stopping a test mid-week, or running it for a number of days that doesn't cover complete weekly cycles.
Why it's wrong: User behavior often differs dramatically between weekdays and weekends. Stopping mid-cycle can bias results.
The fix: Always run tests for full weeks (multiples of 7 days). If your business has monthly patterns, run for full months.
Mistake #6: Testing Too Many Variants at Once
The mistake: Running an A/B/C/D/E test with 5 variants simultaneously.
Why it's wrong: More variants = traffic split more ways = each variant takes longer to reach its required sample. With the same per-variant requirement, a 3-variant test needs ~1.5x the total traffic (and therefore time) of a 2-variant test, and more still if you correct for multiple comparisons.
The math (using the ~31,000-per-variant figure from the earlier example; see the sketch below):
- 2 variants (A/B): ~62,000 visitors total
- 3 variants (A/B/C): ~93,000 visitors total
- 5 variants (A/B/C/D/E): ~155,000 visitors total
The fix: Stick to A/B tests (2 variants) unless you have massive traffic. If you must test multiple variations, use a multi-armed bandit algorithm instead.
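A quick back-of-the-envelope sketch of how total traffic and duration grow with the number of variants, using the ~31,000-per-variant figure from the earlier example and an assumed 20,000 qualified visitors per week:

```python
# Total traffic and duration as the number of variants grows.
n_per_variant = 31_000       # from the earlier example (5% baseline, 10% relative MDE)
weekly_traffic = 20_000      # assumed qualified visitors per week, split across variants

for variants in (2, 3, 5):
    total = variants * n_per_variant
    weeks = total / weekly_traffic
    print(f"{variants} variants: ~{total:,} visitors total, ~{weeks:.1f} weeks")
# Correcting for multiple comparisons (e.g. Bonferroni) pushes these numbers up further.
```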
Mistake #7: Ignoring Traffic Quality Differences
The mistake: Calculating sample size based on total traffic without accounting for traffic quality.
Why it's wrong: Not all traffic is equal. If you're testing a checkout change, only users who reach checkout matter. If 80% of traffic bounces before checkout, you need 5x more total traffic.
The fix: Calculate sample size based on qualified traffic that will actually see your test.
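The adjustment is simple division: qualified sample needed ÷ fraction of raw traffic that reaches the tested step. A tiny sketch, reusing the earlier per-variant figure purely for illustration:

```python
# Translate the required *qualified* sample into required *raw* traffic.
n_qualified_needed = 2 * 31_000    # both variants, from the earlier example
reach_rate = 0.20                  # assumed: only 20% of visitors reach checkout

raw_traffic_needed = n_qualified_needed / reach_rate
print(f"~{raw_traffic_needed:,.0f} raw visitors needed")   # 5x the qualified sample
```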
Real-World Example: A Mistake That Cost $50K
The scenario: E-commerce site testing a new product page design.
- Baseline conversion rate: 3%
- Wanted to detect: 5% lift (3% → 3.15%)
- Actual traffic: 5,000 visitors/week
- Test duration: 2 weeks
- Total sample: 10,000 visitors (5,000 per variant)
What they did: After 2 weeks, they saw variant B performing 8% better with p = 0.04. They declared a winner and implemented B site-wide.
What went wrong: To detect a 5% relative lift with 80% power, they needed roughly 200,000 visitors per variant, about 40x more than they ran. With only 5,000 visitors per variant, the test had single-digit power. Their "significant" result was noise.
The result: After full implementation, conversions dropped 3%. They spent $50K in lost revenue and 3 months rolling back changes.
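Both numbers can be reproduced with the same statsmodels setup used earlier (a sketch assuming a standard two-proportion z-test; exact figures vary slightly by calculator):

```python
# The $50K example: required sample size vs. the power they actually had.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline, target = 0.03, 0.0315          # 3% -> 3.15%, a 5% relative lift
effect = proportion_effectsize(target, baseline)
analysis = NormalIndPower()

required = analysis.solve_power(effect_size=effect, alpha=0.05, power=0.80,
                                alternative="two-sided")
achieved = analysis.power(effect_size=effect, nobs1=5_000, alpha=0.05,
                          alternative="two-sided")
print(f"Required per variant: ~{required:,.0f}")          # roughly 200,000
print(f"Power with 5,000 per variant: {achieved:.0%}")    # single digits
```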
How to Calculate Sample Size Correctly
Follow this process every time (a worked sketch follows the list):
- Define your metric: What are you measuring? (conversion rate, revenue per visitor, etc.)
- Determine baseline: What's your current performance?
- Set MDE: What's the minimum lift that matters to your business?
- Choose power and significance: Typically 80% power, 95% confidence
- Calculate sample size: Use our sample size calculator
- Estimate duration: Total sample size (all variants) ÷ weekly qualified traffic = weeks needed
- Decide if viable: If duration >8 weeks, consider testing something with higher impact
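Putting the steps together, here is a minimal end-to-end sketch; the helper function and its parameter names are illustrative, not part of any specific tool:

```python
# End-to-end: from baseline, MDE, and weekly qualified traffic to test duration.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

def weeks_needed(baseline, mde_relative, weekly_qualified_traffic,
                 alpha=0.05, power=0.80, variants=2):
    """Estimate test duration in weeks (illustrative helper, not a library API)."""
    effect = proportion_effectsize(baseline * (1 + mde_relative), baseline)
    n_per_variant = NormalIndPower().solve_power(
        effect_size=effect, alpha=alpha, power=power, alternative="two-sided"
    )
    return n_per_variant * variants / weekly_qualified_traffic

weeks = weeks_needed(baseline=0.05, mde_relative=0.10,
                     weekly_qualified_traffic=20_000)
print(f"~{weeks:.1f} weeks")   # if this comes out over ~8 weeks, rethink the test
```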
Key Takeaways
- Always calculate sample size before launching tests
- Be realistic about minimum detectable effect given your traffic
- Use 80% power minimum—never compromise on power
- Don't peek at results unless using sequential testing methodology
- Run tests for full weeks/months to account for cyclical patterns
- Stick to A/B (2 variants) unless you have massive traffic
- Calculate based on qualified traffic that actually sees your test
Get Expert Testing Guidance
Wise Uplift designs statistically rigorous testing programs that avoid these pitfalls and deliver reliable results.
Get a Proposal