Testing Pitfalls · 6 min read · Dec 5, 2024

Sample Size Mistakes That Invalidate Your A/B Tests

Common errors in calculating sample size and how to avoid them for reliable, actionable test results. Don't waste your traffic on invalid tests.

You spend weeks setting up an A/B test. You wait for results. You celebrate a "winner." Then you implement it... and conversions don't improve—or worse, they drop.

What went wrong? Usually, sample size.

Sample size mistakes are the silent killers of A/B testing programs. They lead to false positives, wasted traffic, and misguided decisions. Here are the most common mistakes and how to avoid them.

Mistake #1: Not Calculating Sample Size Before Testing

The mistake: Launching a test without knowing how much traffic you need, then stopping when you "feel" you have enough data.

Why it's wrong: You can't determine if a test is conclusive by gut feeling. You might stop too early (false positives) or run too long (wasted time).

The fix: Always calculate required sample size before launching. You need to know:

  • Your baseline conversion rate
  • Minimum detectable effect (MDE) you care about
  • Desired statistical power (typically 80%)
  • Significance level (typically 5%, i.e. 95% confidence)

Use our sample size calculator to determine exactly how many visitors you need per variant.

Example Calculation

  • Baseline conversion rate: 5%
  • Minimum lift you care about: 10% relative (5% → 5.5%)
  • Power: 80%
  • Significance level: 5% (95% confidence)
  • Required sample size: ~31,000 visitors per variant (see the sketch below)
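
If you'd rather script this than click through a calculator, here's a minimal sketch in Python using statsmodels. The numbers are the example above; swap in your own.

```python
# pip install statsmodels
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline = 0.05                  # current conversion rate
target = baseline * 1.10         # 10% relative lift: 5% -> 5.5%

# Cohen's h: a standardized effect size for comparing two proportions
effect_size = proportion_effectsize(target, baseline)

n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,                  # 5% significance level (95% confidence), two-sided
    power=0.80,                  # 80% chance of detecting a real lift
    ratio=1.0,                   # 50/50 traffic split between variants
)
print(f"~{n_per_variant:,.0f} visitors per variant")  # roughly 31,000
```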

Mistake #2: Setting Unrealistic MDE (Minimum Detectable Effect)

The mistake: Expecting to detect tiny lifts like 1-2% with limited traffic.

Why it's wrong: Detecting small effects requires massive sample sizes. Required sample size scales with the inverse square of the effect, so a test looking for a 1% lift needs ~100x more traffic than one looking for a 10% lift.

The reality check:

  • Low-traffic site (<10k visitors/month): Look for 20-30% lifts
  • Medium-traffic site (10-100k): Look for 10-15% lifts
  • High-traffic site (>100k): Can detect 5-10% lifts

The fix: Be realistic about what you can detect given your traffic. Focus on high-impact changes that could reasonably produce 10%+ lifts.
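
To see why those tiers look the way they do, here's a sketch that runs the calculation in reverse: given one month of traffic split 50/50, what's the smallest relative lift you can detect at 80% power? The 5% baseline is an assumption; plug in your own.

```python
import numpy as np
from statsmodels.stats.power import NormalIndPower

baseline = 0.05                  # assumed baseline conversion rate

for monthly_visitors in (10_000, 100_000, 1_000_000):
    n_per_variant = monthly_visitors / 2   # 50/50 split, one-month test
    # solve for the detectable effect size (Cohen's h), then invert it
    h = NormalIndPower().solve_power(nobs1=n_per_variant, alpha=0.05, power=0.80)
    target = np.sin(np.arcsin(np.sqrt(baseline)) + h / 2) ** 2
    lift = (target - baseline) / baseline
    print(f"{monthly_visitors:>9,}/month -> smallest detectable lift ~{lift:.0%}")
```

Each 10x increase in traffic shrinks the detectable lift by roughly √10 ≈ 3.2x, which is exactly the pattern in the tiers above.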

Mistake #3: Ignoring Statistical Power

The mistake: Only considering significance level (alpha) without thinking about power (1 - beta).

Why it's wrong: Low power means you'll miss real effects (false negatives). You'll conclude "no difference" when there actually is one.

What power means:

  • 80% power: If a real effect exists, you have an 80% chance of detecting it
  • 50% power: You're basically flipping a coin—might as well not test

The fix: Always use at least 80% power in your sample size calculations. Never compromise on power to run tests faster.
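
A small simulation makes this concrete. Using the 5% → 5.5% example from Mistake #1 (assumed numbers), it counts how often a two-proportion z-test actually catches the real lift at two different sample sizes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p_a, p_b = 0.05, 0.055           # a real 10% relative lift exists

def detection_rate(n, sims=2000):
    hits = 0
    for _ in range(sims):
        conv_a = rng.binomial(n, p_a)
        conv_b = rng.binomial(n, p_b)
        pooled = (conv_a + conv_b) / (2 * n)
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        z = (conv_b - conv_a) / n / se
        if 2 * stats.norm.sf(abs(z)) < 0.05:   # two-sided p < 0.05
            hits += 1
    return hits / sims

print(detection_rate(31_000))    # ~0.80: adequately powered
print(detection_rate(8_000))     # ~0.30: misses the real lift most of the time
```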

Mistake #4: Stopping at "Statistical Significance"

The mistake: Monitoring your test and stopping immediately when p < 0.05, regardless of whether you've reached your planned sample size.

Why it's wrong: This is called "peeking" or "p-hacking." Checking results repeatedly and stopping at the first sign of significance dramatically inflates your false positive rate from 5% to 20-30%.

The fix: Use one of two approaches:

  • Fixed-horizon testing: Decide sample size upfront and don't look until you reach it
  • Sequential testing: Use proper sequential sampling methodology that accounts for continuous monitoring
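
A quick simulation sketch shows the damage peeking does. Here A and B are identical by construction, so every "win" is a false positive; we check ten times and stop at the first sign of significance (sample sizes assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n_total, checks, sims = 0.05, 30_000, 10, 1000
false_positives = 0

for _ in range(sims):
    a = rng.random(n_total) < p          # variant A conversions
    b = rng.random(n_total) < p          # variant B: identical, no real effect
    for n in np.linspace(n_total / checks, n_total, checks).astype(int):
        conv_a, conv_b = a[:n].sum(), b[:n].sum()
        pooled = (conv_a + conv_b) / (2 * n)
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        if abs(conv_b - conv_a) / n / se > 1.96:   # "significant" at this peek
            false_positives += 1
            break                        # stop early and declare a winner

print(false_positives / sims)            # roughly 0.15-0.20, not the nominal 0.05
```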

Mistake #5: Not Accounting for Weekday/Weekend Variation

The mistake: Running tests for exactly 7 days or stopping mid-week.

Why it's wrong: User behavior often differs dramatically between weekdays and weekends. Stopping mid-cycle can bias results.

The fix: Always run tests for full weeks (multiples of 7 days). If your business has monthly patterns, run for full months.

Mistake #6: Testing Too Many Variants at Once

The mistake: Running an A/B/C/D/E test with 5 variants simultaneously.

Why it's wrong: More variants = traffic split more ways = each variant takes longer to reach its required sample size. The per-variant requirement stays the same, so total traffic scales with the number of variants: a 3-variant test needs ~1.5x the traffic (and time) of a 2-variant test, and more still once you correct for multiple comparisons.

The math (at ~31,000 visitors per variant, as in the example above):

  • 2 variants (A/B): ~62,000 visitors total
  • 3 variants (A/B/C): ~93,000 visitors total
  • 5 variants (A/B/C/D/E): ~155,000 visitors total

The fix: Stick to A/B tests (2 variants) unless you have massive traffic. If you must test multiple variations, use a multi-armed bandit algorithm instead.

Mistake #7: Ignoring Traffic Quality Differences

The mistake: Calculating sample size based on total traffic without accounting for traffic quality.

Why it's wrong: Not all traffic is equal. If you're testing a checkout change, only users who reach checkout matter. If 80% of traffic bounces before checkout, you need 5x more total traffic.

The fix: Calculate sample size based on qualified traffic that will actually see your test.
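
The arithmetic is simple but easy to skip. A sketch with assumed numbers:

```python
n_per_variant = 31_000        # required sample of *qualified* visitors (from earlier)
variants = 2
checkout_reach_rate = 0.20    # only 20% of all visitors ever reach checkout

total_traffic_needed = n_per_variant * variants / checkout_reach_rate
print(f"{total_traffic_needed:,.0f} total site visitors")   # 310,000
```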

Real-World Example: A Mistake That Cost $50K

The scenario: E-commerce site testing a new product page design.

  • Baseline conversion rate: 3%
  • Wanted to detect: 5% lift (3% → 3.15%)
  • Actual traffic: 5,000 visitors/week
  • Test duration: 2 weeks
  • Total sample: 10,000 visitors (5,000 per variant)

What they did: After 2 weeks, they saw variant B performing 8% better with p = 0.04. They declared a winner and implemented B site-wide.

What went wrong: To detect a 5% lift with 80% power, they needed roughly 208,000 visitors per variant, over 40x more than they ran. With 5,000 visitors per variant, they had only about 6% power. Their "significant" result was noise.

The result: After full implementation, conversions dropped 3%. They spent $50K in lost revenue and 3 months rolling back changes.
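
You can reproduce the power of the test they actually ran in a couple of lines (a sketch, using statsmodels as above):

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

# the test as they ran it: 3% -> 3.15%, 5,000 visitors per variant
effect = proportion_effectsize(0.0315, 0.03)
power = NormalIndPower().power(effect_size=effect, nobs1=5_000, alpha=0.05)
print(f"{power:.0%}")   # single digits: a "significant" result here is almost surely noise
```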

How to Calculate Sample Size Correctly

Follow this process every time:

  1. Define your metric: What are you measuring? (conversion rate, revenue per visitor, etc.)
  2. Determine baseline: What's your current performance?
  3. Set MDE: What's the minimum lift that matters to your business?
  4. Choose power and significance: Typically 80% power, 95% confidence
  5. Calculate sample size: Use our sample size calculator
  6. Estimate duration: Total sample size (all variants) ÷ weekly qualified traffic = weeks needed (see the sketch after this list)
  7. Decide if viable: If duration >8 weeks, consider testing something with higher impact
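
Steps 4 through 7 fit in a few lines. A sketch, with the baseline, MDE, and traffic figures as assumptions to replace with your own:

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline = 0.03                        # step 2: current conversion rate
mde_relative = 0.10                    # step 3: smallest lift that matters
weekly_qualified_traffic = 20_000      # visitors who will actually see the test

target = baseline * (1 + mde_relative)
n_per_variant = NormalIndPower().solve_power(
    effect_size=proportion_effectsize(target, baseline),
    alpha=0.05, power=0.80,            # step 4: standard choices
)
weeks = 2 * n_per_variant / weekly_qualified_traffic   # step 6: both variants
print(f"{n_per_variant:,.0f} per variant, ~{weeks:.1f} weeks")
if weeks > 8:                          # step 7: viability check
    print("Too slow: raise the MDE or test a higher-impact change")
```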

Key Takeaways

  • Always calculate sample size before launching tests
  • Be realistic about minimum detectable effect given your traffic
  • Use 80% power minimum—never compromise on power
  • Don't peek at results unless using sequential testing methodology
  • Run tests for full weeks/months to account for cyclical patterns
  • Stick to A/B (2 variants) unless you have massive traffic
  • Calculate based on qualified traffic that actually sees your test

Get Expert Testing Guidance

Wise Uplift designs statistically rigorous testing programs that avoid these pitfalls and deliver reliable results.
