Split Testing (A/B Testing)

Split testing, also called A/B testing or controlled experimentation, is the practice of simultaneously exposing two or more randomly assigned groups of customers to different product versions and measuring the difference in outcomes. Because the variants run concurrently and assignment is random, the version shown is the only systematic difference between the groups. That is what allows a split test to isolate a single variable and provide causal evidence that a change produced an observed effect.

The mechanics are straightforward:

  • Divide the incoming user population randomly into a control group (existing version) and one or more treatment groups (variant versions).
  • Expose each group to its assigned version for a defined period.
  • Measure a pre-specified metric for each group.
  • Test whether the observed difference exceeds what chance alone would produce.
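
The steps above can be sketched in Python. This is a minimal illustration, not a production design: the function names `assign_variant` and `two_proportion_z_test`, the experiment key, and the choice of a pooled two-proportion z-test on a conversion metric are all assumptions made for the example.

```python
import hashlib
from math import sqrt
from statistics import NormalDist


def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Map a user to a variant via a hash of (experiment, user_id).

    Hashing keeps the assignment stable across visits while remaining
    effectively random with respect to user characteristics.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (absolute lift of B over A, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, p_value
```

For example, `assign_variant("user-42", "checkout-button")` always returns the same group for that user, and `two_proportion_z_test(100, 1000, 150, 1000)` compares a 10% control conversion rate with a 15% treatment rate. Including the experiment name in the hash is a design choice that keeps assignments in different experiments from lining up with each other.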

Why Simultaneous Beats Sequential

The critical design choice is simultaneity. Without it, any observed change between “before” and “after” is confounded by seasonal effects, external events, changes in user mix, and the natural drift of user behaviour over time. Two sequential releases cannot tell you which of many changes caused an improvement — or whether the improvement would have happened anyway.
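
A small worked example makes the confounding concrete. The numbers below are hypothetical: the baseline conversion rate rises from one week to the next for seasonal reasons, and the new version is assumed to have no real effect at all.

```python
# Hypothetical baseline conversion rates; the rise is purely seasonal.
baseline_rate = {"week_1": 0.10, "week_2": 0.12}
true_treatment_effect = 0.0  # assumption: the new version changes nothing


def observed_rate(week: str, treated: bool) -> float:
    """Conversion rate a group would show in a given week."""
    return baseline_rate[week] + (true_treatment_effect if treated else 0.0)


# Sequential release: old version in week 1, new version in week 2.
sequential_estimate = observed_rate("week_2", True) - observed_rate("week_1", False)

# Split test: both versions run side by side in week 2.
concurrent_estimate = observed_rate("week_2", True) - observed_rate("week_2", False)
```

The sequential comparison credits the new version with the whole seasonal rise, while the concurrent comparison cancels it out because both groups experience the same week.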

Eric Ries, in The Lean Startup, makes a counter-intuitive but well-supported argument: continuous split testing, though it seems like more overhead, is more efficient than large-batch sequential releases. When a team releases 40 features at once and metrics improve, it cannot attribute the improvement to any one of them. With split tests, each change can be credited or discarded as soon as its own test concludes. The feedback loop is faster despite the apparent additional work.

Eric Schmidt described Google’s culture of constant experimentation along similar lines: the company runs thousands of experiments simultaneously, not to increase work, but because experimentation at scale is the only reliable way to know what works.

Statistical Validity Requirements

Split tests are only valid under specific conditions:

  • Sufficient sample size: Under-powered tests miss real effects (false negatives), and the results they do flag as significant are more likely to be spurious or exaggerated. Sample size must be calculated in advance from the baseline rate, the smallest effect worth detecting, the significance level, and the desired statistical power (conventionally 80%).
  • Pre-registration of metrics: Choosing which metric to measure after seeing results inflates false positive rates (the “HARKing” problem — Hypothesizing After Results are Known).
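  • No peeking: Checking results repeatedly and stopping as soon as they look significant violates the assumptions of a fixed-sample test. This “peeking problem” inflates the false positive rate and systematically overstates effect sizes, unless a sequential testing method designed for interim looks is used.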
  • Independence: Results are invalidated when users in one group can affect users in another (a problem for social networks and marketplace products where network effects exist).
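
Two of these requirements lend themselves to a short Python sketch. The first function uses the standard normal-approximation formula for comparing two proportions to estimate the sample size per group. The second simulates A/A tests (no true effect) to compare the false positive rate of a single final analysis with that of stopping at the first significant interim look. The function names, default parameters, and the unit-variance normal metric in the simulation are assumptions chosen for illustration.

```python
from math import ceil, sqrt
from random import Random
from statistics import NormalDist


def sample_size_per_group(p_base, min_detectable_lift, alpha=0.05, power=0.80):
    """Approximate users needed per group to detect an absolute lift."""
    p_new = p_base + min_detectable_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return ceil((z_alpha + z_power) ** 2 * variance / min_detectable_lift ** 2)


def aa_false_positive_rates(n_looks=10, users_per_look=100,
                            simulations=400, alpha=0.05, seed=0):
    """Return (rate with one final analysis, rate when stopping at any look)."""
    rng = Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    fixed_hits = peeking_hits = 0
    for _sim in range(simulations):
        sum_a = sum_b = 0.0
        n = 0
        significant_at_any_look = False
        significant_at_end = False
        for _look in range(n_looks):
            for _user in range(users_per_look):
                # Both groups draw from the same distribution: no true effect.
                sum_a += rng.gauss(0.0, 1.0)
                sum_b += rng.gauss(0.0, 1.0)
            n += users_per_look
            z = (sum_b / n - sum_a / n) / sqrt(2.0 / n)
            significant = abs(z) > z_crit
            significant_at_any_look = significant_at_any_look or significant
            significant_at_end = significant  # overwritten until the final look
        fixed_hits += int(significant_at_end)
        peeking_hits += int(significant_at_any_look)
    return fixed_hits / simulations, peeking_hits / simulations
```

Calling `sample_size_per_group(0.10, 0.02)` estimates the users needed per group to detect a rise from 10% to 12% conversion at 80% power. Calling `aa_false_positive_rates()` returns the two error rates; because every final-look hit also counts as a peeking hit, the peeking rate can never be lower, and with ten looks it is typically well above the nominal 5%.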

Limitations and Failure Modes

Split testing measures short-term behaviour accurately but can miss long-term effects: a change that increases day-1 engagement may reduce 30-day retention. “Novelty effects” — where any change temporarily boosts engagement because it is unfamiliar — are a known confound. Booking.com, which runs 1,000+ simultaneous experiments, has invested substantially in statistical infrastructure to detect and correct for these effects.
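
One simple way to look for a novelty effect is to track the treatment lift day by day and compare the early days of the test with the late ones. The sketch below assumes a hypothetical record format of `(day, group, value)` tuples and uses synthetic data in which the treatment's extra engagement halves each day. It is a rough diagnostic, not a description of any company's infrastructure.

```python
from statistics import mean


def daily_lift(records):
    """Return {day: mean(treatment) - mean(control)}, ordered by day.

    records: iterable of (day, group, value), group in {"control", "treatment"}.
    """
    by_day = {}
    for day, group, value in records:
        by_day.setdefault(day, {"control": [], "treatment": []})[group].append(value)
    return {
        day: mean(groups["treatment"]) - mean(groups["control"])
        for day, groups in sorted(by_day.items())
    }


def early_minus_late_lift(lift_by_day, window=3):
    """Average lift over the first `window` days minus the last `window` days."""
    lifts = list(lift_by_day.values())
    return mean(lifts[:window]) - mean(lifts[-window:])


# Synthetic data: control is flat; the treatment's boost halves each day.
records = []
for day in range(7):
    records.append((day, "control", 1.0))
    records.append((day, "treatment", 1.0 + 0.5 * 0.5 ** day))

lift = daily_lift(records)
decay = early_minus_late_lift(lift)
```

A clearly positive `decay` suggests the lift is fading over time, which is consistent with a novelty effect but does not prove one. A real analysis would also need enough days of data and an estimate of day-to-day noise.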

The practice connects directly to Cohort-Analysis: cohort segmentation is often necessary to detect whether a treatment’s effect differs across user types. It operationalises Innovation-Accounting by generating the causal evidence that innovation accounting requires to distinguish genuine progress from optimised optics.

Future Connections

Will connect to Continuous-Deployment when created.

Sources

  • Ries, Eric (2011). The Lean Startup: How Today’s Entrepreneurs Use Continuous Innovation to Create Radically Successful Businesses. Crown Publishing. ISBN: 978-0-307-88791-7.

    • Chapter 7 (Measure) — split testing as the tool that makes actionable metrics possible; Chapter 9 (Batch) — connection between small-batch releases and continuous experimentation
  • Kohavi, Ron, Diane Tang, and Ya Xu (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press. ISBN: 978-1-108-72426-4.

    • Authoritative industry treatment of A/B testing at scale; covers statistical validity requirements, the peeking problem, novelty effects, and network effect violations, drawing on the authors’ experience at Google, LinkedIn, and Microsoft
  • Kohavi, Ron, Roger Longbotham, Dan Sommerfield, and Randal M. Henne (2009). “Controlled Experiments on the Web: Survey and Practical Guide.” Data Mining and Knowledge Discovery, Vol. 18, No. 1, pp. 140–181. DOI: 10.1007/s10618-008-0114-1

    • Seminal academic paper establishing statistical foundations for web-scale A/B testing; sample size calculation, type I/II error tradeoffs, and the practical constraints of online experiments
  • Fabijan, Aleksander, Pavel Dmitriev, Helena Holmström Olsson, and Jan Bosch (2017). “The Evolution of Continuous Experimentation in Software Product Development.” Proceedings of the 39th International Conference on Software Engineering (ICSE). Buenos Aires, Argentina. DOI: 10.1109/ICSE.2017.76

    • Empirical study of how organisations evolve from occasional A/B tests to continuous experimentation cultures; connects experimentation infrastructure to software architecture decisions
  • Simmons, Joseph P., Leif D. Nelson, and Uri Simonsohn (2011). “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant.” Psychological Science, Vol. 22, No. 11, pp. 1359–1366. DOI: 10.1177/0956797611417632

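    • Foundational paper on how undisclosed flexibility in stopping rules (the peeking problem) and in the choice of measures and analyses inflates false positives; explains why metrics and stopping rules should be fixed and disclosed in advance for valid inference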

Note

This content was drafted with assistance from AI tools for research, organization, and initial content generation. All final content has been reviewed, fact-checked, and edited by the author to ensure accuracy and alignment with the author’s intentions and perspective.