The Tipping Point for Outliers in A/B Testing
Malcolm Gladwell recently popularized the term ‘outlier’ when referring to successful individuals. In data terms, however, outliers are data points that are far removed from other data points, or flukes. Though they will make up a small portion of your total data population, ignoring their presence can jeopardize the validity of your findings. So, what exactly are outliers, how do you define them, and why are they important?
A common A/B test we like to perform here at RichRelevance compares a client’s site with our recommendations against the same site without them, to determine the value the recommendations add. A handful of observations (or even a single observation) in this type of experiment can skew the outcome of the entire test. For example, if the recommendation side of an A/B test has historically been winning by $500/day on average, an additional $500 order on the No Recommendation side will single-handedly nullify the apparent lift of the recommendations for that day.
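To make the arithmetic concrete, here is a minimal sketch of that scenario. The order values and the `daily_lift` helper are hypothetical, chosen only so that the typical daily lift comes out to $500; they are not taken from a real test.

```python
def daily_lift(rec_orders, control_orders):
    """Revenue difference for one day: recommendation side minus control side."""
    return sum(rec_orders) - sum(control_orders)

# Hypothetical order values for a single day (illustrative only).
control_orders = [100.0] * 50   # No Recommendation side: $5,000
rec_orders = [110.0] * 50       # Recommendation side:    $5,500

typical_day = daily_lift(rec_orders, control_orders)

# One extra $500 order lands on the No Recommendation side.
fluke_day = daily_lift(rec_orders, control_orders + [500.0])

print(typical_day)  # 500.0
print(fluke_day)    # 0.0
```

A single order out of 51 on one side is enough to erase the day's apparent lift.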
This $500 purchase is considered an outlier. Outliers are defined as data points that strongly deviate from the rest of the observations in an experiment. The threshold for “strongly deviating” can be open to interpretation, but is typically three standard deviations away from the mean, which (for normally distributed data) captures the most extreme 0.3% of observations: roughly 0.27% in total, split evenly between the highest and lowest values.
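The 0.3% figure can be checked directly. The sketch below uses `NormalDist` from Python's standard library; the `tail_fraction` helper is a name introduced here for illustration, and the result only holds under the normality assumption stated above.

```python
from statistics import NormalDist

def tail_fraction(k: float) -> float:
    """Share of a normal distribution lying more than k standard
    deviations from the mean, counting both tails."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return 2 * (1 - standard_normal.cdf(k))

share = tail_fraction(3)
print(f"{share:.2%}")  # 0.27%
```

Real order values are often skewed rather than normal, so the share of observations actually removed in a given test may differ from this figure.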
Variation is to be expected in any experiment, but outliers deviate so far from expectations, and happen so infrequently, that they are not considered indicative of the behavior of the population. For this reason, we built our A/B/MVT reports to automatically remove outliers before calculating results, using the three-standard-deviations-from-the-mean method. This mitigates the client panic or anger that test results skewed by outliers can cause.
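As a rough illustration of the three-standard-deviation rule, here is a minimal sketch. It is not the code behind the reports: the `remove_outliers` function and the sample order values are assumptions made up for this example.

```python
from statistics import mean, stdev

def remove_outliers(values, k=3.0):
    """Keep only values within k sample standard deviations of the mean.

    Requires at least two values, because stdev() is undefined otherwise.
    """
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) <= k * sigma]

# Hypothetical order values: thirty ordinary orders plus one $1,000 fluke.
orders = [90.0, 95.0, 100.0, 105.0, 110.0] * 6 + [1000.0]

cleaned = remove_outliers(orders)

print(len(orders), len(cleaned))  # 31 30
print(mean(cleaned))              # 100.0
```

One caveat with this approach: the outlier itself inflates the mean and standard deviation used to judge it, so in very small samples (about ten observations or fewer) no point can ever sit more than three sample standard deviations from the mean.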
At first glance, it may seem odd to proactively remove the most extreme 0.3% of observations in a test. Our product is designed to upsell, cross-sell, and generally increase basket size as much as possible. So, in an A/B test like the above, if recommendations drive an order from $100 to $200, that’s great news for the recommendations side of the test. But if the recommendations are so effective that they drive an order from $100 to $1,000, that’s bad news, because what would have been a $100 order has become a $1,000 outlier and now gets thrown out.
In order for a test to be statistically valid, all rules of the testing game should be determined before the test begins. Otherwise, we potentially expose ourselves to a whirlpool of subjectivity mid-test. Should a $500 order only count if it was directly driven by attributable recommendations? Should all $500+ orders count if there are an equal number on both sides? What if a side is still losing after including its $500+ orders? Can they be included then?
By defining outlier thresholds prior to the test (for RichRelevance tests, three standard deviations from the mean) and establishing a methodology that removes them, both the random noise and the subjectivity of A/B test interpretation are significantly reduced. This is key to minimizing headaches while managing A/B tests.
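One way to picture "rules fixed before the test" is an immutable rule object that is created up front and applied identically to both sides. This is only a sketch under assumptions: the `OutlierRule` class, the `average_order_lift` function, the choice to compute the mean and standard deviation separately per side, and the sample orders are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, stdev

@dataclass(frozen=True)  # frozen: the rule cannot be modified mid-test
class OutlierRule:
    k: float = 3.0  # allowed distance from the mean, in standard deviations

    def keep(self, orders):
        """Return the orders within k standard deviations of their mean."""
        mu, sigma = mean(orders), stdev(orders)
        return [o for o in orders if abs(o - mu) <= self.k * sigma]

def average_order_lift(control, treatment, rule):
    """Difference in mean order value after applying the same rule to each side."""
    return mean(rule.keep(treatment)) - mean(rule.keep(control))

RULE = OutlierRule()  # defined before any test data arrives

# Hypothetical orders: the control side contains one $1,000 fluke.
control = [90.0, 95.0, 100.0, 105.0, 110.0] * 6 + [1000.0]
treatment = [100.0, 105.0, 110.0, 115.0, 120.0] * 6

raw_lift = mean(treatment) - mean(control)
filtered_lift = average_order_lift(control, treatment, RULE)

print(raw_lift < 0)    # True: the fluke makes the control side look better
print(filtered_lift)   # 10.0
```

Because the rule is written down before the data exists, nobody has to decide mid-test whether a particular large order "should" count.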
Of course, understanding outliers is useful outside of A/B tests as well. If a commute typically takes 45 minutes, a 60-minute commute (i.e. a 15-minute-late employee) can be chalked up to variance. However, a three-hour commute would certainly be an outlier. While we’re not suggesting that you use hypothesis testing as grounds to discipline late employees, differentiating between statistical noise and behavior not representative of the population can aid in understanding when things are business as usual or when conditions have changed.
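The commute example can be expressed the same way, as a distance from the mean measured in standard deviations. The 45-minute average comes from the example above; the 10-minute standard deviation is purely an assumption for illustration, as are the helper names.

```python
def z_score(value, mean, std_dev):
    """Distance of value from the mean, in standard deviations."""
    return (value - mean) / std_dev

def is_outlier(value, mean, std_dev, k=3.0):
    """True if value lies more than k standard deviations from the mean."""
    return abs(z_score(value, mean, std_dev)) > k

TYPICAL_MINUTES = 45.0   # typical commute, from the example above
ASSUMED_STD_DEV = 10.0   # hypothetical spread; not given in the text

print(is_outlier(60.0, TYPICAL_MINUTES, ASSUMED_STD_DEV))   # False (z = 1.5)
print(is_outlier(180.0, TYPICAL_MINUTES, ASSUMED_STD_DEV))  # True  (z = 13.5)
```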