Bayesian A/B Tests

Here at RichRelevance we regularly run live tests to ensure that our algorithms are providing top-notch performance.  Our RichRecs engine, for example, displays personalized product recommendations to consumers, and the $$A/B$$ tests we run pit our recommendations against recommendations generated by a competing algorithm, or against no recommendations at all.  We test metrics like click-through rate, average order value, and revenue per session.  Historically, we have used null hypothesis tests to analyze the results of our tests, but we are now looking ahead to the next generation of statistical models.  Frequentist is out, and Bayesian is in!

Why are null hypothesis tests under fire?  There are many reasons [e.g. here or here], and a crucial one is that null hypothesis tests and p-values are hard to understand and hard to explain.  There are arbitrary thresholds (0.05?) and the results are binary – you can either reject the null hypothesis or fail to reject the null hypothesis.  And is that what you really care about? Which of these two statements is more appealing:

(1) “We rejected the null hypothesis that $$A = B$$ with a p-value of 0.043.”

(2) “There is an 85% chance that $$A$$ has a 5% lift over $$B$$.”

Bayesian modeling can answer questions like (2) directly.

What’s Bayesian, anyway?  Here’s a short but thorough summary [source]:

The Bayesian approach is to write down exactly the probability we want to infer, in terms only of the data we know, and directly solve the resulting equation […] One distinctive feature of a Bayesian approach is that if we need to invoke uncertain parameters in the problem, we do not attempt to make point estimates of these parameters; instead, we deal with uncertainty more rigorously, by integrating over all possible values a parameter might assume.

Let’s think this through with an example.  Assume your parameter-of-interest is click-through rate (CTR), and your $$A/B$$ test is pitting two different product recommendation engines against one another.  With null hypothesis testing, you assume that there exist true-but-unknown click-through rates for $$A$$ and $$B,$$ which we will write as $$\text{CTR}_A$$ and $$\text{CTR}_B,$$ and the goal is to figure out if they are different or not.

With Bayesian statistics, we will instead model $$\text{CTR}_A$$ and $$\text{CTR}_B$$ as random variables, and specify their entire distributions (I’ll go through this example in more detail in the next section).  $$\text{CTR}_A$$ and $$\text{CTR}_B$$ are no longer two numbers, but two distributions.

Here’s a quick dictionary of Bayesian terms:

  • prior – a distribution that encodes your prior belief about the parameter-of-interest
  • likelihood – a function that encodes how likely your data is given a range of possible parameters
  • posterior – a distribution of the parameter-of-interest given your data, combining the prior and likelihood

So forget everything you know about statistical testing for now.  Let’s start from scratch and answer our customer’s most important question directly: what is the probability that $$\text{CTR}_A$$ is larger than $$\text{CTR}_B$$ given the data from the experiment (i.e. a sequence of 0s and 1s in the case of click-through rate)?

To compute this probability, we’ll first need to find the joint distribution (a.k.a. the posterior):

$$!P(\text{CTR}_A, \text{CTR}_B | \text{data}),$$

and then integrate across the area-of-interest.  What does that mean?  Well, $$P(\text{CTR}_A,\text{CTR}_B|\text{data})$$ is a two-dimensional function of $$\text{CTR}_A$$ and $$\text{CTR}_B.$$  So to find $$P(\text{CTR}_A>\text{CTR}_B|\text{data})$$ we have to add up all the probabilities in the region where $$\text{CTR}_A>\text{CTR}_B$$:

$$!P(\text{CTR}_A > \text{CTR}_B|\text{data}) = \iint\limits_{\text{CTR}_A > \text{CTR}_B} P(\text{CTR}_A,\text{CTR}_B|\text{data}) \, d\text{CTR}_A \, d\text{CTR}_B.$$
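Before deriving anything, here’s a minimal numerical sketch of what this integral means.  The two Beta densities below are made-up placeholders standing in for the posteriors of $$\text{CTR}_A$$ and $$\text{CTR}_B$$ (real ones are derived later in the post), and the joint is built assuming independence, which is justified just below:

```python
import numpy as np
from scipy.stats import beta

# evaluate each (placeholder) posterior density on a grid of CTR values
grid = np.linspace(0.0, 0.03, 2001)
step = grid[1] - grid[0]
p_A = beta.pdf(grid, 46.0, 5560.0)  # hypothetical posterior density of CTR_A
p_B = beta.pdf(grid, 36.0, 4870.0)  # hypothetical posterior density of CTR_B

# joint density on the grid, assuming independence:
# P(CTR_A, CTR_B | data) = P(CTR_A | data) * P(CTR_B | data)
joint = np.outer(p_A, p_B) * step ** 2

# add up the probability mass in the region where CTR_A > CTR_B
region = grid[:, None] > grid[None, :]
prob = joint[region].sum()
```

This brute-force grid approach is just to make the double integral concrete; the Monte Carlo approach at the end of the post is simpler in practice.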

Actually calculating this integral requires a few insights.  The first is that for many standard $$A/B$$ tests, $$A$$ and $$B$$ are independent because they are observed by non-overlapping populations.  Keeping this in mind, we have:

$$!P(\text{CTR}_A,\text{CTR}_B|\text{data}) = P(\text{CTR}_A|\text{data})\, P(\text{CTR}_B|\text{data}).$$

This means we can do our computations separately for $$\text{CTR}_A$$ and $$\text{CTR}_B$$ and then combine them at the very end to find the probability that $$\text{CTR}_A > \text{CTR}_B.$$  Then, applying Bayes’ rule to both $$P(\text{CTR}_A|\text{data})$$ and $$P(\text{CTR}_B|\text{data}),$$ we get:

$$!P(\text{CTR}_A,\text{CTR}_B|\text{data}) = \frac{P(\text{data}|\text{CTR}_A)P(\text{CTR}_A)\, P(\text{data}|\text{CTR}_B) P(\text{CTR}_B)}{P(\text{data})P(\text{data})}.$$

The next step is to define the models $$P(\text{data}|\cdot)$$ and $$P(\cdot).$$  (We don’t need a model for $$P(\text{data})$$ because, in practice, we’ll never have to use it to compute the probabilities of interest.)  The models are different for every type of test, and the simplest is…

Binary A/B Tests

If your data is a sequence of 0s and 1s, a binomial coin-flip model is appropriate.  In this case we can summarize each side of the test by the parameters $$\text{CTR}_A$$ and $$\text{CTR}_B,$$ where $$\text{CTR}_A$$ is the probability of a 1 on the $$A$$ side.

We’ll need some more notation.  Let $$\text{clicks}_A$$ and $$\text{views}_A$$ be the number of clicks and the total number of views, respectively, on the $$A$$ side.  The likelihood is then:

$$!\begin{align*}P(\text{data}|A) &= P(\text{views}_A, \text{clicks}_A | \text{CTR}_A )\\&= {\text{views}_A \choose \text{clicks}_A} \text{CTR}_A^{\text{clicks}_A} \left(1-\text{CTR}_A\right)^{\text{views}_A-\text{clicks}_A},\end{align*}$$

with a similar-looking equation for the $$B$$ side.  Choosing the prior $$P(\text{CTR}_A)$$ is a bit of a black art, but let’s just use the conjugate Beta distribution for mathematical & computational convenience (see here and here for more about conjugate priors).  Also, for the sake of fairness, we will use the same prior for $$\text{CTR}_A$$ and $$\text{CTR}_B$$ (unless there is a good reason to think otherwise):

$$!\begin{align*}P(\text{CTR}_A) &= \text{Beta}(\text{CTR}_A;\alpha,\beta)\\&= \frac{1}{B(\alpha,\beta)}\, \text{CTR}_A^{\alpha-1}(1-\text{CTR}_A)^{\beta-1},\end{align*}$$

where $$B$$ is the beta function (confusingly, not the same as a Beta distribution), and $$\alpha$$ and $$\beta$$ can be set to reflect your prior belief about what $$\text{CTR}$$ should be.  Note that $$P(\text{CTR}_A)$$ has the same functional form (in $$\text{CTR}_A$$) as $$P(\text{views}_A, \text{clicks}_A | \text{CTR}_A )$$ – that’s precisely the meaning of conjugacy – and we can now write the posterior probability directly as:

$$!\begin{align*}P(\text{CTR}_A |\text{views}_A, \text{clicks}_A) &= \frac{P(\text{views}_A, \text{clicks}_A | \text{CTR}_A ) P(\text{CTR}_A)}{P(\text{views}_A,\text{clicks}_A)}\\&= {\text{views}_A \choose \text{clicks}_A} \frac{\text{CTR}_A^{\text{clicks}_A + \alpha - 1} \left(1-\text{CTR}_A\right)^{\text{views}_A-\text{clicks}_A + \beta - 1}}{P(\text{views}_A,\text{clicks}_A)\,B(\alpha,\beta)}\\&\propto \text{Beta}(\text{CTR}_A; \text{clicks}_A + \alpha, \text{views}_A - \text{clicks}_A + \beta).\end{align*}$$

(In practice it doesn’t really matter what prior we choose – we have so much experimental data that the likelihood will overwhelm the prior easily.  But we chose the Beta prior because it simplifies the math and computations.)
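To make the conjugate update concrete, here’s a small sketch using the example counts and prior from later in the post (the `alpha_prior`/`beta_prior` names are just for this snippet):

```python
from scipy.stats import beta

# the example counts and prior used later in the post
clicks_A, views_A = 450, 56000
alpha_prior, beta_prior = 1.1, 14.2

# conjugacy in action: the posterior is again a Beta distribution,
# with parameters (clicks_A + alpha, views_A - clicks_A + beta)
posterior = beta(clicks_A + alpha_prior, views_A - clicks_A + beta_prior)

# with this much data the posterior mean hugs the raw CTR,
# i.e. the likelihood has overwhelmed the prior
print(posterior.mean())    # ≈ 0.00805
print(clicks_A / views_A)  # ≈ 0.00804
```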

Now we have two Beta distributions whose product is proportional to our posterior – what’s next?  We can numerically compute the integral we wrote down earlier!  In particular, let’s find

$$!P(\text{CTR}_A > \text{CTR}_B | \text{data} = \{ \text{views}_A, \text{clicks}_A, \text{views}_B, \text{clicks}_B\}).$$

To do so, just draw independent samples of $$\text{CTR}_A$$ and $$\text{CTR}_B$$ (Monte Carlo style) from $$\text{Beta}(\text{CTR}_A; \text{clicks}_A + \alpha, \text{views}_A - \text{clicks}_A + \beta)$$ and $$\text{Beta}(\text{CTR}_B; \text{clicks}_B + \alpha, \text{views}_B - \text{clicks}_B + \beta)$$ as follows (in Python):

from numpy.random import beta as beta_dist
import numpy as np

N_samp = 10000  # number of posterior samples to draw
clicks_A = 450  # insert your own data here
views_A = 56000
clicks_B = 345  # ditto
views_B = 49000
alpha = 1.1  # Beta prior parameters - just for the example, set your own!
beta = 14.2

# draw samples from each side's posterior Beta distribution
A_samples = beta_dist(clicks_A + alpha, views_A - clicks_A + beta, N_samp)
B_samples = beta_dist(clicks_B + alpha, views_B - clicks_B + beta, N_samp)

Now you can compute the posterior probability that $$\text{CTR}_A > \text{CTR}_B$$ given the data simply as:

np.mean(A_samples > B_samples)

Or maybe you’re interested in computing the probability that the lift of $$A$$ relative to $$B$$ is at least 3%.  Easy enough:

np.mean(100. * (A_samples - B_samples) / B_samples > 3)

Pretty neat, eh?  Stay tuned for the next blog post where I will cover Bayesian A/B tests for Log-normal data!

PS: How should you set your $$\alpha$$ and $$\beta$$ in the Beta prior?  You can set them both to 1 – that’s like throwing your hands up and saying “all values are equally likely!”  Alternatively, you can set $$\alpha$$ and $$\beta$$ such that the mean or mode of the Beta prior is roughly where you expect $$\text{CTR}_A$$ and $$\text{CTR}_B$$ to be.
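One concrete way to do the latter (a sketch, with made-up numbers): pick a prior mean for CTR and a prior “equivalent sample size” $$\kappa = \alpha + \beta,$$ then solve for $$\alpha$$ and $$\beta$$ using the fact that the Beta mean is $$\alpha/(\alpha+\beta)$$:

```python
# both numbers here are made up for illustration
prior_mean = 0.07  # we expect a CTR of roughly 7%
kappa = 15.0       # alpha + beta; larger means a more confident prior

# the Beta mean is alpha / (alpha + beta), so:
alpha = prior_mean * kappa           # 1.05
beta = (1.0 - prior_mean) * kappa    # 13.95
```

These land close to the $$\alpha = 1.1,$$ $$\beta = 14.2$$ used in the example above.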

Reference: Bayesian Data Analysis, Chapter 2

Comments

  • Amy

    Good post! Is this the same as Gibbs sampling??

  • Sergey Feldman

    Thanks Amy. It’s much simpler than Gibbs sampling! I’m just drawing iid samples from a one-dimensional distribution.

  • tom

    Your Python variable naming in your example is a little off – you can’t import beta, then declare it a float, then call it.

  • Sergey Feldman

    Thanks for the catch Tom!

  • lakshmi

    I am not sure I understand what you mean by “Frequentist is out, and Bayesian is in!” – aren’t you using frequencies to calculate probabilities in Bayesian as well?

  • Sergey Feldman

    Lakshmi: to think of the click-through rate as being a distribution (instead of a single value) is what’s Bayesian about this approach. In frequentist statistics, the assumption is that there is some true-but-unknown click-through rate; null hypothesis tests reflect this assumption. I should probably have been a little clearer: null hypothesis tests are generally thought of as frequentist, and full posterior modeling (a la this blog post) as Bayesian.

  • David

    “And there’s your owl!”

    You glossed over a critically important step: specifying the prior. You should clarify how you came to your choice for alpha and beta in this particular example. Also, since this is basically a hypothesis test, why not calculate a Bayes factor?

  • Sergey Feldman

    Hi David,

    Thanks for the comments. I glossed over the prior for sure! It can be tough to explain well, and others have done a great job elsewhere (I posted some links).

    As for the Bayes factor – I’m not sure I understand how one would use it here. AFAIK, the Bayes factor is useful for selecting one of a bunch of models M1, M2, … when fitting a set of data D. But that’s not the regime here. I have two sets of data – D1 from the A side and D2 from the B side – and I know exactly what kind of model to use for M1 and M2. How would one compute the Bayes factor in the two-dataset case, and what information would it provide?