"When the facts change, I change my mind. What do you do, sir?" - John Maynard Keynes (maybe)

A/B testing, as the name suggests, is a method of statistically comparing two versions of a design to determine which is more effective. Simply put, an A/B test compares two versions of a single variable by testing subjects' responses to version A against version B, and determining which of the two performs better.

I work in the biopharma industry, and A/B tests come up all the time. Our aim is to compare the effectiveness of a drug A versus a drug B, so we test drug A on some fraction of a patient group and drug B on the rest. After sufficient trials are performed, the in-house statisticians (my SAS people!) measure the efficacy to ascertain which drug gave better results.

Along the same lines, front-end engineers and website designers want to know which design of their product (website, mobile app, etc.) leads to more conversions. A conversion is an action such as a visitor signing up or buying something. This gives rise to a conversion rate, defined as the ratio of desired conversions achieved (say, sign-ups) to the traffic on the page (say, clicks).

In the case of a web page, our A and B prototypes will have alternate designs. This could be anything from the color of a button to the layout of the page. The product designers route some fraction of visitors to site A and the rest to site B, and record whether each visit yielded the desired conversion. These conversions are recorded and studied afterwards.

In classical statistics, often called frequentist statistics, this post-experiment analysis is done using a hypothesis test. This involves (to the horror of many) calculating the infamous p-values and Z-scores to determine statistical significance. This article focuses on Bayesian statistics instead. In frequentist statistics we assume the parameter of interest is a fixed constant and calculate the probability of seeing the observed data given that parameter. In Bayesian statistics, the parameter of interest is treated as uncertain, and we capture this uncertainty in a prior distribution. After observing the data, we compute a posterior distribution representing our updated uncertainty about the parameter.

Okay, the TL;DR of it: the true frequency of obtaining a head on an unbiased coin toss is 1/2. However, if we toss the same coin 5 times, we might not see a head at all; that would be the observed frequency in this case. We will use Bayesian statistics to estimate the true frequency of our conversion event using an appropriate prior and the observed data.
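
To make the distinction between observed and true frequency concrete, here is a minimal simulation sketch (the seed and the numbers of tosses are arbitrary choices of mine, not part of the analysis below):

import numpy as np

# A fair coin has a true heads frequency of 0.5, but a handful of tosses
# can easily show something very different.
np.random.seed(0)
few_tosses = np.random.binomial(1, 0.5, 5)        # five tosses, 1 = heads
many_tosses = np.random.binomial(1, 0.5, 100000)  # a hundred thousand tosses

print("Observed frequency after 5 tosses     :", few_tosses.mean())
print("Observed frequency after 100000 tosses:", many_tosses.mean())  # close to 0.5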

Let's focus on the analysis of site prototype A and set up some basic notation:

  1. PA = Probability that users who are shown A convert

  2. N = number of people A was shown to

  3. n = number of people who converted

Here it might seem safe to assume that PA = n/N, but that's not the case. It's the same situation of observed versus true frequencies as in the coin example above. Let's get our hands dirty with a working example.

I'm generating some data below for two website prototypes, A and B. Here we model N as the number of clicks received on the website and n as the number of people who signed up. Such a dataset in real life would follow a binomial distribution.

Now, before I go into details, I should explain what a probability distribution is. Think of things that can happen: coins are tossed, dice are rolled, flights take off. After the event has occurred you get a specific outcome; in this case the coin toss resulted in heads, the die came up 5, and the flight took off half an hour late. But before a particular outcome is observed, we can only project how likely each outcome is. This projection is done through probability distributions, which describe the probability of each outcome. For example, a coin toss has a 0.5 probability of heads and a 0.5 probability of tails; that is a probability distribution over the two possible outcomes of a coin toss. A binomial distribution models the number of successes in a sequence of N independent yes/no events, each of which succeeds with a certain probability. Think of the number of clicks versus the number of sign-ups (yes/no events) in our case.
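
As a quick, hedged illustration of that binomial model (the "true" conversion rate of 0.06 and the click count are made up for this sketch only):

import numpy as np
from scipy.stats import binom

np.random.seed(1)
true_p = 0.06     # assumed true conversion rate, for illustration
clicks = 1000     # number of yes/no events (clicks)

# Number of sign-ups out of 1000 clicks, simulated a few times
print(np.random.binomial(clicks, true_p, size=5))

# Probability of observing exactly 60 sign-ups under this model
print(binom.pmf(60, clicks, true_p))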

So let's assume -

Prototype A - Got 1065 clicks for 69 sign ups
Prototype B - Got 1039 clicks for 58 sign ups

In [231]:
import pymc as pm
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# Assign 1065 clicks resulting in 69 sign ups for A
# np.hstack takes a sequence of arrays and stacks them horizontally to make a single array.
A = np.hstack(([0]*(1065-69),[1]*69))

# Assign 1039 clicks resulting in 58 sign ups for B
B = np.hstack(([0]*(1039-58),[1]*58))

from pylab import rcParams
rcParams['figure.figsize'] = 12,5

Priors

Our goal is to use what we know, that is the total number of clicks and the total number of sign-ups, to estimate the true frequency of conversion (PA). To set up a Bayesian model we need to assign a prior distribution to what we think PA might be. A good starting point is to assume that PA is uniformly distributed over [0,1]. A uniform distribution assigns equal probability weight to all points in the domain (0,1), also known as the support of the distribution. We will repeat the same for prototype B and assign a prior to PB (the probability that users who are shown B convert).

Now, if we consider each possible sign-up, we can model it as an event that follows a Bernoulli distribution. If you read my coin toss example above, you already know what a Bernoulli distribution is. Tossing a fair coin gets you either heads or tails with equal probability 0.5; simply put, these probabilities of the two outcomes form a Bernoulli distribution. We can generalize and say that in the case of any coin toss (fair or unfair), the coin lands heads with probability p and tails with probability 1-p.
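
A tiny sketch of drawing from a Bernoulli distribution with some success probability p (the value 0.065 is arbitrary, chosen only to look like a conversion rate):

from scipy.stats import bernoulli

p = 0.065                                          # assumed success probability, illustration only
draws = bernoulli.rvs(p, size=10, random_state=2)  # 1 = success (heads / sign-up), 0 = failure
print("Ten draws            :", draws)
print("Observed success rate:", draws.mean())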

We can map the Bernoulli distribution onto our website example: when a user clicks on our A prototype, he tosses a coin; if the coin lands heads, he signs up, otherwise he doesn't. The same applies to prototype B. The essence is that a sign-up on either A or B can be modelled intuitively as a Bernoulli trial. For our experiment we want to determine whether PA > PB or vice versa, so we are interested in the difference between the two. Hence, we define a difference function that creates the posterior distribution of that difference.

Posterior

After creating observation variables for each prototype using a Bernoulli distribution, we run a Markov chain Monte Carlo (MCMC) sampler to draw samples from each prototype's posterior distribution. I will not go into the details of the algorithm here. MCMC methods are a class of algorithms that let you sample from a posterior distribution by constructing a Markov chain whose equilibrium distribution is the desired posterior. The first moves of the algorithm are not very reflective of the posterior, as it takes time to converge to equilibrium; therefore, we throw out the initial samples (the burn-in).

In [232]:
# Uniform distribution of prior probability
pA = pm.Uniform('pA', 0,1)
pB = pm.Uniform('pB',0,1)

# Deterministic variable for the difference pA - pB
@pm.deterministic
def difference(pA = pA, pB = pB):
    return pA - pB

# Bernoulli variables for observation
oA = pm.Bernoulli('oA', pA, value = A , observed = True)
oB = pm.Bernoulli('oB', pB, value = B , observed = True)

# Model and sample (pass the observed Bernoulli variables, not the raw data arrays)
model = pm.Model([pA, pB, difference, oA, oB])
mcmc = pm.MCMC(model)

# Sample 25000 points and throw out the first 5000
mcmc.sample(25000, 5000)
 [-----------------100%-----------------] 25000 of 25000 complete in 2.2 sec

Now that we have run our model, let's take a look at the posterior distributions of PA and PB.

In [233]:
a_post = mcmc.trace('pA')[:]
b_post = mcmc.trace('pB')[:]
sns.kdeplot(a_post, shade=True, color="b", label= "Posterior for prototype A")
sns.kdeplot(b_post, shade=True, color="r", label= "Posterior for prototype B")
Out[233]:
[Plot: kernel density estimates of the posteriors for prototype A and prototype B]

More importantly, let's look at the posterior of the difference distribution (PA - PB).

In [234]:
# posterior samples of the difference (prototype A - prototype B)
delta_D = mcmc.trace('difference')[:]
sns.kdeplot(delta_D, shade=True, color="g", label = "Posterior of difference")
Out[234]:
[Plot: kernel density estimate of the posterior of the difference (PA - PB)]

Looking at the above graph, we see that most of the distribution lies to the right of x = 0. This implies that most of the points sampled from prototype A's posterior are larger than those sampled from prototype B's posterior, since we calculated PA - PB.

Results

From the above graphs we can get some answers to our business question of comparing design A vs design B. For the traffic received on A vs B, we compute the probability that design A got more sign ups than B as -

In [235]:
#Area under the curve greater than 0, probability that A is better than B
print("Probability that prototype A gets more sign ups than site B is", (delta_D > 0).mean())

#Area under the curve less than 0, probability that B is better than A
print("Probability that prototype B gets more sign ups than site A is", (delta_D < 0).mean())
Probability that prototype A gets more sign ups than site B is 0.79525
Probability that prototype B gets more sign ups than site A is 0.20475

Therefore we see that in this scenario, design A outperforms design B as far as getting users to sign up is concerned.

I would like to point out, though, that we went ahead with a uniform distribution as the prior for both A and B. Let's look at the uniform distribution a little more closely. For the uniform distribution over the interval (0,1) we have -

In [236]:
from scipy.stats import uniform
test = uniform()
y = np.linspace(0,1,500)[1:-1]

plt.plot(y, test.pdf(y))
plt.title("Uniform Distribution")
Out[236]:
[Plot: the Uniform(0,1) probability density function]

As you can see above, it's a simple distribution. As mentioned previously, it assigns equal probability weight to all points in the domain (0,1), also known as the support of the distribution. What if we want a distribution that is not flat everywhere over (0,1)? What do we do then?

Beta Distribution

This is where the Beta distribution comes in. The Beta distribution can be seen as a generalization of Uniform(0,1), as it allows more general probability density functions over the interval (0,1) to be defined. Think of it as a versatile family of probability distributions over (0,1) that lets us create priors incorporating some of our beliefs, and thus informative priors. When you start from a uniform prior and observe k successes out of n trials, the posterior distribution can be modelled as Beta(k+1, n-k+1).
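
As a sketch of that Beta(k+1, n-k+1) update (using prototype A's counts from above purely as an example):

from scipy.stats import beta

k, n = 69, 1065                      # sign-ups and clicks for prototype A
posterior = beta(k + 1, n - k + 1)   # posterior after a flat Beta(1,1) prior

print("Posterior mean       :", posterior.mean())          # close to k/n
print("95% credible interval:", posterior.interval(0.95))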

If you have a coin with an unknown bias p and that coin lands heads-up on 70 out of 100 flips, then p ~ Beta(71, 31). If you send out an email campaign and get 100 conversions out of 10,000 emails sent, then the true conversion rate p for the campaign will be Beta(101, 9901) distributed. I'll keep things simple and visual here. Using two parameters a and b, the Beta(a,b) distribution can be visualised as -

In [237]:
from scipy.stats import beta

# A few (a, b) pairs to plot; note that Beta(1,1) reduces to Uniform(0,1)
aval = np.linspace(1,10,5)
bval = np.linspace(1,10,5)

for a,b in zip(aval, bval):
    plt.plot(y, beta(a,b).pdf(y), label="Beta(%s,%s)" %(a,b))
    plt.legend()
    plt.title("Beta Distribution")

If we look carefully at the plot above, we can see that the blue line corresponding to the Beta(1,1) distribution is the same as Uniform(0,1), confirming that the Beta family generalizes the uniform distribution. I am not writing out the equation governing the probability density function of the Beta distribution here, but if you substitute (1,1) for a and b you get the uniform distribution over (0,1).
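
A quick numerical check of that claim (a throwaway sketch, not part of the analysis):

from scipy.stats import beta, uniform
import numpy as np

grid = np.linspace(0.01, 0.99, 99)
# Beta(1,1) and Uniform(0,1) have identical densities on (0,1)
print(np.allclose(beta(1, 1).pdf(grid), uniform().pdf(grid)))   # True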

Let's see how the Beta distribution can help us quantify and improve the uncertainty in our A/B testing estimates. Let's take the same A and B prototypes again; this time around I'll just store the clicks and sign-ups differently, as counts rather than arrays of 0s and 1s. Remember, as always, a sign-up on either A or B implies a conversion.

In [238]:
from scipy.stats import beta
A_site = {"clicks":1065, "conversions":69}
B_site = {"clicks":1039, "conversions":58}
In [239]:
print(" A's observed Conversion Rate is %.4f"%(A_site['conversions']/A_site['clicks']))
print(" B's observed Conversion Rate is %.4f"%(B_site['conversions']/B_site['clicks']))
 A's observed Conversion Rate is 0.0648
 B's observed Conversion Rate is 0.0558

Now let's try modelling our posterior using the Beta distribution. We take a flat, uniform prior for the conversion rate, which corresponds to Beta(1,1); the '1' added to each parameter below accounts for the influence of this prior.

In [240]:
def posteriors(x, data, prototype):
    # plot the Beta(1 + conversions, 1 + clicks - conversions) posterior for this prototype
    y = beta.pdf(x, 1 + data['conversions'], 1 + data['clicks'] - data['conversions'])
    plt.plot(x, y, label = prototype)
    plt.legend()
    return

x = np.linspace(0.02,0.10,1065)

posteriors(x, A_site, 'A')
posteriors(x, B_site, 'B')

As seen above, A's posterior is to the right of B's; this is because A had more conversions than B, and we calculated A's observed conversion rate to be higher than B's.

However, in reality the true conversion rate of B could still be higher. That means, if you look at the graph above, the true value of B (in green) might lie in its right tail, and the true value of A (in blue) might lie in its left tail. Now we want to know how often that can occur: how often does a value of B fall above a value of A? To calculate this, we'll draw samples from the posterior distributions.

The code below uses beta.rvs to sample random values from the posterior. We draw 25,000 samples from each and calculate the probability that B's conversion rate is greater than A's. A neat way to write this is -

P(B > A | data)

In [241]:
def sample_pos(A_site,B_site):
    
    A_samples = beta.rvs(1 + A_site['conversions'], 1 + A_site['clicks'] - A_site['conversions'], size=25000) 
    B_samples = beta.rvs(1 + B_site['conversions'], 1 + B_site['clicks'] - B_site['conversions'], size=25000)
    return A_samples, B_samples

A_samples, B_samples = sample_pos(A_site,B_site)
result = (B_samples > A_samples).mean()
print("Probability of B being greater than A is ", result)
Probability of B being greater than A is  0.19568

Conversely, we can calculate the probability of A being greater than B as

P(A > B | data)

In [242]:
result2 = (A_samples > B_samples).mean()
print("Probability of A being greater than B is ", result2)
Probability of A being greater than B is  0.80432

Given the current observations, you can conclude that our A prototype will have a higher conversion rate than B (with ~80% probability), but there's still a ~20% chance that you might be wrong.

Let's take another data-set which is more skewed to one of the designs. Consider the following -

Prototype A - Got 1065 clicks for 85 sign ups
Prototype B - Got 1039 clicks for 58 sign ups

In [243]:
A_site2 = { "clicks":1065, "conversions":85 }
B_site2 = { "clicks":1039, "conversions":58 }

print("A's observed Conversion Rate: ",(A_site2['conversions']/A_site2['clicks']))
print("B's observed Conversion Rate: ",(B_site2['conversions']/B_site2['clicks']))

x = np.linspace(0.02,0.12,1065)
posteriors(x, A_site2, 'A')
posteriors(x, B_site2, 'B')
A's observed Conversion Rate:  0.07981220657276995
B's observed Conversion Rate:  0.05582290664100096

Now, let's re-calculate our probability -

P(A > B | Data)

In [244]:
A_samples, B_samples = sample_pos(A_site2,B_site2)
results3 = (A_samples>B_samples).mean()
print("Probability of A being greater than B is ", results3)
Probability of A being greater than B is  0.98424

As we see above, we have a ~98.5% chance that A's conversion rate is higher than B's conversion rate. This is akin to an intuitive Bayesian analogue of a p-value: the probability that A is greater than B, given the observed data. This is NOT the same as the familiar frequentist p-value, which comes from hypothesis testing and Z-scores.

Even with this skewed example we have a 1.5% chance of being wrong. It could be possible (although unlikely) that the true value for prototype A lies in its left tail, and the true value of prototype B lies in its right tail.
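
For contrast, a frequentist analysis of the same counts would typically run a two-proportion z-test. The sketch below uses statsmodels' proportions_ztest on the skewed dataset (85/1065 vs 58/1039); it is only here for comparison and is not part of the Bayesian workflow above.

from statsmodels.stats.proportion import proportions_ztest
import numpy as np

counts = np.array([85, 58])      # conversions for A and B (skewed example)
nobs = np.array([1065, 1039])    # clicks for A and B

z_stat, p_value = proportions_ztest(counts, nobs)
print("z statistic:", z_stat)
print("p-value    :", p_value)   # tests H0: the two conversion rates are equal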

Bayesian Advantages

  • Bayesian A/B testing gracefully incorporates unequal sample sizes
    The Bayesian approach holds a lot of benefits over traditional tests. You can see the uncertainty in the individual conversion rate estimates: if you have fewer data points in one group than in the other, you will see more uncertainty in that group. Let's take an example -
In [245]:
A3 = {'clicks': 2250, 'conversions': 1450}
B3 = {'clicks': 250, 'conversions': 150}

x = np.linspace(0.4, 0.9, 1000)
posteriors(x, A3, 'A')
posteriors(x, B3, 'B')

As you can see above, there's more uncertainty in the value of B compared to A, since B had less data: it could lie anywhere between 0.5 and 0.7. On the other hand, we have a much tighter distribution for A, owing to more data, so we are more certain and confident about the values it can take. Effects of unequal sample sizes are reflected in the uncertainty of the posteriors in Bayesian A/B testing.

  • It lets you calculate the probability of being wrong
    For less risky experiments you can take more risks: you can conclude an experiment at a lower certainty, say 90% instead of 98%. At the opposite end, for important experiments you can hold out until the probability is close to 1.
  • It allows you to calculate relative increases in conversion rate
    In addition to telling you which group's conversion rate is higher, it can also tell you by how much. Classical statistics will tell you significance, but Bayesian testing can give you the magnitude. Using your posterior samples you can calculate relative increases as -
In [246]:
relative_increases = (A_samples - B_samples)/B_samples
plt.hist(relative_increases, bins=25, label = "Relative increase in A as compared to B")
plt.legend()
Out[246]:
[Plot: histogram of the relative increase of A's conversion rate over B's]
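
To put a single number on that magnitude, you can summarise the sampled relative increases from the cell above, for example with a posterior mean and a 95% credible interval (a small sketch; the 2.5/97.5 percentiles are my choice):

print("Mean relative increase:", relative_increases.mean())
print("95% credible interval :", np.percentile(relative_increases, [2.5, 97.5]))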

Next Steps

Our above analysis holds good for two website prototypes. But what if you have more designs? Call it A/B/C/D/E testing. In this case, due to multiple tests, our risk of false positives increases.

A false positive occurs when your model says an outcome is 'true' but in reality it turns out to be false. Take the example of testing 10 pairs of prototypes, and assume that we have a 95% chance of being correct in each individual prediction. Across all tests taken together, our chance of being correct everywhere is now (0.95)^10, which is ~0.60. This puts our chance of at least one error at about 0.40, a false positive rate of 40%. Therefore our false positive rate increases as the number of groups increases.
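
A quick back-of-the-envelope sketch of how that family-wise false positive rate grows with the number of tests (assuming each individual test is correct with probability 0.95):

for m in [1, 5, 10, 20]:
    fwer = 1 - 0.95 ** m             # chance of at least one false positive across m tests
    print("m = %2d tests -> false positive rate ~ %.2f" % (m, fwer))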

There are a couple of ways around this, like the Bonferroni correction, which lowers the threshold for statistical significance (alpha, generally set at 0.05) as the number of groups increases. Another technique that tackles multiple testing is hierarchical modelling. I'll save both of them for a future part II post. This article builds on the teachings of Cameron Davidson-Pilon and the powerful Count Bayesie. Do check out his blog. Thank you for reading this far.


