I got an email from a friend who’s been experimenting with A/B testing for the past few days. Even if A/B testing isn’t the best approach to clickthrough optimization, it’s still a decent one, and he was pretty excited to see that his mad-scientist tinkering was leading to meaningful results.
25.9% clickthrough on path A and 31.2% on path B. That’s almost a 20% increase and I’m thrilled! Do I set B to the default and start branching from there?
If you’re in the position where A/B testing is showing you a clear avenue to get more clicks/exposure/moolah, you’re going to be understandably excited. Still, it’s important to know what you’re looking at — knowing if your results are statistically valid — beyond knowing that higher numbers are better numbers.
I learned about the woes of statistical significance in seventh grade, when I applied to the Virginia Junior Academy of Science — basically Science Fair for middle schoolers. I wasn’t particularly enthusiastic about going, but applying exempted you from the final exam and getting in gave you an A for the class — naturally, I jumped at the chance.
My project was something involving RAM and processor speed. I don’t remember the details of the tests beyond that they were definitely amateurish — what I remember is performing the statistical ‘tests’ on them, dealing with things like chi-squares and alphas and student variates that I had never seen before and was confident I would never have to use again.
Yet again, seventh grade Justin was woefully incorrect.
Statistical significance is exactly what it sounds like: testing your data to make sure that what you’re getting is valid. It’s not important just for research studies and upper-echelon analysis, but for making everyday business decisions.
Still, he was right in his conviction that wasting pages of looseleaf writing out standard deviations was not the correct way to do things (thanks for nothing, Mrs. Lavender). In the rest of this post, I want to talk about how to easily check for statistical significance. I’ll be using Python as a proxy for pseudo-code, but this stuff is easily translatable to a variety of different languages.
The Sciency Parts
First, a vocabulary lesson: a null hypothesis is basically the assumption that whatever we’re testing is insignificant. In A/B testing, the null hypothesis is that there is no difference between the two paths/versions we’re testing, or:
Any differences between Version A and Version B are solely due to randomness.
By default, we assume the null hypothesis to be true. To find it false, we must conduct tests of statistical significance until we’re confident that differences between the two versions aren’t random. Since we can never be perfectly confident about anything, we aim for a certain confidence level, which is exactly what it sounds like (its complement is the significance level, usually called alpha). If we aim for a confidence level of 99.9%, we better damn well be sure of our results before we reject the null hypothesis; alternatively, with a confidence level of, say, 80%, we’re giving our results a lot more leeway.
Running the Test
So, what is the test exactly? How do we decide if we’re confident?
Just like how one language isn’t perfect for all applications, there isn’t one magic statistical test that answers every single question you have — there are dozens of possible statistical frameworks, depending on:
- What do we already know?
- What are we testing? (Quantitative vs. Qualitative?)
- How many things are we testing?
- And many, many more questions
In this case, we’re going to use the Student’s t-test because it’s relatively simple and works well with A/B testing.
The fancy equation for a t-test:
```python
from statistics import mean, stdev

def t_test(results_a, results_b):
    mean_difference = mean(results_a) - mean(results_b)
    # Pooled standard deviation of the two samples
    grand_stdev = (0.5 * (stdev(results_a) ** 2 + stdev(results_b) ** 2)) ** 0.5
    # t-statistic; assumes both result lists are the same length
    return abs(mean_difference / (grand_stdev * (2 / len(results_a)) ** 0.5))
```
Or, in mathematical terms:
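$$
t = \frac{\lvert \bar{x}_A - \bar{x}_B \rvert}{s_p \sqrt{2/n}}, \qquad s_p = \sqrt{\frac{s_A^2 + s_B^2}{2}}
$$

Here $\bar{x}_A$ and $\bar{x}_B$ are the sample means, $s_A$ and $s_B$ the sample standard deviations, and $n$ the number of results in each path; this is just the code above written out.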
(Note that you’re passing in the results as lists, not a precompiled mean or percentage; in A/B testing, the results would look like
[0,0,0,1,1,0,1,0] where ones are hits and zeroes are misses.)
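To make that concrete, here’s a quick usage sketch (the result lists are made up and far too small for a real test):

```python
path_a = [0, 0, 0, 1, 1, 0, 1, 0]  # hits and misses for version A
path_b = [1, 0, 1, 1, 0, 1, 1, 0]  # hits and misses for version B

t = t_test(path_a, path_b)
print(t)  # the t-statistic we'll compare against a target t-value below
```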
The above function spits out what we call a “t-statistic”, a general measure of likelihood. We now compare this against a target t-value based on our sample size and our confidence level to see whether or not it’s significant. There are a lot of ways to do this comparison, but in the spirit of making life easy I’m going to just show you a nice link that calculates it for you (Note: for a two-sample test like this one, degrees of freedom = the total sample size across both paths, minus two.)
If your calculated t-statistic is greater than the target t-value, congrats! Your result is statistically significant!
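If you’d rather compute that target t-value yourself instead of relying on a lookup table, SciPy can do it. A sketch, continuing the made-up example above and assuming a 95% confidence level:

```python
from scipy import stats

confidence = 0.95
degrees_of_freedom = len(path_a) + len(path_b) - 2

# Critical t-value for a two-tailed test at the chosen confidence level
target_t = stats.t.ppf(1 - (1 - confidence) / 2, degrees_of_freedom)

print(t > target_t)  # True means we can reject the null hypothesis
```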
So, the cliff notes version:
- Get a solid sample size of A/B testing.
- Now double that sample size.
- Plug those results into the above method — call this number t.
- Now plug the sample size into the above link — call this number x.
- If t > x, you win!
If you’re already using SciPy, then this post probably didn’t help you out too much: it has a built-in t-test method that makes your life a breeze.
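For the curious, that looks roughly like this (a sketch using scipy.stats.ttest_ind on the same made-up lists from earlier; equal_var=True matches the pooled-variance assumption of the hand-rolled version):

```python
from scipy import stats

t_statistic, p_value = stats.ttest_ind(path_a, path_b, equal_var=True)

# At a 95% confidence level, reject the null hypothesis when p < 0.05
print(p_value < 0.05)
```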
More broadly speaking, if you’re running a startup you might not think it’s worth your time to do the above testing, and you might be correct. Time is the biggest value generator, and your time might be better spent working on implementations than poring over A/B analysis. In that case, here is the big takeaway:
Get a big sample size. If you’re in the position to do A/B tests, it shouldn’t be tough to get around ten thousand impressions. If you think that’s too much, you shouldn’t be doing A/B tests.
If you thought this was interesting or helpful, you should follow me on Twitter.