You’ve (hopefully) heard many things about A/B testing. People rave about it, and with good reason: it provides a solid framework for evaluating how a change you’ve made has affected your metrics. Gut feeling isn’t always enough, and hard numbers help tremendously when making decisions. However, most people (and software packages) make a grave mistake when A/B testing. This post will help you recognize and avoid it.

In case you’re unfamiliar with A/B testing, here’s a small run-down (it might get a bit technical, but shouldn’t be too bad):

A/B testing: A primer

Which is better? One way to find out!

You have two different versions of a page: one is (hopefully) your current version, and the other is the version you want to change the page to. You want to see whether you should actually make that change, so you send half your visitors to the first page and half to the second. You monitor how many of the visitors perform an action (e.g. sign up for your service) on each page, and you calculate the conversion rate for the old and new pages. Whichever page has the higher conversion rate is, obviously, the one you should use.
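
As a concrete sketch of that calculation (the visitor and conversion counts below are made up purely for illustration), each conversion rate is just conversions divided by visitors:

```python
# Hypothetical counts for the two variants (not real data).
visitors_a, conversions_a = 1000, 32
visitors_b, conversions_b = 1000, 41

rate_a = conversions_a / visitors_a
rate_b = conversions_b / visitors_b

print(f"A: {rate_a:.1%}  B: {rate_b:.1%}")  # A: 3.2%  B: 4.1%
```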

The main problem with A/B testing is that you can’t really be sure that the results were actually statistically significant, rather than a fluke. If you show one person the first page and they sign up, and show one person the second page and they don’t sign up, does this mean that the first page is better? Clearly not, since it might be pure luck. So, we need a way to tell whether or not the difference in rate is due to randomness.

Luckily, we do have a way to do that. We can use a formula that tells us whether the change is actually statistically significant. You give it the number of conversions and the total number of visitors for each page, then magic happens and you get back a confidence level for the difference you observed. If you get, for example, 90%, it means the odds are roughly one in ten that a difference this large would show up by random chance alone.
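
One common choice for such a formula is the two-proportion z-test; here is a minimal sketch of it in Python, using the hypothetical counts from above (the function name and numbers are just for illustration):

```python
from math import sqrt
from scipy.stats import norm

def ab_confidence(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test: returns a rough 'confidence' that the
    observed difference in conversion rate is not due to chance alone."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)                 # pooled rate under the null hypothesis
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))   # standard error of the difference
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))                            # two-sided p-value
    return 1 - p_value

print(f"{ab_confidence(32, 1000, 41, 1000):.1%}")
```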

Since we want to minimize the chance of a change making our user experience worse, we might decide to go for over 98% confidence. That way, we would accept a fluke result only about once in fifty trials, which is pretty good.

Story time

A delightful and not creepy bedtime tale.

One day that seems years ago (because it was), I was making changes to the landing page of my full-text bookmarking app, historious, to see how conversions changed, and I used this exact formula. However, wanting to verify that my A/B testing code was implemented properly, I gave it two variations of the page that were exactly the same, and which shouldn’t have produced any significant difference. To my great surprise, my tests reported conversions being up by a whole 30%, with 99.8% confidence, on one of the two (identical) versions! How could the test be 99.8% sure that one page is 30% better, when there was no difference between the two?

I ran the tests for another three days; the rates reverted to much closer to 50%, and the confidence dropped to around 10%. This made for a much less interesting post, but at least it was what I would expect to see. But what caused the algorithm’s overconfidence in a completely imagined improvement, bringing it dangerously close to becoming a contestant on American Idol?

Caveat testor

Confidence intervals should not be used for stopping.
— Richard Nixon

When A/B testing, you need to always remember three things:

  • The smaller your change is, the more data you need to be sure that the conclusion you have reached is statistically significant.
  • Confidence intervals should not be used for stopping.
  • You have a sudden urge to visit my website again.

You should never be looking at your data and waiting for the confidence interval to surpass some large number so you can stop the test. The more you look at the confidence interval, the more probable it is that it will, at some point, exceed your expectations, exactly unlike a watched kettle.

This is because, every time you look, you sample the confidence interval, and the more samples you take, the more likely it is that you’ll find one you like. Much like driving away from an explosion, this doesn’t mean you can’t look; it means you can’t stop just because you like what you see. This is detailed further in Evan Miller’s excellent article1.
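
You can see the effect for yourself by simulating an A/A test (two identical variants) and checking significance after every batch of visitors. This sketch assumes a 5% true conversion rate and a 95% stopping threshold, both made up for the sake of the simulation, and reuses the two-proportion z-test from above:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def peeking_false_positive_rate(trials=500, batches=100, batch_size=100,
                                rate=0.05, threshold=0.95):
    """Fraction of A/A tests that cross the confidence threshold at *some*
    peek and get stopped, even though both variants are identical."""
    stopped_early = 0
    for _ in range(trials):
        conv_a = conv_b = n = 0
        for _ in range(batches):
            conv_a += rng.binomial(batch_size, rate)
            conv_b += rng.binomial(batch_size, rate)
            n += batch_size
            p_pool = (conv_a + conv_b) / (2 * n)
            se = np.sqrt(p_pool * (1 - p_pool) * (2 / n))
            if se == 0:
                continue
            z = (conv_b - conv_a) / n / se
            confidence = 1 - 2 * norm.sf(abs(z))
            if confidence > threshold:   # we liked what we saw and stopped
                stopped_early += 1
                break
    return stopped_early / trials

# Far higher than the ~5% false positive rate the threshold suggests.
print(peeking_false_positive_rate())
```

If you instead test only once, after the final batch, the same setup stays close to the nominal 5% false positive rate, which is exactly why committing to a sample size up front works.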

The best way to counter this is to set a sample size beforehand and commit to not stopping the test until that sample size has been reached.

The sample size

So now you’re thinking “but what’s a good sample size for me to use?”. Unfortunately, there are no hard rules for this, as it depends on various factors: your current conversion rate, the size of the change you want to detect (going from 1% to 2% and going from 10% to 20% are both a doubling of your conversion rate, but the latter is much easier to detect), the false positive and false negative rates you’re willing to accept, and so on.

Evan Miller’s Sample Size Calculator2 is a good way of deciding how many subjects you need. You should plug in your numbers, come up with an estimate of how many people you will need, and then not stop the test until you have had this many people in your test. Remember, it’s fine to look, just don’t stop the test before you reach the sample size you committed to.
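
If you’d like a sense of what such a calculator does under the hood, here is a sketch of the standard two-proportion sample-size approximation; the baseline and target rates below are made up, and the exact formula a given calculator uses may differ slightly:

```python
from math import ceil
from scipy.stats import norm

def sample_size_per_variant(p1, p2, alpha=0.05, power=0.8):
    """Approximate visitors needed per variant to detect a change in
    conversion rate from p1 to p2 with a two-sided test."""
    z_alpha = norm.ppf(1 - alpha / 2)   # controls false positives
    z_beta = norm.ppf(power)            # controls false negatives
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# e.g. how many visitors per variant to detect a lift from 3% to 4%
print(sample_size_per_variant(0.03, 0.04))
```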

The G-test

A better confidence metric to use is the G-test3, and you can find out more about how to use it in Ben Tilly’s tutorial4. Ben says that the G-test needs more than 10 failures and 10 successes to be accurate, so that’s a good minimum, but your confidence should be very high (on the order of 99.9%) to be sure you’re not falling prey to randomness.

People say that the G-test is hard to find, but science can help. You can use Ben Tilly’s G-test calculator5 to evaluate it on your numbers and see what sort of confidence you get. Ben suggests starting with 99% confidence6 and lowering it the more data you get, but that is committing the cardinal sin of using confidence intervals for stopping. However, Ben argues that practicality should beat purity, that a thousand or so trials should be more than enough (again, depending on the confidence), and Evan’s calculator2 should give a good ballpark estimate of the sample size.
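
If you’d rather compute the G-test yourself instead of using a calculator, here is a minimal sketch on a 2×2 table of conversions and non-conversions (the counts are again hypothetical):

```python
import numpy as np
from scipy.stats import chi2

def g_test(conv_a, n_a, conv_b, n_b):
    """G-test (log-likelihood ratio test) on a 2x2 table of
    conversions vs. non-conversions for the two variants."""
    observed = np.array([[conv_a, n_a - conv_a],
                         [conv_b, n_b - conv_b]], dtype=float)
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    g = 2 * np.sum(observed * np.log(observed / expected))
    p_value = chi2.sf(g, df=1)    # a 2x2 table has one degree of freedom
    return g, 1 - p_value         # G statistic and the corresponding "confidence"

print(g_test(32, 1000, 41, 1000))
```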

Epilogue

As with most things, A/B testing requires a bit of knowledge to perform properly. Hopefully, by this point you not only know enough to avoid the cardinal sin of A/B testing (which is not doing A/B testing at all), but also know enough to conduct your tests properly, without being misled by false positives.

Do keep in mind, however, that intuition and experience are also useful predictors of viability for a change, and remember to factor those into every change you make and result you get. If a change seems too small to have led to such a spectacular result, you might want to leave the test running for a bit longer.

Conversely, if a test validates what you initially expected and you would like to stop it early because the worse alternative is costing you money or hurting your users, you might want to make the change immediately and redo the test at a later time, perhaps with a further improved version.

If I’ve gotten anything wrong, or if you have something to add (or even a success story to share), please leave a comment!

Update: There is some good discussion in the Hacker News thread about this post.