How To Not Screw Up A/B Testing Your App

This is a guest post by Tim Levine, the data guru at SocialMedia.com.

How many of your app design discussions have ended with: “Okay, let’s test it”? Hopefully you didn’t answer ‘none’, because testing is the fastest way to learn. But testing done carelessly can also misinform business decisions. So I want to help you be careful with the simplest and most common kind of test: the A/B test. Of course, a well-designed multivariate test is the best use of resources, but we have to roll over before we can crawl.

The most common mistake in A/B testing is not running enough ‘trials.’ The second is running too many. The latter risks wasting potentially more productive opportunities, but the former is far worse because you risk managing by noise. How wasteful it would be to invest in a redesign because of a difference that might just be from random variation!

The table below shows some typical productive opportunities that you might present to a user. If the user pursues the opportunity, the trial counts as a ‘success’; otherwise, it counts as a ‘failure’. Typical success rates and appropriate sample sizes are included as well. Note that ‘A’ would be one ‘cell’ (group of users) and ‘B’ would be another.

| User has the opportunity to | Typical success rate (successes/opportunities) | Number of trials needed per cell (see below) |
|---|---|---|
| Click an ad | 0.5% | 52k impressions |
| Invite friends | 12% | 1871 new users |
| Proceed to next stage of flow | 40% | 366 users entering flow |
| Click an ad, then purchase something | 0.001% | 2.6MM impressions |

Since SocialMedia enables developers to monetize their apps through ads, let’s suppose you were considering moving an ad from the top of a page to the bottom. In this example, the ad at the top typically yields a click-through rate (CTR) of 0.5%, but you say, “I don’t care about CTR, I care about money!” We do too, but ideally you want to measure ad performance in isolation from market dynamics. It’s like testing a car on a dynamometer (like when you get your car smogged): it removes the extra variability of road surface, wind, and driver habit.

The first question is this: “How much better would the ad at the bottom have to perform for you to switch?” This is a business decision, and it sets the smallest difference worth acting on, which drives the test design. Do you see coupons for 1% off? Of course not, because marketers know that’s not enough to get people to act. Likewise, your test might precisely determine that moving your ad to the bottom will yield a CTR of 0.5001%, but so what?

While you have to decide at what level you won’t get laughed out of a conference room, I have found that web folks generally consider an improvement of 15% or more enough to invest in creative development or a site redesign. So our sample-size question becomes: “How many ad views do we need to declare a statistically significant difference between a CTR of 0.5% and one of 0.575%?” This is readily calculable:

  • number of samples required per cell = 2.7 * (p1*(1-p1) + p2*(1-p2))/(p1-p2)^2

So for our example we’d have to run:

  • 2.7 *(0.005*0.995 + 0.00575*0.99425)/(.005-00575)^2 = 51,321 ad views for pages with the ad at the top and that many with the ad at the bottom.

(By the way, the pre-factor of 2.7 has a one-sided confidence level of 95% and power of 50% baked into it. These control the risk of switching when you shouldn’t and of not switching when you should. We’re not running drug trials here, so these two choices are fine for our purposes. The calculation above gives the minimum number of trials you need to run, and you should treat it as the maximum too.)
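
For the curious, here is a rough sketch (mine, using SciPy, which the post doesn’t mention) of where that 2.7 comes from: it is the squared sum of the normal quantiles for the chosen confidence and power, and 50% power makes the second quantile zero.

```python
# Where the 2.7 pre-factor comes from: prefactor = (z_alpha + z_beta)^2,
# with z_alpha the quantile for one-sided 95% confidence and z_beta the
# quantile for 50% power, which is zero.
from scipy.stats import norm

z_alpha = norm.ppf(0.95)        # ~1.645 for one-sided 95% confidence
z_beta = norm.ppf(0.50)         # 0.0 for 50% power
print((z_alpha + z_beta) ** 2)  # ~2.71, rounded to the 2.7 used above
```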