How To Not Screw Up A/B Testing Your App

This is a guest post by Tim Levine, the data guru at SocialMedia.com.

How many of your app design discussions have ended with: “Okay, let’s test it”? Hopefully you didn’t answer ‘none’, because testing is the fastest way to learn. But it can also misinform business decisions if done carelessly. Instead, I want to help you be careful in the simplest and most common kind of test: The A/B test. Of course, a well-designed multi-variate test is the best use of resources, but we have to roll over before we can crawl.

The most common mistake in A/B testing is not running enough ‘trials.’ The second is running too many. The latter risks wasting potentially more productive opportunities, but the former is far worse because you risk managing by noise. How wasteful it would be to invest in a redesign because of a difference that might just be from random variation!

The table below shows some typical productive opportunities that you might present to a user. If the user pursues the opportunity, then the trial becomes a ‘success.’ Otherwise, it becomes a ‘failure’. Typical success rates and appropriate sample sizes are included as well. Note that ‘A’ would be one ‘cell’ (group of users) and ‘B’ would be another.

User has the opportunity to            | Typical success rate (successes/opportunities) | Number of trials needed per cell (see below)
Click an ad                            | 0.5%   | 52k impressions
Invite friends                         | 12%    | 1871 new users
Proceed to next stage of flow          | 40%    | 366 users entering flow
Click an ad, then purchase something   | 0.001% | 2.6MM impressions

Since SocialMedia enables developers to monetize their apps through ads, let's suppose you were considering moving an ad from the top of a page to the bottom. In this example, the ad at the top typically yields a click-through rate (CTR) of 0.5%, but you say, "I don't care about CTR, I care about money!" We do too, but ideally you want to improve ad performance independent of any market dynamics. It's like testing a car on a dynamometer (like when you get your car smogged) because it removes the extra variability of road surface, wind, and driver habit.

The first question is this: "How much better would the ad on the bottom have to do for you to switch?" This is a business decision, and it becomes the substantial difference that will drive the test design. Do you see coupons for 1% off? Of course not, because marketers know that it's not enough to get people to act. Likewise, your test might precisely determine that moving your ad to the bottom will yield a CTR of 0.5001%, but so what?

While you have to decide at what level you won't get laughed out of a conference room, I have found that web folks generally consider an improvement of 15% or more enough to invest in creative development or site redesign. So our sample-size question becomes: "How many ad views do we need to declare a statistically significant difference between a CTR of 0.5% and one of 0.575%?" This is readily calculable:

  • number of samples required per cell = 2.7 * (p1*(1-p1) + p2*(1-p2))/(p1-p2)^2

So for our example we’d have to run:

  • 2.7 * (0.005*0.995 + 0.00575*0.99425) / (0.005 - 0.00575)^2 ≈ 51,321 ad views for pages with the ad at the top, and that many again with the ad at the bottom.

(By the way, the prefactor of 2.7 has a one-sided confidence level of 95% and power of 50% baked into it. These have to do with the risk of switching when you shouldn't and of not switching when you should. We're not running drug trials here, so these two choices are fine for our purposes. The calculation above gives the minimum you need to run, and also the maximum worth running.)
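If you'd rather not punch these numbers into a calculator by hand, the formula is only a few lines of Python. This is just a sketch of the arithmetic above; the function name is mine, and the 2.7 prefactor is hard-coded exactly as in the post:

    import math

    def samples_per_cell(p1, p2, prefactor=2.7):
        """Minimum number of trials per cell to tell rate p1 from rate p2.

        The 2.7 prefactor bakes in a one-sided 95% confidence level and
        50% power: (z_0.95 + z_0.50)^2 = (1.645 + 0)^2 ~= 2.7.
        """
        return math.ceil(prefactor * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2)

    # Baseline CTR of 0.5% vs. a 15% relative improvement (0.575%)
    baseline = 0.005
    target = baseline * 1.15
    print(samples_per_cell(baseline, target))  # ~51,000 ad views per cell, as above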

Next, suppose you ran your two cells following best practices, such as:

  • Running the two cells concurrently
  • Randomly assigning each user to a cell and making sure they stay in that cell for the duration of the test (one way to do this is sketched after this list)
  • Scheduling the test to neutralize time-of-day and day-of-week effects
  • Serving only users from countries that are of interest
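The second bullet is the one people most often get wrong, so here is a code sketch. The post doesn't prescribe a mechanism, but one common way to keep assignments both random and sticky is to hash a stable user identifier, so the same user always lands in the same cell without you having to store anything. The names below are illustrative, not part of any SocialMedia API:

    import hashlib

    def assign_cell(user_id: str, experiment: str = "ad-position-test") -> str:
        """Deterministically assign a user to cell 'A' or 'B'.

        Hashing (experiment name + user id) keeps each user in the same
        cell for the life of the test, and salting with the experiment
        name keeps assignments independent across different tests.
        """
        digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
        return "A" if int(digest, 16) % 2 == 0 else "B"

    print(assign_cell("user-12345"))  # same user, same cell, every time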
And suppose you got the following results:

Test 1

Ad position | Ad views | CTR   | Result -> Implied Action
Top         | 52,121   | 0.52% | Difference smaller than we care about -> Do nothing
Bottom      | 53,782   | 0.57% |

You wouldn't recommend action, because you decided ahead of time that this difference is too small to act on, and so the test cannot declare a significant difference.

However, if the results had been the following:

Test 2

Ad position | Ad views | CTR   | Result -> Implied Action
Top         | 52,121   | 0.52% | Bottom is better -> Switch
Bottom      | 53,782   | 0.60% |

You would recommend moving the ad to the bottom of the page. By the way, the fact that the CTR of the ad on top came out to 0.52% rather than the 0.5% we originally assumed doesn't matter; we just needed a baseline figure to calculate the sample size.
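How do you actually make the "significant or not" call from numbers like these? The post leaves that step implicit, so here is one standard way to check: a one-sided two-proportion z-test, run against click counts back-calculated from the two tables (the counts are rounded, so slightly approximate, and the helper name is mine):

    import math

    def one_sided_p_value(clicks_a, views_a, clicks_b, views_b):
        """One-sided two-proportion z-test: is cell B's rate higher than cell A's?"""
        rate_a, rate_b = clicks_a / views_a, clicks_b / views_b
        pooled = (clicks_a + clicks_b) / (views_a + views_b)
        se = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
        z = (rate_b - rate_a) / se
        return 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z) under the null

    # Click counts implied by CTR * ad views, rounded to whole clicks.
    print(one_sided_p_value(271, 52121, 307, 53782))  # Test 1: ~0.13, not significant
    print(one_sided_p_value(271, 52121, 323, 53782))  # Test 2: ~0.04, significant at the one-sided 95% level

Either way, the substantial difference you picked up front still comes first: in Test 1 the measured lift is under 10%, well short of the 15% you decided to care about, so there is nothing to act on even before the statistics weigh in.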

As a companion to this brief post, I made an equally brief (two-minute) demonstration video to give you an intuitive sense of why getting the test sample size right is important. And if this Wikipedia article on sample size and related articles don't quench your thirst for knowledge, then might I suggest a master's degree?

SocialMedia Sample Size Demo from nick g on Vimeo.

Happy Testing!

(Disclaimer: SocialMedia is an advertiser on this blog.)