Facebook-owned cloud application platform Parse announced the launch of its Parse Push Experiments earlier this month, aimed at allowing developers to incorporate A/B testing into their apps’ push notifications. In a blog post, Facebook data scientist John Myles White offered some best practices for developers using Parse Push Experiments, along with a video of the tool in action.
The video is embedded below, and White wrote about successful A/B testing in general:
We can say that an A/B test succeeds whenever we get a precise, correct answer to the question that originally motivated us to run the test. In other words, a good platform for A/B testing should try to prevent two kinds of failure:
- We should rarely get a result that leaves the answer to our question in doubt.
- We should rarely get an answer that seems precise, but is actually incorrect.
Parse Push Experiments uses three strategies to prevent these two kinds of failure:
- Encourage developers to ask precise questions that can be answered unambiguously.
- Prevent developers from reaching wrong conclusions by always reporting results along with a margin of error.
- Ensure that most A/B tests will give a precise answer by suggesting the minimum number of users that must be included in an A/B test in order to reasonably expect accurate results.
And he wrote on best practices for accomplishing this:
Asking precise questions: Here’s one of the most important things you can do while running A/B tests: Commit to the metric you’re testing before gathering any data. Instead of asking questions like “Is A better than B?”, the Push Experiments platform encourages you to ask a much more precise question: “Does A have a higher open rate than B?”
The distinction between these two questions may seem trivial, but asking the more precise question prevents a common pitfall that can occur in A/B testing. If you allow yourself to choose metrics post hoc, it’s almost always possible to find a metric that makes A look better than B. By committing upfront to using open rates as the definitive metric of success, you can rest assured that Push Experiments will produce precise answers.
Acknowledging margins of error: Once you’ve chosen the question you’d like to answer, you can start gathering data. But the data you get might not be entirely representative of the range of results you’d get if you repeated the same test multiple times. For example, you might find that A seems to be better than B in 25 percent of your tests, but that B seems to be better than A in the other 75 percent.
As such, when reporting the difference between the A and B groups (we’ll call this difference the lift), it’s important to emphasize the potential for variability in future results by supplementing the raw result with a margin of error. If an A/B test shows a lift of +1 percent with a margin of error of ±2 percent, the true lift could fall anywhere from -1 percent to +3 percent, so you should report that the test’s results were inconclusive. If you simply reported a +1 percent change, your results would be misleading and might set up unrealistic expectations about the success of your push strategy in the future. By reporting a range of values that should contain the true answer to your question (this range is what a statistician would call a 95 percent confidence interval), you can help ensure that anyone reading a report about your A/B test will not reach premature conclusions.
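The inconclusive-result rule above can be sketched as a simple check (a hypothetical helper, not part of the Parse API): an interval that straddles zero means there is no clear winner.

```python
# Hypothetical helper: a test is conclusive only when the whole
# 95 percent confidence interval for the lift sits on one side of zero.
def is_conclusive(ci_low: float, ci_high: float) -> bool:
    """Return True if the interval excludes zero (a clear winner)."""
    return ci_low > 0 or ci_high < 0

# The example from the text: +1% lift, interval from -1% to +3%.
print(is_conclusive(-0.01, 0.03))   # inconclusive -> False
print(is_conclusive(0.002, 0.018))  # A clearly wins -> True
```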
At Parse, we determine margins of error for open rate data using a calculation called the Agresti-Caffo method. When you’re working with push notification open rates, the Agresti-Caffo method produces much more reliable margins of error than naive methods like normal approximations.
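For illustration, here is a minimal sketch of the Agresti-Caffo calculation as it is commonly described in the statistics literature (the function name and example counts are assumptions, not Parse’s actual implementation): add one success and one failure to each group, then apply the usual normal approximation to the difference in adjusted rates.

```python
import math

def agresti_caffo(opens_a, sent_a, opens_b, sent_b, z=1.96):
    """95% CI for the difference in open rates (A minus B), using the
    Agresti-Caffo adjustment: add one success and one failure to each
    group before applying the normal approximation."""
    p_a = (opens_a + 1) / (sent_a + 2)
    p_b = (opens_b + 1) / (sent_b + 2)
    se = math.sqrt(p_a * (1 - p_a) / (sent_a + 2) +
                   p_b * (1 - p_b) / (sent_b + 2))
    lift = p_a - p_b
    return lift - z * se, lift + z * se

# Illustrative counts: 620 opens out of 5,000 sends in A
# versus 550 opens out of 5,000 sends in B.
low, high = agresti_caffo(620, 5000, 550, 5000)
print(f"lift CI: [{low:+.3f}, {high:+.3f}]")
```

Because the adjusted interval here lies entirely above zero, this hypothetical test would count as conclusive in A’s favor.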
In addition to automatically calculating margins of error using the Agresti-Caffo method, the Push Experiments platform only reports results after it’s become clear that either A offers a lift over B or that B offers a lift over A — helping to further protect you from reaching premature conclusions. Until there’s enough data to determine a clear winner, the Push Experiments dashboard will report that there’s still uncertainty about whether A or B is more successful.
Choosing the right sample size: Given that the Push Experiments platform will always report results with a margin of error, you’ll want to try to make that margin smaller in order to draw definite conclusions from more of your tests. For example, if you think that your A group will show a lift of 1 percent over your B group, you’ll want to make sure you gather enough data to ensure your margin of error will be smaller than 1 percent.
The process of picking a sample size that ensures that your margin of error will be small enough to justify a definite conclusion is called power analysis. The Push Experiments platform automatically performs a power analysis for your A/B test based on the historical open rates for your previous push notifications. To simplify the process, we provide suggested sample sizes based on the assumption that you’ll be trying to measure lifts at least as large as 1 percent with your A/B tests.
Only running A/B tests with carefully chosen sample sizes makes it much more likely that your A/B tests will succeed. If you select a sample size that’s much smaller than the size we suggest, you should expect that many of your A/B tests will lead to inconclusive results.
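As a rough illustration of what such a power analysis computes, the standard two-proportion sample-size formula gives the number of users needed per group to detect a minimum lift. The function, the 80 percent power assumption, and the example numbers below are illustrative — a sketch of the general technique, not Parse’s actual calculation.

```python
import math

def sample_size_per_group(baseline_rate, min_lift,
                          z_alpha=1.96, z_beta=0.8416):
    """Approximate users needed in each group to detect a lift of
    `min_lift` over `baseline_rate` at 95% confidence with 80% power,
    using the standard two-proportion sample-size formula."""
    p1 = baseline_rate
    p2 = baseline_rate + min_lift
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / min_lift ** 2
    return math.ceil(n)

# e.g. a historical open rate of 10% and a minimum lift of 1%:
print(sample_size_per_group(0.10, 0.01))
```

Note how the required sample size grows with the square of the inverse lift: halving the lift you want to detect roughly quadruples the number of users you need, which is why underpowered tests so often end inconclusively.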
We believe the combination of precise questions, clean statistics and careful choice of sample size is essential for running a successful A/B test. You can achieve that with Parse Push Experiments, and we hope this look into the statistical methods behind our platform will help you do it.