Watch Out For False Positives — 3 Ways To Get Better At Testing
It's easier than ever for marketers to dive into A/B and multivariate testing, but columnist Benny Blum argues that they need to know how to design a proper test first.
Everyone is testing — and you should be testing, too. If you’re not leveraging your website, CRM, and/or sales data to test and improve your business in some capacity, you’re leaving money on the table.
But, what are you testing? And do you (or should you) trust the results?
Testing software can enable A/B and multivariate testing with ease. Non-technical marketers can now quickly implement complex tests and systematically “prove” positive or negative results within a nicely designed UI.
However, one of the biggest issues keeping results-driven marketers without a statistics background from implementing and interpreting tests correctly is that they often don’t know how to design a proper test.
In this post, I’m going to detail three concepts which, if implemented, can help ensure any test you design is well-thought-out and more likely to deliver true results.
1. Design Of Experiments (DOE)
Design of Experiments is a branch of applied statistics used for planning, executing, and analyzing one or more controlled tests to understand the influence of one or more factors in a complex environment.
R. A. Fisher pioneered DOE in the 1920s and 1930s and formally introduced, among many others, the following concepts:
- Testing against a control (A/B testing)
- Random assignment of participants between test(s) and control groups
- Repeat testing (replication) to ensure accuracy and consistency of results
A well-designed and implemented experiment increases the likelihood of variance detection (good results) and reduces the likelihood of false positives or negatives. And one of the single most important components of a well-designed experiment is a large sample size.
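As a rough illustration of random assignment between test and control groups, here is a minimal Python sketch. The function name `assign_groups`, the group labels, and the visitor IDs are hypothetical and not taken from any particular testing tool; real testing software handles this step for you.

```python
import random


def assign_groups(participants, groups=("control", "test"), seed=None):
    """Randomly split participants into equally sized groups.

    Shuffling first and then dealing participants out in turn keeps the
    group sizes balanced while leaving membership up to chance.
    """
    rng = random.Random(seed)  # a seed makes the split reproducible
    shuffled = list(participants)
    rng.shuffle(shuffled)
    return {
        participant: groups[index % len(groups)]
        for index, participant in enumerate(shuffled)
    }


# Hypothetical visitor IDs, for illustration only.
visitors = [f"visitor_{i}" for i in range(10)]
for visitor, group in sorted(assign_groups(visitors, seed=1).items()):
    print(visitor, group)
```

The point of the shuffle is that nothing about a visitor (time of day, traffic source, device) decides which version they see, so differences between groups can be attributed to the change being tested.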
2. Statistical Power
A small sample increases the likelihood that an apparent win is actually a false positive.
Consider the hypothesis: dogs are bigger than cats. If I use a sample of one dog and one cat (for example, a Havanese and a lion), I would conclude that my hypothesis is incorrect and that cats are bigger than dogs.
But if I used a larger sample with a wide variety of cats and dogs, the extremes would average out, and I’d conclude that, on average, dogs are bigger than cats. Not surprisingly, one of the most common flaws in a test is a sample that is too small.
Fortunately, there’s a way to figure out if your sample is big enough: Statistical Power is the probability that a test will detect a real difference from the control when one exists. The bigger the sample size, the greater the power.
There’s some serious math behind Statistical Power, but here’s a good rule of thumb: if you think your test is done, test a bit longer.
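For readers who want a peek at that math, here is a minimal sketch of a common textbook approximation for the sample size needed to compare two conversion rates. The function name `sample_size_per_group`, the example conversion rates, and the defaults (5% significance level, 80% power) are assumptions chosen for illustration, not values from any specific tool.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_control, p_test, alpha=0.05, power=0.80):
    """Approximate visitors needed per group for a two-sided test
    comparing two conversion rates (normal approximation)."""
    effect = p_test - p_control
    if effect == 0:
        raise ValueError("The two rates must differ to size a test.")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_power = NormalDist().inv_cdf(power)          # desired power
    variance = p_control * (1 - p_control) + p_test * (1 - p_test)
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical example: detecting a lift from a 5% to a 6% conversion rate.
print(sample_size_per_group(0.05, 0.06))
```

Under these assumptions, detecting a lift from 5% to 6% takes roughly eight thousand visitors per group, and the requirement grows quickly as the expected lift shrinks.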
Unfortunately, most testing software charges by the number of impressions monitored in a test. This naturally disincentivizes users from running longer tests, since the cost of goods sold (COGS) to execute the test rises as its duration extends.
If you are operating on a slim budget and need results quickly, try running an A/A test in parallel with an A/B test. If the A/A test generates the same or a similar “positive result,” you can assume a high likelihood of a false positive.
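One way to sketch the A/A sanity check is with a standard two-proportion z-test. The helper names `two_proportion_p_value` and `looks_like_false_positive`, the 0.05 threshold, and the conversion counts below are all hypothetical; your testing tool may compute significance differently.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # no variation at all, so nothing to detect
    z = (conv_b / n_b - conv_a / n_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


def looks_like_false_positive(aa_p_value, ab_p_value, alpha=0.05):
    """Flag an A/B 'win' when the parallel A/A test also looks significant."""
    return ab_p_value < alpha and aa_p_value < alpha


# Hypothetical counts: (conversions, visitors) for each arm.
aa_p = two_proportion_p_value(40, 1000, 62, 1000)  # same page shown twice
ab_p = two_proportion_p_value(40, 1000, 65, 1000)  # control vs. variant
print(looks_like_false_positive(aa_p, ab_p))
```

Since both arms of an A/A test show the identical experience, a "significant" difference there can only come from noise, which is a hint that the A/B result may be noise too.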
3. Regression To The Mean
Imagine an experiment where we ask a hundred people to flip a coin a hundred times and guess the result of each flip.
We would expect the scores to cluster around an average of 50 correct and 50 incorrect. We declare the participants with the top 10 scores in the experiment to be the winners and ask them to perform the experiment again.
Chances are their results in the second experiment will, again, cluster around an average of 50 correct and 50 incorrect. Did the winners of the first round suddenly get worse at guessing?
No. They were outliers in the first round and when challenged again they naturally regressed toward the average score. This phenomenon is very apparent in online tests.
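The coin-guessing experiment is easy to simulate. The sketch below assumes 100 participants (so that a top 10 can be picked) and treats every guess as a 50/50 chance; the function names and the seed are hypothetical choices for illustration.

```python
import random


def guess_score(rng, flips=100):
    """Correct guesses out of `flips`; each guess is right half the time."""
    return sum(rng.random() < 0.5 for _ in range(flips))


def regression_demo(n_people=100, n_winners=10, flips=100, seed=42):
    """Return the winners' average score in round one and in round two."""
    rng = random.Random(seed)
    round_one = [guess_score(rng, flips) for _ in range(n_people)]
    # Pick the people with the highest round-one scores.
    winners = sorted(range(n_people), key=lambda i: round_one[i],
                     reverse=True)[:n_winners]
    avg_one = sum(round_one[i] for i in winners) / n_winners
    # The winners play again; earlier luck carries no advantage.
    avg_two = sum(guess_score(rng, flips) for _ in winners) / n_winners
    return avg_one, avg_two


first, second = regression_demo()
print(f"Winners, round one: {first:.1f} correct on average")
print(f"Winners, round two: {second:.1f} correct on average")
```

The winners' first-round average sits well above 50 simply because they were selected for being lucky, while their second-round average falls back toward 50.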
More often than not, a test shows a strong initial result due to a novelty effect rather than a better user experience. If you let the test run a bit longer, chances are you’ll see the results regress toward the control.
User behavior is difficult to change, and amazing results in a short period of time are more often than not false positives.
This is not intended to undermine the novelty effect of making a change; constantly switching things up can make consumers pay more attention. That said, it takes a lot of data to make a test statistically significant, so chances are you’re working with a dataset too small to support firm conclusions.
If you embrace that reality, then you can spend a little more time strategically designing your experiments to maximize the impact of your hypothesis validation and testing.