Are your A/B tests just junk science?

Here are six lessons for marketers from the “Replication Crisis” in the sciences to keep in mind as you design your next round of A/B tests.


In today’s digital-first advertising world, many marketing leaders aspire to conduct marketing as a science. We speak in scientific terms – precision, measurement, data; we hire young professionals with bachelor of science degrees in marketing; we teach our teams to test their hypotheses with structured experiments.

Yet, most marketers have no idea that the sciences are facing a methodological reckoning, as it has come to light in recent years that many published results – even in respected, peer-reviewed journals – fail to replicate when the original experiments are repeated. This phenomenon, known as the “Replication Crisis,” is far from niche. Recent reports suggest that the majority of psychology studies fail to replicate, and certainly many marketers are beginning to feel that, for all the “successful” A/B tests they’ve run, high-level business metrics haven’t much improved.

How could this happen? And what can marketers learn from it? Here are six key points to keep in mind as you design your next round of A/B tests.

The meaning of ‘statistical significance’

You might be tempted to skip this section, but most marketers cannot correctly define statistical significance, so stick with me for the world’s quickest review of this critical concept. (For a more thorough introduction, see here and here.)

We begin any A/B test with the null hypothesis:

There is no performance difference between the ads I’m testing.

We then run the test and gather data, which we ultimately hope will lead us to reject the null hypothesis, and conclude instead that there is a performance difference.

Technically, the question is:

Assuming that the null hypothesis is true, and any difference in performance is due entirely to random chance, how likely am I to observe the difference that I actually see?

The answer to that question is the p-value. Calculating p-values is tricky, but the important thing to understand is: the lower the p-value, the more confidently we can reject the null hypothesis and conclude that there is a real difference between the ads we’re testing. Specifically, a p-value of 0.05 means that, if the null hypothesis were true, there would be a 5 percent chance of seeing a performance difference at least as large as the one observed due to purely random chance.

Historically, p-values of 0.05 or less were deemed “statistically significant,” but it’s critical to understand that this is just a label applied by social convention. In an era of data scarcity and no computers, this was arguably a reasonable standard, but in today’s world, it’s quite broken, for reasons we’ll soon see.

Practical advice: when considering the results of A/B tests, repeat the definition of p-value like a mantra. The concept is subtle enough that a consistent reminder is helpful.

‘Statistical significance’ does not imply ‘practical significance’

The first and most important weakness of statistical significance analysis is that, while it can help you assess whether or not there is a performance difference across your ads, it says nothing about how big or important the difference might be for practical purposes. With enough data, inconsequential differences could be considered “statistically significant.”

For example, imagine that you run an A/B test with two slightly different ads. You run 1,000,000 impressions for each ad, and you find that version A gets 1,000 engagements, whereas version B gets 1,100 engagements. Using Neil Patel’s A/B calculator (just one of many on the web), you will see that this is a “statistically significant” result – the p-value is 0.01, which is well below the usual 0.05 threshold. But is this result practically significant? The engagement rates are 0.1 percent and 0.11 percent respectively – an improvement, but hardly a game-changer in most marketing contexts. And remember, it took 2 million impressions to reach this conclusion, which costs real money in and of itself.
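For readers who want to check the arithmetic, the example above can be approximated with a pooled two-proportion z-test. This is a minimal sketch using only Python's standard library; the function name is illustrative, and online calculators may use a different test or a one-sided rather than two-sided convention, so the p-value they report can differ somewhat.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test.

    Returns (z, one_sided_p, two_sided_p). The one-sided value is for the
    direction of the observed difference.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under the null hypothesis both ads share one underlying rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    two_sided = math.erfc(abs(z) / math.sqrt(2))
    one_sided = two_sided / 2
    return z, one_sided, two_sided


# Numbers from the example: 1,000 vs. 1,100 engagements on 1,000,000 impressions each.
z, p_one, p_two = two_proportion_z_test(1_000, 1_000_000, 1_100, 1_000_000)
absolute_lift = 1_100 / 1_000_000 - 1_000 / 1_000_000

print(f"z = {z:.2f}, one-sided p = {p_one:.3f}, two-sided p = {p_two:.3f}")
print(f"absolute lift = {absolute_lift:.4%} of impressions")
```

Under these assumptions the one-sided p-value comes out at roughly 0.015 and the two-sided value at roughly 0.03: "significant" either way, while the absolute lift is only one extra engagement per 10,000 impressions.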

My practical advice to marketing leaders is to accept the fact that slight tweaks will rarely have the dramatic impact we seek. Embrace the common-sense intuition that it usually takes meaningfully different inputs to produce practically significant outcomes. And reframe the role that testing plays in your marketing so that your team understands significance analysis as a means of comparing meaningfully different marketing ideas rather than as a definition of success.

Beware ‘publication bias’

But… what about all those articles we’ve read – and shared with our teams – that report seemingly trivial A/B tests delivering huge performance gains? “How adding a comma raised revenue by 30 percent.” “This one emoji changed my business,” etc.

While there are certainly little nuggets of performance gold to be found, they are far fewer and farther between than an internet search would lead you to believe, and the concept of “publication bias” helps explain why.

This has been a problem in the sciences too. Historically, experiments that didn’t deliver statistical significance at the p-value = 0.05 level were deemed unworthy of publication, and mostly they were simply forgotten about. This is also known as the “file drawer effect,” and it implies that for every surprising result we see published, we should assume there is a shadow inventory of similar studies that never saw the light of day.

In the marketing world, this problem is compounded by a few factors. The makers of A/B testing software have strong incentives to make it seem like easy wins are just around the corner, and they certainly don’t publicize the many experiments that failed to produce interesting results. And in the modern media landscape, counterintuitive results tend to get shared more frequently, creating a distribution bias as well. We don’t see, or talk about, the results of all the A/B tests run with insignificant results.

Practical advice: Remember that results that seem too good to be true, probably are. Ground yourself by asking “how many experiments did they have to run to find a result this surprising?” Don’t feel pressured to reproduce headline-worthy results; instead, stay focused on the unremarkable, but much more consequential work of testing meaningfully different strategies and looking for practically significant results – that’s where the real value will be found.

Beware ‘p-hacking’

Data is a scientifically inclined marketer’s best friend, but it should come with a warning label, because the more data dimensions you have, the more likely you are to fall into the anti-pattern known as “p-hacking” in one way or another. P-hacking is the label given to the various ways that data analysis can produce seemingly “statistically significant” results from pure noise.

The most flagrant form of p-hacking is merely running an experiment over and over again until you get the desired result. Remembering that a p-value threshold of 0.05 means that there is a 5 percent chance of a difference this large arising by random chance when no real difference exists, if you run the same experiment 20 times, you should expect to get one “significant” result, on average, by chance alone. If you have enough time and motivation, you can effectively guarantee a significant result at some point. Drug companies have been known to do things like this to get a drug approved by the FDA – not a good look.
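The arithmetic behind that claim is easy to sketch. Assuming each repeat is independent and the null hypothesis is true every time (both are simplifying assumptions for illustration), the chance of at least one "significant" result in k attempts is 1 - (1 - alpha)^k, and the expected number of false positives is k times alpha.

```python
def chance_of_false_positive(k, alpha=0.05):
    """Probability of at least one false positive in k independent null tests."""
    return 1 - (1 - alpha) ** k


def expected_false_positives(k, alpha=0.05):
    """Average number of false positives across k independent null tests."""
    return k * alpha


for k in (1, 5, 20, 100):
    print(
        f"{k:>3} runs: "
        f"P(at least one 'win') = {chance_of_false_positive(k):.0%}, "
        f"expected false positives = {expected_false_positives(k):.2f}"
    )
```

With 20 repeats, the chance of at least one spurious "win" is already about 64 percent.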

Most marketing teams would never do something this dumb, but there are subtler forms of p-hacking to watch out for, many of which you or your teammates have probably committed.

For example, consider a simple Facebook A/B test. You run two different ads, targeting the same audience, simple enough. But what often happens when the high-level results prove to be unremarkable is that we dig deeper into the data in search of more interesting findings. Perhaps if we only look at women, we’ll find a difference? Only men? What about looking at the different age bands? Or iPhone vs. Android users? Segmenting the data this way is easy, and generally considered a good practice. But the more you slice and dice the data, the more likely you are to identify spurious results, and the more extreme your p-values must be to carry real weight. This is especially true if your data analysis is exploratory (“let’s check men vs. women”) rather than hypothesis-driven (“our research shows that women tend to value this aspect of our product more than men – perhaps the results will reflect that?”). For a sense of just how bad this problem is, see the seminal article “Why Most Published Research Findings Are False,” which is credited with persuasively raising an early alarm about p-hacking and publication bias.
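One common way to make "more extreme p-values" concrete is the Bonferroni correction, which divides the significance threshold by the number of comparisons made. The article does not name this technique; it is offered here only as a simple, conservative illustration, and the segment p-values below are invented for the example.

```python
def bonferroni_threshold(alpha, n_comparisons):
    """Per-comparison threshold that keeps the overall false-positive rate near alpha."""
    return alpha / n_comparisons


# Hypothetical p-values from slicing one A/B test six ways (made-up numbers).
segment_p_values = {
    "women": 0.04,
    "men": 0.30,
    "age 18-24": 0.012,
    "age 25-34": 0.60,
    "iPhone": 0.003,
    "Android": 0.45,
}

threshold = bonferroni_threshold(0.05, len(segment_p_values))
naive_wins = [name for name, p in segment_p_values.items() if p < 0.05]
survivors = [name for name, p in segment_p_values.items() if p < threshold]

print(f"corrected threshold: {threshold:.4f}")
print(f"'significant' at 0.05: {naive_wins}")
print(f"still significant after correction: {survivors}")
```

In this made-up case three segments clear the naive 0.05 bar, but only one survives once the six comparisons are accounted for.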

In the sciences, this problem has been addressed by a practice called “pre-registration,” in which researchers publish their research plans, including the data analyses they expect to conduct so that consumers of their research can have confidence that the results are not synthesized in a spreadsheet. In marketing, we typically don’t publish our results, but we owe it to ourselves to apply something like these best practices.

Practical advice for marketing teams: Throw the p=0.05 threshold for “statistical significance” out entirely – a lot of p-hacking stems from people searching endlessly for a result that hits some threshold, and in any case, our decision-making should not rely on arbitrary binaries. And make sure that your data analysis is motivated by hypotheses grounded in real-world considerations.

Include the cost of the experiment in your ROI

An often-overlooked fact of life is that A/B tests are not free. They take time, energy and money to design and execute. And all too often, people fail to ask whether or not an A/B test is likely to be worth their time.

Most A/B testing focuses on creative, which is appropriate, given that ad performance is largely driven by creative. And most of what’s written on A/B testing acts as if great creative falls from the sky and all you need to do is test to determine which works best. This might be reasonable if you’re talking about Google search ads, but for a visual medium like Facebook…creative is time-consuming and expensive to produce. Especially in the video era, the cost of producing videos to test is often higher than the expected gains.

For example, let’s say you have a $25k total marketing budget, and you’re trying to decide whether to spend $2k on a single ad, or $5k on five different variant ads. If we assume that you need to spend $1k on each ad variant to test its performance as part of an A/B test, you’d need your winning ad to perform at least 20 percent better than baseline for A/B testing to be worthwhile. You can play with these assumptions using this simple spreadsheet I created.

Twenty percent may not sound like much, but anyone who’s done significant A/B testing knows that such gains aren’t easy to come by, especially if you’re operating in a relatively mature context where high-level positioning and messages are well-defined. Harvard Business Review reports that the best A/B test that Bing ever ran produced a 12 percent improvement in revenue, and they run more than 10,000 experiments a year. You may strike gold on occasion, but realistically, you’ll find incremental wins sometimes, and no improvement most of the time. If your budget is smaller, it’s that much harder to make the math work. With a $15k budget, you need a 50 percent improvement just to break even.
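The break-even figures above can be approximated with a few lines of arithmetic. The sketch below is one reading of the example, not necessarily the model in the author's spreadsheet. It assumes that the $2k and $5k are production costs, that whatever is left of the budget goes to media, that each variant gets $1k of test spend, that losing variants perform at baseline, and that the winner earns its uplift on both its own test spend and all remaining media. The function and parameter names are illustrative.

```python
def break_even_uplift(budget, cost_single, cost_variants, n_variants, test_spend_each):
    """Uplift the winning variant needs for testing to match the single-ad plan.

    Value is measured in baseline-equivalent media dollars.
    """
    media_single = budget - cost_single    # single ad: all media at baseline
    media_multi = budget - cost_variants   # media left after producing the variants
    rollout = media_multi - n_variants * test_spend_each
    if rollout < 0:
        raise ValueError("budget too small to fund the test")
    # Only the winner's own test spend and the rollout earn the uplift.
    spend_on_winner = test_spend_each + rollout
    return (media_single - media_multi) / spend_on_winner


for budget in (25_000, 15_000):
    uplift = break_even_uplift(budget, 2_000, 5_000, 5, 1_000)
    print(f"${budget:,} budget: winner must beat baseline by {uplift:.1%}")
```

Under these assumptions the required uplift is a little under 19 percent for the $25k budget, in line with the roughly 20 percent quoted above, and 50 percent for the $15k budget.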

Practical advice: Remember that the goal is to maximize advertising ROI, not just to experiment for experiment’s sake. Run ROI calculations up front to determine what degree of improvement you would need to make your A/B testing investment worthwhile. And embrace low-cost creative solutions when testing – while there are trade-offs, it may be the only way to make the math work.

Don’t peek!

Marketers love a good dashboard, and calculations are so easy in today’s world that it’s tempting to watch our A/B test results as they develop in real-time. This, however, introduces another subtle problem which could fundamentally undermine your results.

Data is noisy, and the first data you collect will almost certainly deviate from your long-term results in one way or another. Only over time, as you gather more data, will your results gradually approach the true long-term average. Statisticians call this the “law of large numbers” (it is closely related to the better-known idea of “regression to the mean”), and to get meaningful results, you have to let this process play out.

If you look at the p-value on a continuous basis and declare victory as soon as it drops below a certain threshold, you will reach the wrong conclusion far more often than that threshold suggests. Why? Statistical significance analysis is based on the assumption that your sample size was fixed in advance.

As March Madness heats up, a sports betting analogy might help develop intuition. Tonight, in a game that features superstar Zion Williamson’s highly anticipated return from injury, Duke will take on Syracuse in the ACC tournament. Duke is favored by 11.5 points, such that if you bet on Duke, they need to win by 12 points or more for you to win the bet. But critically, only the final score matters! If you tried to bet on a similar but different proposition, that Duke would take a 12-point lead at some point in the game… you’d get far worse odds, because as any basketball fan knows, a 12-point lead can come and go in the blink of an eye, especially in March.

Constantly refreshing the page to check in on your A/B tests is like confusing a “Duke to win by more than 11.5” bet with a “Duke to lead by more than 11.5 at any point in the game” bet. P-values will fluctuate up and down as data is collected, and it’s far more likely that you’ll see a low p-value at some point along the way than at the very end of the test.
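A small simulation can make the cost of peeking tangible. The sketch below runs a batch of "A/A" tests in which both ads share the same true engagement rate, so every "significant" result is a false positive. It compares checking the p-value only at the end with checking at regular intervals and stopping at the first value under 0.05. All parameters (sample size, engagement rate, peek frequency, seed) are arbitrary choices for illustration, and the normal approximation used is rough at the earliest peeks.

```python
import math
import random


def two_sided_p(conv_a, conv_b, n):
    """Two-sided p-value from a pooled two-proportion z-test, n impressions per arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled <= 0.0 or pooled >= 1.0:
        return 1.0  # no variation yet, so no evidence of a difference
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (conv_b - conv_a) / n / std_err
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_aa_tests(n_tests=400, n_per_arm=2000, peek_every=100,
                      rate=0.10, alpha=0.05, seed=42):
    """Return (false-positive rate with peeking, false-positive rate at end only)."""
    rng = random.Random(seed)
    hits_with_peeking = 0
    hits_at_end_only = 0
    for _ in range(n_tests):
        conv_a = conv_b = 0
        crossed_at_any_peek = False
        for i in range(1, n_per_arm + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if i % peek_every == 0 and two_sided_p(conv_a, conv_b, i) < alpha:
                crossed_at_any_peek = True  # a peeker would have declared victory here
        hits_with_peeking += crossed_at_any_peek
        hits_at_end_only += two_sided_p(conv_a, conv_b, n_per_arm) < alpha
    return hits_with_peeking / n_tests, hits_at_end_only / n_tests


peek_rate, final_rate = simulate_aa_tests()
print(f"false positives when peeking every 100 impressions: {peek_rate:.1%}")
print(f"false positives when checking only at the end:      {final_rate:.1%}")
```

In runs like this, the end-only rate stays close to the nominal 5 percent, while the peeking rate is typically several times higher, even though nothing about the ads differs.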

Sometimes, peeking can still be warranted. In clinical trials of new medicines, for example, experiments are sometimes stopped midway if the preliminary results show great harm to one group or another. In these rare and extreme situations, it’s considered unethical to continue to test. But this is not likely to apply to your marketing A/B tests, which pose very few risks, especially if well-designed.

Practical advice: Define your tests up-front, check in only periodically, and only consider stopping a test early if the results are truly extraordinary, such that continuing the test seems to have real practical costs.

Conclusion



A scientific approach to marketing undoubtedly holds incredible promise for the field. But nothing is foolproof; all too often, marketers deploy subtle scientific tools with only a superficial understanding and end up wasting a tremendous amount of time, energy, and money as a result. To avoid repeating these mistakes, and to realize the benefits of a rational approach, marketing leaders must learn from the mistakes that contributed to the Replication Crisis in the sciences.


Opinions expressed in this article are those of the guest author and not necessarily MarTech. Staff authors are listed here.


About the author

Nathan Labenz
Contributor
Nathan Labenz is the founder and CEO of Waymark, an online video maker. With Nathan at the helm, Waymark is working to bridge the gap between local advertisers and the future of video. Nathan has garnered awards for Waymark including the Google Demo Day “Game Changer” award. Nathan is a graduate of Harvard University and resides in Detroit, Michigan, where Waymark is headquartered.
