Thinking Inside The Boxplot
In a previous post describing a simple approach to de-seasonalizing your data, I covered how marketers can examine, at a rough level, the impact of seasonality on metrics. Obviously, your data science team would be looking at this data in greater depth and, to be sure, a more precise calculation of seasonality is a far more complex undertaking. The basic thought process, however, is a valuable lesson for business people.
Today, let’s try this approach on another topic: using a boxplot to get a general overview of the spread of your data without using “capital S” Statistics, and, again, at a rough level, making smarter “guesstimates” of outcomes.
What Is A Boxplot?
A boxplot looks something like this:
It has a central box with a divider inside, and two “whiskers” going out each side. We order the data and split it into four groups, called quartiles, representing the bottom quarter, the top quarter, and so on. The whiskers in the boxplot represent the data in the top and bottom quartiles, and the box represents the middle two quartiles.
A sharp reader will note that just because we’ve split the data into quartiles, that doesn’t mean each quartile is of equal width. Indeed, one of the advantages of using a boxplot is to get a visual sense of how unevenly your data might be spread out. At any rate, let’s work an example of calculating a boxplot:
Example: Party Goers
Imagine we’re looking at the age of people at a party, as below.
Quite a range in ages! To calculate a boxplot, use the following steps:
- Order the data. Typically, smallest first.
- Identify the biggest and smallest data values. These are the ends of the whiskers.
- Identify the median value. This is the middle value after ordering. Note this is not the average of all the values; it’s simply the value that sits in the middle position of the ordered list. If there are two such values, which happens when you start with an even number of values, just average the two. I put an odd number of people at this small party to keep things simple.
- Identify the “lower quartile,” which is the median of the values between the smallest value and the overall median you just calculated. It marks the top of the lowest 25% of the data, and visually it is where the lower whisker ends and the box begins. Do the same thing on the other side for the “upper quartile.”
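For readers who would rather let a computer do the ordering, the steps above can be sketched in a few lines of Python. This is a minimal sketch, not the method behind the charts in this post. The function name five_number_summary is made up for illustration, and the sketch leaves the overall median out of each half when the count is odd, which is one common convention among several. The roster is also an assumption: it uses the six ages quoted in this post, plus one hypothetical filler ("Guest A", 33).

```python
from statistics import median


def five_number_summary(values):
    """Return (smallest, lower quartile, median, upper quartile, biggest)."""
    ordered = sorted(values)                 # step 1: order the data
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower_half = ordered[:half]              # everything below the middle
    # With an odd count, skip the middle value itself (an assumed convention).
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return (
        ordered[0],                          # end of the lower whisker
        median(lower_half),                  # left edge of the box
        median(ordered),                     # divider inside the box
        median(upper_half),                  # right edge of the box
        ordered[-1],                         # end of the upper whisker
    )


# Ages quoted in the post, plus one HYPOTHETICAL filler guest ("Guest A").
party = {
    "Billy": 22, "Paul": 31, "Guest A": 33, "Sue": 35,
    "Tom": 38, "Roxanne": 48, "Karen": 67,
}

print(five_number_summary(party.values()))
```

With this assumed roster the function prints (22, 31, 35, 48, 67): the two whisker ends, the two box edges, and the median in between.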
As you can see from the above, Billy is the youngest person at the party (22), while Karen is the oldest (67). The median age at the party is 35 (Sue’s age), which is to say half the people at the party are older than Sue, and half are younger than Sue. The lower quartile’s edge is represented by Paul (31), meaning that, of the younger half of the party, Paul is smack in the middle of that group, and similarly, the upper quartile’s edge is represented by Roxanne (48). That took three complex sentences and a mess of words, so here’s where the boxplot of the above comes in: we can visually represent the party as:
The boxplot above is probably far easier to understand in a split second than reading the above paragraph and staring at the numbers. The whole point of a visualization is to summarize large amounts of data quickly.
There’s Billy on the left-most edge of the left whisker, and Karen all the way at the end of the right whisker. There’s Sue as the median, shown by the bold line inside the box, and she’s boxed in by Paul and Roxanne on the left and right sides of the box.
What else can we derive from this? Well, it looks like the spread of the data on the upper range of ages is wider than on the lower end: the younger people at the party are closer to one another in age than the older people are. That the younger ages are more tightly bunched is also reflected by the fact that the bold line, representing the median (Sue, 35), sits toward the left. Sue at 35 is closer in age to the youngest person, Billy (22), than she is to the oldest person, Karen (67).
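The same comparison can be put into numbers using only the five values quoted above. A small Python sketch (the variable names are just for illustration) measures how far the median sits from each whisker end and from each edge of the box:

```python
# Five values quoted in the post: Billy, Paul, Sue, Roxanne, Karen.
smallest, lower_q, mid, upper_q, biggest = 22, 31, 35, 48, 67

lower_spread = mid - smallest   # youngest guest up to the median
upper_spread = biggest - mid    # median up to the oldest guest
box_left = mid - lower_q        # part of the box left of the median
box_right = upper_q - mid       # part of the box right of the median

print(f"below the median: {lower_spread} years, above: {upper_spread} years")
print(f"box left of the median: {box_left} years, right: {box_right} years")
```

The younger half spans 13 years and the older half 32, which is the lopsidedness the boxplot shows at a glance.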
Discovering Gender As A Variable
Now, an astute reader might well have noticed something: the men seem to dominate the younger crowd, whereas women make up the majority of the older crowd. Up to this point you’d have to stare at the numbers to realize that. In fact, Tom, at 38, is the only fellow who is older than the median (Sue, 35), and none of the women are in the younger half of the crowd. Goodness, it’s a Cougar Party!
Let’s use the boxplot technique to make this sort of distinction really stand out. We’ll separate the party goers into two groups, men and women. Then we’ll create a boxplot for each sex. Here are the raw numbers again, this time broken up into Men and Women:
And here are the boxplots of the men versus the women:
The disparity really stands out this way! Now what we suspected earlier, that the younger crowd was bunched together in age more tightly than the older crowd, is far more obvious. It’s because the younger crowd were all men within a fairly narrow age range; the women’s boxplot above shows a wider range, though it looks more evenly distributed. We can also see that the youngest woman at the party is barely younger than the oldest man.
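If you want to try this kind of split yourself, a rough Python sketch follows. The roster is an assumption: the ages for Billy, Paul, Sue, Tom, Roxanne, and Karen are the ones quoted in this post, their sexes are inferred from the text, "Guest A" is a hypothetical filler, and the helper name summarize is made up. To keep things short it reports only three of the five boxplot numbers for each group, and with groups this small the output is illustrative only.

```python
from collections import defaultdict
from statistics import median

# (name, sex, age). "Guest A" is a HYPOTHETICAL filler, not from the post.
party = [
    ("Billy", "M", 22), ("Paul", "M", 31), ("Guest A", "M", 33),
    ("Sue", "F", 35), ("Tom", "M", 38), ("Roxanne", "F", 48),
    ("Karen", "F", 67),
]


def summarize(ages):
    """Return (smallest, median, biggest) for one group of ages."""
    ordered = sorted(ages)
    return ordered[0], median(ordered), ordered[-1]


# Split the ages by sex, then summarize each group separately.
groups = defaultdict(list)
for _name, sex, age in party:
    groups[sex].append(age)

summaries = {sex: summarize(ages) for sex, ages in groups.items()}
for sex, (low, mid, high) in summaries.items():
    print(f"{sex}: youngest {low}, median {mid}, oldest {high}")
```

Even in this cut-down form, the men run from 22 to 38 and the women from 35 to 67, so the two groups overlap only slightly.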
If we hadn’t been one of the astute readers who noticed the trend by staring at the numbers, a boxplot like the above surely would have gained our notice. So boxplots are very useful when it comes to teasing out features that deserve attention.
Making Smart Guesses
I said at the start of this post that boxplots can also help us make guesstimates. How so?
Well, suppose I told you that a new person just joined the party, fashionably late as it were. Can you tell me if it’s a man or a woman if I tell you the person’s age?
If we hadn’t discovered a gender bias in the data related to age, you might’ve been stuck for a guess. After all, men and women are about 50/50 in the population as a whole, so it’d be something of a coin flip. But if I tell you that the person who just walked into this party is 57 years old, I suspect you’d be more likely to guess that the new person is a woman. If I told you the new person was 21, you might be more likely to guess a man.
That sort of tiny insight — which came from a simple boxplot — might be all the edge you need to convert a few more people to use your product or service or otherwise score a “win” at your company.
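One crude way to mimic that gut call in code is to guess whichever group has the median age closest to the newcomer. To be clear, this nearest-median rule is a stand-in for illustration, not a proper statistical classifier. The function name guess_group is made up, and the roster reuses the ages quoted in this post plus one hypothetical filler ("Guest A").

```python
from statistics import median

# Ages by group. The 33-year-old man ("Guest A") is HYPOTHETICAL.
men = [22, 31, 33, 38]      # Billy, Paul, Guest A, Tom
women = [35, 48, 67]        # Sue, Roxanne, Karen


def guess_group(age):
    """Guess 'man' or 'woman' by the nearest group median (a rough heuristic)."""
    gap_to_men = abs(age - median(men))
    gap_to_women = abs(age - median(women))
    # A tie falls through to 'woman'; that choice is arbitrary.
    return "man" if gap_to_men < gap_to_women else "woman"


for newcomer_age in (57, 21):
    print(newcomer_age, "->", guess_group(newcomer_age))
```

With this assumed roster, a 57-year-old newcomer is guessed to be a woman and a 21-year-old a man, matching the hunch described above.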
Using Boxplot Insights For Marketing
Imagine this isn’t just a casual party, as in our example, but, say, a group of people that tweet on a certain topic such as Vitamins. You could use a tool like DemographicsPro for Twitter to analyze the folks who were interested in the topic. If the folks interested in Vitamins were something like the group at the party, you’d know that the group mostly consisted of younger men and older women.
Knowing that the men interested in the topic skew younger, your vitamin company might well consider offers for this group that speak about strength training or muscle mass, or something similar. You aren’t likely to see a great conversion rate if you’re pushing your “golden years” or “ED” vitamin mix at them. Interest from an older person, on the other hand, might well indicate (again based on the age & gender mix of the party attendee data we just analyzed) that a vitamin offering aimed at middle age, and geared towards women, would be fertile ground for conversion.
By the way, using this sort of prior knowledge is a core idea behind Bayesian statistics, where each successive estimate of the probability of something occurring builds on what we’ve observed earlier. That is a subject too complex to handle here, however.
[Note I’m not asking you to calculate the probability for this. And I’ll agree that this small party is perhaps too small to do any meaningful statistical analysis on. But your human gut tells you not to bet on a coin flip of the next person who comes in the door, knowing the age of the person. We know something more here, even though it’s a bit more qualitative than quantitative.
I’d also point out that if it really were a random coin flip on whether the next person through the door is male or female, then there’s nothing particularly wrong with guessing based on what we’ve just observed about age and gender, since any guess is as good as any other when all the outcomes of a random process are equally likely.
P.S. I didn’t really write this section for you, dear Marketer, but rather to satisfy the numbers geek on your team who wouldn’t stop yapping until I did.]
Boxplots are an excellent way to visually look for interesting groupings of data when you’ve got multiple variables. Imagine if we had data on each party goer’s height, or income, or how often they’ve purchased our product, or any number of other things.
Boxplots are easy to set up without having to call in the hardcore techies, and the insight you gain will help you work with the data science team so you can concentrate on those features with the potential for greatest impact on your sales, metrics, or optimization efforts. They also generate interesting paths for testing.
Try one today, and see where it takes you!