Merging Massive Data Sets
You have a hot date lined up and your eyes are set on the perfect little black dress (LBD) to buy from a local store. Just one small caveat — it’s a bit pricey and not really in your budget. Now, you have a few options: beg your mom to chip in, bite the bullet and live off of bread until your next paycheck, try finding the dress discounted online, or hope that the dress goes on sale before your date.
For these options, there are a number of relevant pieces of data: the store’s current price for the dress, the availability and current price of the dress online, the expected date of an upcoming sale, the date of your next paycheck, the chances of you getting a bonus next week, and your mom’s willingness to subsidize you.
But consider this: the information that ends up being really useful depends on when your date actually is. If it’s not until next week, you may consider shopping for a better deal online, or waiting for a sale or your bonus. If it’s tonight, you’ll either call your mom pronto or buy some butter to go with that loaf of bread. The time until your date ultimately limits your actions, and it is the relevant subset of actions that dictates which information is valuable.
While I claim no expertise in matters of the heart, I face the same question every day: which of my data sets matter when helping marketers find their best next customers.
It’s essentially what I do every day as a data scientist at Dstillery, formerly Media6Degrees (m6d). We make sense of 10 billion digital events collected every day from millions of consumers, including mobile usage, clicks, online purchases and Web browsing. And I am not even talking about the additional information that can be bought from dozens of data providers.
Ultimately, determining which data is relevant is contingent on the marketers’ goals and the actions taken to achieve them. What’s more important than the size or nature of data is the clear articulation of marketers’ actions based on it. Unless you have a well-defined idea of what actions you can take, you may find yourself drowning in an avalanche of useless data.
The term “big data” may suggest that bigger is better. But in fact, less is often more, because you never need all the data; you just need the data that is meaningful for the specific goal. The advantage of big data is that you can collect it all, store it, and use whichever part is relevant when a given task calls for it.
The same overarching principle of collect/store/use also applies to merging and preprocessing data. Neither can be done appropriately until we know what we are going to do with the data. So again, it is better to first understand when the hot date is and what your possible actions are before you go to great lengths to clean and process all the data you can possibly get.
In fact, even data with missing fields can be useful in one context and harmful in another. For example, while it would be permissible to replace a missing age field with a zero and add a missing-value flag when the data feeds a predictive model that estimates a buyer’s interest in a product, it would be egregious to do this if all you wanted to calculate was the buyer’s average age, because every zero would drag the average down.
The same is true when we scale the situation up to larger datasets. For example, imagine that in addition to the entire browsing history from the time you visited the site with the LBD, we also had information on the following:
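The age example can be sketched in a few lines of Python. This is only an illustration under assumptions: the `buyers` records, the `age_missing` flag name, and the `impute_for_model` helper are all hypothetical, not part of any real pipeline.

```python
from statistics import mean

# Hypothetical buyer records; None marks an age that was never collected.
buyers = [
    {"id": "a", "age": 34},
    {"id": "b", "age": None},
    {"id": "c", "age": 26},
]

def impute_for_model(record):
    """Fill a missing age with 0 and add a flag so a model can tell the difference."""
    missing = record["age"] is None
    return {
        **record,
        "age": 0 if missing else record["age"],
        "age_missing": int(missing),
    }

# Acceptable as model input: the flag tells the model that the 0 is a placeholder.
model_rows = [impute_for_model(r) for r in buyers]

# Harmful for averaging: the placeholder 0 is counted as a real age.
naive_avg = mean(r["age"] for r in model_rows)

# Appropriate for averaging: only ages that were actually observed.
observed_avg = mean(r["age"] for r in buyers if r["age"] is not None)

print(naive_avg, observed_avg)
```

The same preprocessing step is fine for one action (scoring prospects) and wrong for another (reporting an average), which is why the intended use has to be known before the data is cleaned.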
- Your IP address
- The exact location of the store you visited to try on the LBD
- The app you browsed through to search for a better deal on the LBD
- Device identifier
How can these datasets be merged while maintaining the integrity of the data and providing meaningful insights to the marketer? The first question we must ask is, “What or who are we trying to analyze?” Is it the customer, the product, an account, or an interaction event with the customer? Next, what actionable insight do we want to gain through this analysis? Knowing whether your data is right starts with knowing what you want to get out of your analysis.
For example, if the goal is to serve a coupon through a mobile display ad, then only three of the four data points above will be relevant. But if the marketer wants to run a hyperlocal campaign, adding information about the exact location (what other stores are in the vicinity) is key. If all I have is the IP address, I might aggregate the local information to the DMA (designated market area) level, which I can match to the IP address.
So how do we specifically do it? Initially we just store all events with their ‘native’ set of information. For specific tasks (say, serving display ads to the consumer, represented by a cookie, who is the most likely prospect for a product) we annotate each relevant event with the browsing history for its cookie ID. So in essence, we merge the browsing and purchasing datasets on the fly, using the cookie ID as the key. For targeting by location, all the event information is appended on the fly.
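As a rough sketch of the IP-to-DMA idea, store-level detail can be rolled up to the DMA level and then joined to an IP address through a range lookup. Everything here is assumed for illustration: the `stores` list, the `ip_to_dma` table (which uses reserved documentation address ranges), and the helper names are hypothetical, and a real IP-to-geography mapping would come from a data provider.

```python
import ipaddress
from collections import Counter

# Hypothetical store-level data, each store tagged with its DMA.
stores = [
    {"store": "Dress Boutique", "dma": "New York"},
    {"store": "Shoe Outlet", "dma": "New York"},
    {"store": "Dress Depot", "dma": "Chicago"},
]

# Hypothetical IP-range-to-DMA table (reserved documentation ranges).
ip_to_dma = [
    (ipaddress.ip_network("192.0.2.0/24"), "New York"),
    (ipaddress.ip_network("198.51.100.0/24"), "Chicago"),
]

def dma_for_ip(ip):
    """Return the DMA whose range contains the IP, or None if unknown."""
    addr = ipaddress.ip_address(ip)
    for network, dma in ip_to_dma:
        if addr in network:
            return dma
    return None

# Aggregate the fine-grained store data to the coarser DMA level.
stores_per_dma = Counter(s["dma"] for s in stores)

def local_context(ip):
    """Attach DMA-level information to an event that only carries an IP."""
    dma = dma_for_ip(ip)
    return {"ip": ip, "dma": dma, "stores_in_dma": stores_per_dma.get(dma, 0)}

print(local_context("192.0.2.44"))
```

The exact store location is lost in the roll-up, which is acceptable only because the IP address cannot support anything finer than the DMA level in the first place.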
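A minimal sketch of this store-first, merge-later pattern might look like the following. It is not Dstillery's actual system: the event fields, the example URLs, and the `build_history_index` and `annotate` helpers are hypothetical, chosen only to show raw events being joined on a cookie ID at the moment a task needs them.

```python
from collections import defaultdict

# Hypothetical raw events, stored with only their 'native' fields.
events = [
    {"cookie_id": "c1", "type": "browse", "url": "shop.example/lbd"},
    {"cookie_id": "c2", "type": "browse", "url": "news.example/home"},
    {"cookie_id": "c1", "type": "browse", "url": "deals.example/dresses"},
    {"cookie_id": "c1", "type": "purchase", "item": "lbd"},
]

def build_history_index(events):
    """Group browsed URLs by cookie ID, preserving event order."""
    history = defaultdict(list)
    for event in events:
        if event["type"] == "browse":
            history[event["cookie_id"]].append(event["url"])
    return history

def annotate(event, history):
    """Return a copy of the event with its cookie's browsing history attached."""
    return {**event, "browsing_history": list(history.get(event["cookie_id"], []))}

history = build_history_index(events)

# Merge only for the task at hand: here, purchase events plus browsing history.
purchases = [annotate(e, history) for e in events if e["type"] == "purchase"]

print(purchases)
```

The raw events are never rewritten. A different task, such as location targeting, could build a different index over the same stored events and annotate them with different fields.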
While I won’t comment on the advisability of the date in question, or opine on matters of the heart, the one piece of advice I will give is that you have to get to the heart of the matter of what actions you are going to take and make that the guiding principle in selecting, merging and preparing your big data.