Synthetic data: More than just make-believe

Synthetic data may not be real data, but it has some important real-world use cases in digital marketing.

Digital marketers work with real data all the time. What the online shopper does tells you a lot about what they want. But as we know, you need to be very careful about personally identifiable information (PII).

You can anonymize online shoppers by taking their names off their records before analyzing the data. Or you can use an algorithm to synthesize observed online behavior and use that “synthetic data” for your analysis.

That may seem like overkill. Why go through this effort when you have real data at your fingertips? Synthetic data will not be a replacement for real data, but it does have some specific use cases that a digital marketer may find useful.
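The two options described above, stripping names and synthesizing behavior, can be sketched in a few lines of Python. This is a minimal illustration, not a production method: the records, the field names (`pages_viewed`, `basket_value`) and the helpers `strip_names` and `synthesize` are all hypothetical, and the generator assumes each field is independent and roughly normal, which real synthesizers generally do not.

```python
import random
import statistics

# Hypothetical shopper records; names and fields are made up for illustration.
real_sessions = [
    {"name": "Shopper A", "pages_viewed": 12, "basket_value": 80.0},
    {"name": "Shopper B", "pages_viewed": 8, "basket_value": 55.0},
    {"name": "Shopper C", "pages_viewed": 15, "basket_value": 95.0},
    {"name": "Shopper D", "pages_viewed": 10, "basket_value": 70.0},
    {"name": "Shopper E", "pages_viewed": 9, "basket_value": 60.0},
    {"name": "Shopper F", "pages_viewed": 14, "basket_value": 90.0},
]

BEHAVIOR_FIELDS = ("pages_viewed", "basket_value")


def strip_names(records):
    """Option 1: drop the direct identifier but keep each real record."""
    return [{k: v for k, v in r.items() if k != "name"} for r in records]


def synthesize(records, n, seed=0):
    """Option 2: sample n new records from a normal fit to each field.

    Assumption: fields are independent and roughly normal. No output
    record corresponds to any one real shopper.
    """
    rng = random.Random(seed)
    fits = {}
    for field in BEHAVIOR_FIELDS:
        values = [r[field] for r in records]
        fits[field] = (statistics.mean(values), statistics.stdev(values))
    return [
        {f: max(0.0, rng.gauss(mu, sigma)) for f, (mu, sigma) in fits.items()}
        for _ in range(n)
    ]


anonymized = strip_names(real_sessions)
synthetic = synthesize(real_sessions, n=500)
```

Both outputs can feed the same analysis. The difference is that `anonymized` still holds one row per real shopper, while `synthetic` preserves only the summary statistics the generator was fitted to.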

Synthetic privacy in the real world

“The use case is important,” said Cem Dilmegani, founder of AIMultiple, an AI consultancy based in Germany. In this case, data privacy laws, like Europe’s GDPR, become a factor.

“As a marketer, you need data to run experiments and optimize pricing. This data includes personal data as well,” he said. “Personal data cannot be stored at a certain level.” Synthetic data should bypass the privacy issue by allowing digital marketers to simulate campaigns and outcomes.

“Synthetic data has some limitations when it comes to the representation of real customers and their actual behaviours,” said Maciej Pondel, a researcher and machine learning specialist at Unity Group, a digital commerce firm based in Wroclaw, Poland. “Nevertheless, in situations where there are some restrictions in regular data acquisition possibilities (e.g. GDPR compliance or the limited size of datasets), synthetic data can constitute an excellent representation.”

“[I]n most cases, the anonymization of real data seems better. Anonymized data includes all patterns pulled directly from reality,” Pondel added. “However, when there are sporadic cases or outliers in our data…traditional anonymization methods fail. Even if anonymization provides formal GDPR compliance, companies that don’t use synthetic data to protect outliers can lose their positive image if such anonymized data leaks out.”
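Pondel's point about outliers can be illustrated with a uniqueness check in the spirit of k-anonymity, a standard privacy measure that the article does not name and that is used here only as an illustration. The records, the field names and the helpers `group_sizes` and `unique_records` are hypothetical. The sketch counts how many name-free records share the same combination of attributes; a combination that occurs only once still points to a single person.

```python
from collections import Counter

# Hypothetical records with names already removed.
records_without_names = [
    {"city": "Wroclaw", "age_band": "30-39", "spend_band": "low"},
    {"city": "Wroclaw", "age_band": "30-39", "spend_band": "low"},
    {"city": "Wroclaw", "age_band": "30-39", "spend_band": "low"},
    {"city": "Berlin", "age_band": "40-49", "spend_band": "medium"},
    {"city": "Berlin", "age_band": "40-49", "spend_band": "medium"},
    # The outlier: no other record shares this combination.
    {"city": "Wroclaw", "age_band": "70-79", "spend_band": "very high"},
]

QUASI_IDENTIFIERS = ("city", "age_band", "spend_band")


def group_sizes(records, keys=QUASI_IDENTIFIERS):
    """Count how many records share each combination of attribute values."""
    return Counter(tuple(r[k] for k in keys) for r in records)


def unique_records(records, keys=QUASI_IDENTIFIERS):
    """Return records whose attribute combination appears exactly once."""
    sizes = group_sizes(records, keys)
    return [r for r in records if sizes[tuple(r[k] for k in keys)] == 1]


at_risk = unique_records(records_without_names)
```

Here `at_risk` contains only the high-spending outlier. Anyone who knows those three facts about a person could single that record out even though no name is stored.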

Reality is messy — in a good way

Still, there are advantages to working with the real thing. With real-world data, an analyst can “tease out the nuances and hidden patterns not revealed by other techniques,” said Steven Ramirez, CEO of Beyond the Arc, a San Francisco Bay Area firm specializing in CX, strategic communications and data science. Using an algorithm to synthesize the same data, however, “can introduce a fatal flaw” in identifying those patterns of activity, he said.

Predictive modeling relies on multiple data sources, as well as groups of models, Ramirez said. “There is an opportunity to use synthetic data to extend data sets and provide more data where it is sparse.” It is up to the analyst to understand the integrity of each data source.

“[S]ynthetic data will never be as accurate as real data,” Pondel said. “Even if generated based on real patterns, synthetic data always misses the essential ‘reality factor’, which only makes it useful in a limited number of business cases.”

“You magnify problems getting further away from source data,” Dilmegani said. Most algorithms will replicate the distribution in the source data. “Mistakes are replicated in the synthetic data as well.”
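Dilmegani's point can be shown with the simplest possible generator, one that resamples from the empirical distribution of the source. This is a sketch under assumptions: the basket values are invented, the zeros stand in for a hypothetical tracking bug, and `resample` and `zero_share` are illustrative helpers rather than any vendor's method.

```python
import random

# Hypothetical source data. Assume a tracking bug logged three real
# purchases as 0.0, so 30% of the source values are wrong.
source_basket_values = [0.0, 0.0, 0.0, 42.5, 18.0, 77.3, 25.0, 60.1, 33.9, 51.2]


def resample(values, n, seed=0):
    """Generate n synthetic values by sampling the source with replacement."""
    rng = random.Random(seed)
    return rng.choices(values, k=n)


def zero_share(values):
    """Fraction of values that are exactly zero."""
    return sum(1 for v in values if v == 0.0) / len(values)


synthetic_basket_values = resample(source_basket_values, n=10_000)
```

Comparing `zero_share(source_basket_values)` with `zero_share(synthetic_basket_values)` shows the share of bogus zeros staying close to the 30% in the source. The generator has no way of knowing those values were mistakes, so it reproduces them faithfully.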

Mind the synthetic gap

Machine learning is very data hungry, Dilmegani pointed out. A need may emerge for marketers to purchase synthetic data in order to have enough data to train an AI application. “This will drive the demand for synthetic data,” Dilmegani said.

For example, one application for synthetic data might be to train the AI that will operate a self-driving car. Synthetic data has also been used for the deep-learning applications needed for image processing, Dilmegani noted, a technique that has been around for almost a decade.

“I am skeptical about the uses of synthetic data,” Ramirez countered. “If you are building a machine learning/artificial intelligence model, it is not a good fit.” This goes to the heart of machine learning as it relates to artificial intelligence. About 60% to 80% of the work of building an AI model is spent acquiring and preparing the data, Ramirez explained. Indeed, this process “is the work.”

“The approach is to apply an algorithm or process to be able to create new data points,” Ramirez continued. “Synthetic data is produced by a process that is also subject to bias. Usually, we think of data as the ultimate source of truth…Often, we talk about letting the data speak,” Ramirez said. If the data is manufactured, then what is it saying?

“The smart application of synthetic data in training AI models can also exclude any bias that could be generated from AI models trained on real data,” Pondel said. “Regarding accuracy, in my opinion, synthetic data can be comparable to real data in a few cases.”

Synthetic prediction

Applying synthetic data to digital marketing is going to be an evolution, not a revolution. Applications will be narrow and need-driven. It will become another tool in the toolbox. “At the moment, I recognize simulations and model testing/verification as the most promising area of synthetic data applications,” Pondel said.

Machine learning is data intensive, so the demand for data may drive the use of synthetic data, added Dilmegani. Like many things in machine learning and AI, synthetic data will evolve, Ramirez said. As use cases narrow, digital marketers will get a better sense of when synthetic data is a good fit, and when it is not, he said.



About the author

Marketing Technology
Contributor
Martech is a conference for the growing community of senior-level, hybrid professionals who are both marketing-savvy and tech-savvy: marketing technologists, creative technologists, growth hackers, data scientists, and digital strategists.
