Messy data is your secret weapon — if you know how to use it
Your messiest, most overlooked data could hold your most valuable insights. With LLMs and ETL, even dirty data is a goldmine.
For decades, the rule in data science was simple: clean your data or don’t bother. But that rule is starting to break. Thanks to recent advances in AI and language models, even the messiest, most neglected data sources are becoming valuable — and surprisingly easy to work with.
The reign of clean data
If you mapped the last 20 years of data management — the so-called big data movement — you’d see an explosion of analogies used to describe different collections. Chances are, your business has implemented or at least explored one or more: data lakes, data ponds, data warehouses, data marts, data hubs, data reservoirs, data vaults, data meshes or operational data stores. Hopefully, you’ve also steered clear of the dreaded data swamps, data graveyards and data silos.
Despite their differences, nearly all of these architectures share a single core idea: you want as much clean data as possible at your fingertips. You want it now, and you need it clean.
Several years ago, when I led a large data science team, we developed a set of core beliefs that guided all our work. The first was simple: “Clean is King.” We even made a poster with the tagline, “An hour of cleaning is worth a day of analysis.” That stat was made up, but it felt about right. No one’s disproved it yet.
Dig deeper: The marketer’s guide to conquering data quality issues
Enter the mess: How AI handles dirty data
But things have changed. While the two pillars of data management — architecture and cleaning — remain essential, our ability to work with unstructured and dirty data has transformed in recent years.
LLMs aren’t just for chat. (Chat is arguably the least interesting thing they can do.) Their ability to extract meaning from messy data is remarkable.
This shift fascinates me. Over the years, I’ve encountered many data sources that were far too messy for traditional analysis. Think:
- Clickstream data — millions of URLs, each with a structure that changes from site to site.
- Machine-generated log files, where every application, container and server has its own cryptic format, and custom timestamps and inconsistent error codes must be parsed individually.
- Unstructured text from customer support tickets and social media feeds, filled with slang, emojis, sarcasm and typos that resist simple keyword analysis or categorization. And don’t strip those emojis — they’re dense with meaning.
- Raw telemetry from Internet of Things (IoT) sensors, constantly streaming readings from thousands of devices, often in proprietary binary formats and riddled with signal noise, connection dropouts and calibration drift.
- And that’s before we even touch the vast archives of image and video files, where the real value — like a product defect in a photo or a critical moment in a security feed — is buried deep in the pixels and requires advanced computer vision models to extract.
Dig deeper: How AI makes marketing data more accessible and actionable
Meaning over syntax: The new value layer
There’s a lot of dirty data out there — and you’re probably sitting on a ton of it. In England, there’s a saying: “Where there’s muck, there’s brass.” In American terms, where things are filthy, there’s money to be made. Nowhere is that more true than in business data.
Thanks to recent advances in language and image understanding — like function-calling APIs and strongly typed interfaces — it’s now incredibly easy to build data cleaning workflows that would’ve been unthinkable five years ago.
ETL (extract, transform, load) has become vastly more powerful. And these workflows are perfect for small, local models — free, private and capable of running millions of analyses without API costs or data exposure. Your laptop might get a bit warm, but that’s about it.
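To make that concrete, here is a minimal sketch of the "transform" step in such a workflow: each messy record is sent to a model with a typed schema, and the reply is validated before loading. The schema, prompt and fields are all hypothetical, and the model call is stubbed with a canned JSON reply so the sketch runs offline; in practice it would go to a local or hosted LLM endpoint.

```python
import json

# Hypothetical typed schema the model is asked to fill for each ticket.
SCHEMA = {"sentiment": "positive|neutral|negative", "topic": "string", "urgent": "boolean"}

def call_model(prompt: str) -> str:
    """Stand-in for a local LLM call. Stubbed here with a canned reply
    so the sketch runs offline with no API costs or data exposure."""
    return json.dumps({"sentiment": "negative", "topic": "billing", "urgent": True})

def transform(ticket_text: str) -> dict:
    """Turn one messy support ticket into a clean, typed record."""
    prompt = (
        "Return JSON matching this schema:\n"
        f"{json.dumps(SCHEMA)}\n\nTicket:\n{ticket_text}"
    )
    record = json.loads(call_model(prompt))
    # Enforce the strongly typed interface: reject malformed model output.
    if set(record) != set(SCHEMA):
        raise ValueError("model returned unexpected fields")
    return record

row = transform("I was charged twice and nobody answers the phone!! 😡")
```

The loading half is then ordinary ETL: the validated rows go into whatever warehouse or mart you already run.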
The analysis of dirty data has evolved — from parsing syntax and surface content to extracting meaning and intent. Instead of dissecting URLs to pull out string components, we can now infer what a user was trying to do:
- What they intended.
- What they hoped for.
- Why they clicked.
- Why they bounced.
- Why they bought.
Meaning and intent are where the value is. Syntax? Not so much. We’re not just unlocking new categories of data. We’re moving up the value chain to a higher semantic layer: understanding what people meant.
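The two layers can be sketched side by side. The syntax layer just dissects the URL into string components; the semantic layer labels what the visit meant. The URL is invented, and a hypothetical rule stands in here for the model that would do the labeling in practice:

```python
from urllib.parse import urlparse, parse_qs

# Syntax layer: dissect the URL into string components (the old approach).
url = "https://shop.example.com/cart?item=sku-481&coupon=SAVE10&ref=abandoned-email"
parts = urlparse(url)
params = {k: v[0] for k, v in parse_qs(parts.query).items()}

# Semantic layer: what was the user trying to do? In practice an LLM
# would produce this label; a hypothetical rule stands in for it here.
def infer_intent(path: str, params: dict) -> str:
    if "cart" in path and "coupon" in params:
        return "price-sensitive purchase attempt"
    return "unknown"

intent = infer_intent(parts.path, params)
```

The string components were always extractable; the label is the part that was out of reach until recently, and it is the part the business actually cares about.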
Your hidden goldmine: It’s time to dig deeper
A key part of your competitive advantage lies in what you know that your competitors don’t. Right now, much attention is paid to what LLMs know — but that’s knowledge anyone can access. It’s table stakes, not differentiation. The real edge comes from uncovering what only you can know.
Here’s a challenge: List every data source your company has that’s never been cleaned, explored or valued. What are the digital droppings of your business — the logs, archives, and secondary outputs that aren’t part of your core operations, but might reveal what your customers want, feel, or struggle with? These are the things your competitors can’t see.
Chances are, there’s something in that mess that could transform your business — no matter how dirty it looked before.
Dig deeper: Before scaling AI, fix your data foundations
Contributing authors are invited to create content for MarTech and are chosen for their expertise and contribution to the martech community. Our contributors work under the oversight of the editorial staff and contributions are checked for quality and relevance to our readers. MarTech is owned by Semrush. Contributor was not asked to make any direct or indirect mentions of Semrush. The opinions they express are their own.