What could disrupt the future of generative AI?
Believe it or not, genAI actually relies on humans to be able to do its work. And through regulatory measures, including those designed to protect content creators' rights, humans may throw a spanner in its works.
There’s a lot of talk these days about how generative AI could put people out of work. Not as much thought is given to how people could put generative AI out of work. But they could — and quite possibly will.
GenAI and the foundation models on which it rests are currently at the dizzying peak of the Gartner hype cycle. If Gartner’s model is sound, those tools may be about to plunge into the “trough of disillusionment” before emerging a few years hence on a plateau of useful productivity.
There’s an argument, however, that the trough of disillusionment could swallow genAI products for good. In addition to the risks embedded in relying on what is essentially unconscious and amoral “intelligence,” users also face the very real prospect that copyright and privacy issues could mortally wound large language models (LLMs) like ChatGPT.
Let’s take those in order.
A national Do Not Scrape register?
Publishers monetize content. They do not want third parties monetizing that content without permission, especially as the publishers have likely already paid for it. Professional authors monetize what they write. Nor do they want third parties profiting from their work with no recompense for the creator. Everything I say here about written content applies equally to graphic, video and any other creative content.
We do have copyright laws, of course, that protect publishers and authors from direct theft. Those don’t help much with genAI because it crawls so many sources that the ultimate output may not closely resemble any one individual source (although that can happen).
Right now, publishers are actively looking at ways to block LLMs from scraping their content. It’s a tough technical challenge.
In this video, MarTech contributor Greg Krehbiel discusses ways publishers might try to block LLMs. He also makes a case for changing terms and conditions to prepare the grounds for future lawsuits. As he seems to acknowledge, none of his suggestions are a slam dunk. For instance, is it practicable to stop Google crawling your site to grab content without also stopping it crawling your site to place it in search results? Also, lawsuits are costly.
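One partial answer to that question already exists: both OpenAI and Google publish crawler tokens that a site can address in its robots.txt to opt out of AI training without opting out of search indexing. A minimal sketch (note that robots.txt is voluntary — it only works if the crawler chooses to honor it):

```
# Block OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# Block Google's AI-training crawler token without
# affecting Googlebot's normal search indexing
User-agent: Google-Extended
Disallow: /

# Ordinary search crawling remains allowed
User-agent: *
Allow: /
```

This is precisely why a regulatory backstop is attractive: robots.txt has no enforcement teeth of its own.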
But how about a regulatory fix? Do you remember the endless annoyance of telemarketing calls? The National Do Not Call Registry put a stop to that. Everyone who cared was able to register their number, and telemarketers could continue to call it only at the risk of the FTC imposing hefty fines.
Registering domains with a National Do Not Scrape register might be a heavier lift, but one can see in general terms how such a regulatory strategy might work. Would every infringement be detected? Surely not. But the same goes, for example, for GDPR. GDPR commands compliance not because every infringement is detected, but because those infringements that are detected can result in heavy sanctions — “unprecedentedly steep fines of up to 4 percent of a company’s total global revenue.”
It’s too late. GenAI has the data already
Whether or not there’s a technical or regulatory fix to stop genAI from stealing content, hasn’t that horse already departed the stable? LLMs have already been trained on inconceivably large datasets. They may be prone to error, but there’s a sense in which they know everything.
Well, they know everything up to a couple of years ago. GPT-4 was pre-trained on data with a cut-off of September 2021. That means there’s a lot it doesn’t know. Let’s remind ourselves of what we’re dealing with here.
Dig deeper: Artificial Intelligence: A beginner’s guide
GenAI uses algorithms to predict the next-best-piece-of-text to create, based on all those millions of pieces of text on which it was trained. What makes it “intelligent” is that it can improve its own algorithms based on feedback and response (a human doesn’t have to tinker with the algorithms, although of course she could).
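As a loose illustration only — real LLMs use neural networks trained on billions of documents, not word counts — here is a toy bigram model in Python that “predicts the next-best piece of text” purely from frequencies in its training data:

```python
from collections import Counter, defaultdict

def train(corpus: str) -> dict:
    """Count, for each word, which words follow it in the training text."""
    words = corpus.split()
    model = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        model[prev][nxt] += 1
    return model

def predict(model: dict, word: str) -> str:
    """Return the most frequent continuation seen during training."""
    return model[word].most_common(1)[0][0]

corpus = "the cat sat on the mat the cat ate the fish"
model = train(corpus)
print(predict(model, "the"))  # "cat" follows "the" most often in the corpus
```

The point of the toy is the limitation it shares with the real thing: the model can only ever recombine what was in its training data. Ask it about a word it has never seen and it has nothing to say.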
What genAI doesn’t do — can’t do — is find out stuff about the world that lies outside its data training set. This underlines the point, made by philosophers like Donald Davidson,1 that AI has no causal connections with the world. If I want to know if it’s raining, I don’t rely on a dataset; I look out the window. To put it technically, genAI may have great syntax (grammar), but it’s a stranger to semantics (meaning).
The conclusion to be drawn from this is that AI is wholly reliant on creatures, like us, who are causally connected to the world; who can tell if it’s raining, if there’s a moon in the sky, if Jefferson drafted the Declaration of Independence. So far, it has been dependent on what people have done in the past. To remain relevant it must continue to depend on what people alone can do.
If the ability of LLMs to continue to scrape content created by humans is significantly retarded, they will not be able to add to, update, correct and augment their datasets going forward. The demise of their utility might be slow, but it would be more or less guaranteed.
Hands off my PII!
In addition to the urge of publishers, authors and other creators to keep genAI away from their content, genAI faces another very real problem in the immediate future: the need to somehow guarantee that, in the act of scraping millions of gigabytes of data from the web, it is not inadvertently seizing personally identifiable information (PII) or other types of data protected by existing regulations.
- The FTC opened a probe into OpenAI over consumer protection issues.
- Italy, as was widely reported, simply banned OpenAI and ChatGPT over the handling of personal data as well as the absence of age verification controls. Operations were restored after OpenAI complied with the Italian demands.
- European challenges are by no means over. A sweeping complaint filed in Poland claims that OpenAI is in “systematic breach” of GDPR.
Suffice to say that European courts tend to be more sympathetic to citizens’ rights than to big tech’s profits.
We haven’t even mentioned trust and safety. Those concerns were covered in my recent conversation with Gartner’s AI hype cycle expert Afraz Jaffri, who said:
The first issue is actually the trust aspect. Regardless of external regulations, there’s still a fundamental feel that it’s very hard to control the models’ outputs and to guarantee the outputs are actually correct. That’s a big obstacle.

Dig deeper: What does the future hold for genAI? The Gartner Hype Cycle
Will all this trigger the off switch?
It’s easy to say that genAI is here to stay. Plenty of people have said it. And indeed, a significant — if not entirely novel — development in technology is highly unlikely to be forgotten or abandoned. At a bare minimum, organizations will continue to use these capabilities on their own datasets, or cautiously determined external datasets, and that will meet many important use cases.
Nevertheless, the chances that genAI will be disrupted, constrained and very much altered by some combination of regulatory blocks, legal challenges, trust issues — and other obstacles as yet unseen — are well above zero.
- Donald Davidson, “Turing’s Test”, Mind 59 (1950) ↩︎