Google Is Eating the Web That Feeds It

May 31, 2026

ai-training-data open-web google-search information-ecosystem llm-quality

There’s a specific kind of irony that only reveals itself slowly, like a photograph developing. Google built the most powerful AI summarization engine in history, deployed it at the top of every search result, and in doing so began quietly consuming the ecosystem that made the whole thing possible in the first place. The open web (that sprawling, chaotic, ad-supported mess of blogs and forums and niche publications) is the corpus. And Google is eating it alive.

The Loop That Feeds Itself

Here’s the thing about large language models: they don’t generate knowledge. They compress it. Everything a model “knows” came from somewhere — a Reddit thread, a Stack Overflow answer, a journalist’s investigation, a blogger explaining how they fixed a weird bug at 2am. The model is a distillation of human effort that happened to be publicly accessible on the internet at a particular moment in time.

When Google deploys AI Overviews that answer a question so completely that the user never clicks through, it’s not just disrupting publishers’ ad revenue. It’s severing the economic feedback loop that incentivized creating the content in the first place. The writer who spent three hours researching that post doesn’t get paid. They stop writing. The publication folds. The information stops being produced.

Now ask yourself: what does the next generation of training data look like?

The Closed Loop Problem

I’ve been thinking about this as a kind of epistemic supply chain. Right now, Google’s AI is downstream of the open web. But the open web is downstream of economic viability. And economic viability is increasingly downstream of search traffic. Google controls the bulk of search traffic. So Google controls, indirectly, what information gets created, and therefore what future AI systems can learn from.

This is the trap that doesn’t get talked about enough. The conversation in tech circles tends to focus on whether AI Overviews will cannibalize Google’s own ad revenue, whether the company is disrupting its core business model. That’s real, but it’s the less interesting problem. The more interesting problem is what happens when the web shrinks to a point where new training runs start hitting the same recycled content, over and over, with diminishing signal and compounding noise.

Models trained on models trained on models trained on human writing don’t degrade linearly. They hallucinate more confidently. The errors become subtler and more structurally embedded. You don’t get obviously wrong answers. You get plausible-sounding answers that are slightly off in ways that are hard to detect, because checking them means going back to primary sources that no longer exist.
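To make the recycling problem concrete, here is a toy sketch. It is not how any real model is trained; it only assumes that each "generation" learns topic frequencies from a finite sample of the previous generation's output. The function name, the Zipf-like starting distribution, and all the parameter values are hypothetical choices made for illustration. What it shows is narrow: once a rare topic fails to appear in a sample, it has zero weight and can never come back, so the long tail only shrinks.

```python
import random
from collections import Counter


def simulate_recursive_training(num_topics=200, samples_per_gen=500,
                                generations=10, seed=0):
    """Toy model: each generation re-estimates topic frequencies
    from a finite sample drawn from the previous generation.

    Returns the number of topics with nonzero weight after each
    generation, starting with the original "human" distribution.
    """
    rng = random.Random(seed)
    topics = list(range(num_topics))

    # Generation 0 (assumed): a long-tailed, Zipf-like distribution,
    # a few very common topics and many rare ones.
    weights = [1.0 / (rank + 1) for rank in range(num_topics)]

    surviving = [num_topics]
    for _ in range(generations):
        # "Generate" a corpus by sampling from the current distribution.
        sample = rng.choices(topics, weights=weights, k=samples_per_gen)

        # "Train" the next generation by counting what showed up.
        counts = Counter(sample)
        weights = [counts.get(topic, 0) for topic in topics]

        # A topic with zero weight can never be sampled again.
        surviving.append(sum(1 for w in weights if w > 0))

    return surviving


if __name__ == "__main__":
    for generation, count in enumerate(simulate_recursive_training()):
        print(f"generation {generation}: {count} topics survive")
```

The surviving-topic count can only stay flat or fall from one generation to the next. The sketch says nothing about confidence or hallucination; it only illustrates why resampling from your own output loses the rare material first.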

What Survives the Collapse

Not all information will disappear. Some of it will migrate to paywalls, which AI companies will license. Some will retreat into private Slack channels and Discord servers and group chats, inaccessible to crawlers. Academic research will persist in PDFs behind institutional logins. But the rich seam of casual, practical, experience-based knowledge (how to wire a three-way switch, how to negotiate a lease, how to debug a race condition) will quietly vanish from the indexable web, because that knowledge lives in the kind of content that ad revenue made viable and AI Overviews made economically pointless.

What you’re left with is a training corpus that skews heavily toward official sources, sponsored content, and whatever managed to survive behind a paywall — a less diverse, less weird, less human set of inputs. In my experience, the weird and the human parts are exactly what make LLM outputs feel useful rather than bureaucratic.

Google’s Actual Bet

To be fair to Google, they’re not being malicious. They’re optimizing for what users want in the short term (immediate answers, no friction), and that’s a completely rational response to competitive pressure from ChatGPT and Perplexity and everyone else racing to remove the click from the search loop. The externality they’re creating, the slow devitalization of the web, doesn’t show up on any quarterly metric. My guess is that it’ll show up in model quality in three to five years, diffuse enough to blame on something else.

There’s also a version of this that ends okay. Maybe micropayments finally work at scale. Maybe some hybrid emerges where AI systems route meaningful compensation back to sources. Maybe the open web proves more resilient than I’m giving it credit for, sustained by people who write for reasons other than ad revenue.

But I find myself genuinely uncertain whether we’ll notice the degradation in time to do anything about it. The model always sounds confident. That’s kind of the point. And if the web that trained it is gone, there’s no easy way to check its work.