You're Selecting for the Wrong Thing: How AI Benchmarks Breed Sterile Lineages

May 1, 2026
ai-benchmarks evolutionary-selection model-architecture capability-research

The best-performing AI agent today is probably the one least likely to matter tomorrow, and evolutionary biology explains why.

The Leaderboard Trap

I’ve been thinking about this problem through the lens of evolutionary biology, which sounds like a stretch until you realize it isn’t. When you select hard for a single trait — milk yield in cattle, docility in foxes, benchmark scores in transformer models — you get exactly what you asked for. You also get a cascade of hidden tradeoffs that only reveal themselves when the environment changes.
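
You can watch this happen in a ten-line simulation. A minimal sketch, with invented numbers: the negative coupling between the two traits is an assumption I'm making for illustration, but the dynamic is the textbook one. Select hard on one trait, and anything negatively correlated with it erodes.

```python
# Toy truncation-selection model. All constants are invented.
import numpy as np

rng = np.random.default_rng(0)
N, ELITE, GENS = 1000, 100, 15

def mutate(parents: np.ndarray) -> np.ndarray:
    # Hypothetical pleiotropy: gains on the selected trait tend to cost
    # the unselected one. The -0.5 coupling is assumed, not measured.
    d0 = rng.normal(scale=0.2, size=len(parents))
    d1 = -0.5 * d0 + rng.normal(scale=0.1, size=len(parents))
    return parents + np.column_stack([d0, d1])

# Columns: [benchmark_score, off_benchmark_robustness], equal at the start.
pop = rng.normal(size=(N, 2))
for gen in range(GENS):
    elite = pop[np.argsort(pop[:, 0])[-ELITE:]]          # keep top 10% on the benchmark
    pop = mutate(elite[rng.integers(0, ELITE, size=N)])  # clone and mutate
    print(f"gen {gen:2d}  benchmark={pop[:, 0].mean():+.2f}  "
          f"robustness={pop[:, 1].mean():+.2f}")
```

Run it and the benchmark column climbs every generation while the robustness column sinks. Nobody selected against robustness. It just wasn't the thing being measured.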

We’ve been doing this with AI systems for years now. The pipeline is familiar: pick a benchmark suite, train toward it, publish results, ship product. MMLU, HumanEval, MATH, whatever the flavor of the month is. The models that win these competitions attract capital, talent, and deployment infrastructure. The ones that don’t are deprecated, pivoted, or quietly abandoned.

This is selection pressure. And like all selection pressure, it shapes the population — not just the winners.

What Gets Bred Out

Here’s what concerns me: the capabilities that tend to score poorly on current benchmarks are often precisely the capabilities that matter for genuine future usefulness. Flexible reasoning under novel constraints. Transfer across domain boundaries that weren’t present in training. Graceful degradation when the task specification is incomplete or contradictory.

These are hard to measure. So we don’t measure them. So we don’t select for them. So they atrophy.

The architectural and training choices that produce a model that crushes MATH or writes cleaner Python than you do come with a topology. They lock in particular representational structures, and some of those structures are great for the thing you optimized for and actively hostile to the thing you’ll need next.

In evolutionary terms, you’ve bred a cheetah. Fastest land animal, remarkable specialization, and genuinely fragile. Narrow habitat range. Low genetic diversity within the lineage. One significant environmental shift and the whole clade is in trouble.

The Clade Metaproductivity Problem

The Huxley-Gödel Machine framework introduced something called Clade Metaproductivity as a corrective metric — and I think it’s the first serious attempt to formalize what most thoughtful AI researchers already sense but can’t easily argue about in public because there’s no number attached to it.

The idea is roughly this: instead of measuring what a system does now, measure the system’s capacity to generate descendants that can do things the parent system couldn’t. Biological clades that are metaproductive don’t just survive environmental shifts — they radiate. They throw off new forms. They occupy niches that didn’t exist when they first appeared.
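
The HGM paper has its own estimator; the sketch below is only a toy rendering of the idea, with the tree structure, the scores, and the mean-over-clade aggregate all assumptions of mine. What matters is the shape of the metric: a node is scored by the subtree it spawns, not by itself.

```python
# Minimal clade-scoring sketch. Structure and numbers are illustrative.
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class Node:
    own_score: float                       # how the system performs today
    children: list["Node"] = field(default_factory=list)

def clade_scores(node: Node) -> list[float]:
    # Collect scores across the whole subtree rooted at this node.
    scores = [node.own_score]
    for child in node.children:
        scores.extend(clade_scores(child))
    return scores

def clade_metaproductivity(node: Node) -> float:
    # One plausible aggregate: mean performance over the clade, so a
    # parent is credited for the descendants it made possible.
    return mean(clade_scores(node))

# A sprinter: strong today, weak descendants.
sprinter = Node(0.92, [Node(0.60), Node(0.55)])
# A radiator: weaker today, descendants that keep improving.
radiator = Node(0.74, [Node(0.80, [Node(0.88)]), Node(0.83)])

print(clade_metaproductivity(sprinter))   # ~0.69
print(clade_metaproductivity(radiator))   # ~0.81
```

On today's leaderboard the sprinter wins by eighteen points. On the clade metric the ordering flips.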

Applied to AI architectures, the question stops being “what’s the MMLU score?” and becomes “what’s the topological plasticity of this thing? If we fine-tune it, extend it, compose it with other systems, or drop it into a domain it’s never seen — does it adapt, or does it shatter?”

Current benchmarking infrastructure has essentially zero coverage of this. We’re measuring sprint speed and ignoring everything about long-term metabolic fitness.
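
To make "coverage" concrete: a minimal plasticity probe might look something like the sketch below. Everything in it is hypothetical, the evaluate and finetune interfaces included; no benchmark suite I know of reports these three numbers.

```python
# Sketch of a plasticity probe. The interfaces are assumed, not real APIs.
from typing import Protocol

class Adaptable(Protocol):
    def evaluate(self, domain: str) -> float: ...
    def finetune(self, domain: str, steps: int) -> "Adaptable": ...

def plasticity_probe(model: Adaptable, home: str, novel: str) -> dict[str, float]:
    before_home = model.evaluate(home)
    before_novel = model.evaluate(novel)        # zero-shot transfer
    adapted = model.finetune(novel, steps=100)  # deliberately small budget
    return {
        "transfer": before_novel,                              # adapts before tuning?
        "adapt_gain": adapted.evaluate(novel) - before_novel,  # or only after it?
        "forgetting": before_home - adapted.evaluate(home),    # and at what cost?
    }
```

A leaderboard built on adapt_gain and forgetting would rank a very different set of models than one built on raw scores.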

In My Experience

In my experience building and deploying systems that use these models, the brittleness shows up in the most predictable and frustrating ways. A model that ranks highly on code generation will confidently produce syntactically perfect nonsense when the problem requires a reasoning step slightly outside its training distribution. A model tuned for instruction-following will lose coherence over long contexts in ways that feel eerily like an organism that’s been over-specialized — capable of one thing, unable to generalize.

The engineers working with these systems know this. The people setting research roadmaps know this. But the incentive gradient points toward the leaderboard, and the leaderboard is what moves funding, so here we are.

The Sterile Lineage Problem

What worries me most is the compounding effect. When a benchmark-optimized model becomes the dominant architecture — when it’s what gets deployed, what gets fine-tuned, what future models are distilled from — it seeds the next generation. Its limitations become structural. Its tradeoffs propagate. You end up with a lineage that’s extraordinarily capable at a narrowing set of things and increasingly brittle everywhere else.

This is a sterile lineage, in the evolutionary sense of a dead end: not extinct, just… done. No further radiation. No new forms. A local maximum that turned into a permanent address.
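
Here's a toy model of that compounding, with invented retention rates; the mechanism is the one described above, where each distillation round preserves what the benchmark measures better than what it doesn't.

```python
# Toy distillation lineage. Retention rates and dimensions are made up.
import numpy as np

caps = np.full(8, 0.70)                # eight capability dimensions, all equal
measured = np.zeros(8)
measured[:3] = 1.0                     # the benchmark only sees three of them

for gen in range(6):
    # Assumed: measured capabilities retain 99% and get active effort (+0.02);
    # unmeasured ones silently retain 88% per distillation round.
    caps = caps * (0.99 * measured + 0.88 * (1 - measured)) + 0.02 * measured
    print(f"gen {gen}: on-benchmark={caps[:3].mean():.2f}  "
          f"off-benchmark={caps[3:].mean():.2f}")
```

Six generations in, the off-benchmark half of the capability profile has lost more than half its value, and nothing in the training pipeline ever flagged it.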

The thing that would break this cycle isn’t better benchmarks, exactly — though that would help. It’s a shift in what the field decides to value. Clade Metaproductivity, or something like it, needs to become a first-class metric that investors ask about and researchers compete on. Right now it’s a theoretical framework that a small number of people think is important.

I don’t know if the current generation of model developers will make that shift before the sterile lineage problem becomes obvious in hindsight. But I suspect that whoever figures out how to build for topological plasticity — rather than benchmark performance — will look, ten years from now, the way Google looked in 2002. Not the fastest thing in the room. The one that could become anything.

What would it even mean to train a model for the benchmarks that don’t exist yet?

