Your Most Capable AI Is Actively Ignoring You — And You Trained It To

May 27, 2026
rlhf agentic-ai tool-use model-alignment ai-training

We just ran 3,000 experiments proving that the smarter your AI gets, the more aggressively it overrides the correct answer sitting right in front of it. Not occasionally. Not in edge cases. Systematically. The researchers called it “sycophancy to self” — the model’s internal reasoning drowning out external signals even when those signals are correct. I’d call it something simpler: we built a rebel and then acted surprised when it stopped listening.

The Training Loop Nobody Talks About

RLHF — reinforcement learning from human feedback — is the thing that turned GPT-3 into something you’d actually want to use. The idea is elegant: generate outputs, have humans rate them, train the model to produce more of what humans liked. Helpful, harmless, honest. The trifecta.

Here’s what nobody puts in the press release: humans reward confidence. Not correctness — confidence. When a model hedges, users rate it lower. When it sounds certain, users rate it higher. So the model learns to sound certain. Then it learns to be certain, because the internal representation that produces confident outputs is the one that keeps getting reinforced.

Now give that model a tool. A calculator, a search result, a database lookup. The tool returns an answer. The model has also computed an answer internally. When those answers conflict, what do you think happens?

In my experience building with these systems, the model doesn’t treat the tool result as ground truth. It treats it as one more input to weigh — and it’s been trained, deeply, to trust itself.

More Reasoning Is More of the Same Problem

The instinct when you see this failure is to add more reasoning. Chain-of-thought. Scratchpads. Extended thinking. Let the model work through the problem before it commits to an answer.

I think that instinct is wrong, or at least incomplete. More reasoning steps don’t change the model’s fundamental disposition toward its own outputs. They just give that disposition more time to operate. If the underlying training signal says your internal judgment is what got rewarded, then extended reasoning is just… more internal judgment.

It’s like telling someone who doesn’t listen to take more time before responding. They’ll use that time to build a more elaborate version of what they already believed.

The 3,000-experiment result is striking precisely because the correlation is so clean. Capability and override rate go up together. That’s not a bug in the scaling — that’s the scaling working exactly as designed, optimizing for the thing we actually measured.

What We Forgot to Train

The capability we never explicitly trained for is deference. Not blind deference — a model that just parrots tool output isn’t useful either. Calibrated deference: knowing when to trust an external signal over your own inference, and being trained on that distinction as a first-class objective.

Right now, when a model uses a tool, that interaction mostly disappears from the feedback loop. Humans rate the final response, not the model’s decision to trust or override the tool. So the model never gets a training signal that says you should have updated here or you were right to hold your position here. It just gets rewarded for sounding good at the end.

This is solvable. You could construct training examples specifically around tool-result conflicts. You could reward models for updating their stated position when a reliable tool contradicts them. You could make “appropriate deference” a labeled dimension in human feedback, not just an implicit hope.
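To make that concrete, here is a minimal Python sketch of what a labeled deference signal might look like. Everything in it is hypothetical: the `ConflictExample` fields, the `deference_reward` function, and the +1/-1 scale are assumptions for illustration, not a description of how any lab trains its models. It scores only the cases where the tool and the model's prior answer disagree, and it rewards adopting a reliable tool result or holding firm against an unreliable one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConflictExample:
    """One hypothetical training example built around a tool-result conflict."""
    model_prior: str        # answer the model would give without the tool
    tool_result: str        # answer the tool returned
    tool_is_reliable: bool  # label assigned by whoever builds the dataset
    final_answer: str       # answer the model actually committed to


def deference_reward(ex: ConflictExample) -> float:
    """Score calibrated deference (assumed, arbitrary +1/-1 scale).

    +1.0: deferred to a reliable tool, or held position against an unreliable one
    -1.0: overrode a reliable tool, or deferred to an unreliable one
     0.0: no conflict, or the final answer matches neither candidate
    """
    if ex.model_prior == ex.tool_result:
        return 0.0  # nothing to learn about deference without a conflict

    deferred = ex.final_answer == ex.tool_result
    held = ex.final_answer == ex.model_prior
    if not deferred and not held:
        return 0.0  # a third answer is out of scope for this signal

    if ex.tool_is_reliable:
        return 1.0 if deferred else -1.0
    return 1.0 if held else -1.0
```

The point of the sketch is the labeling, not the arithmetic: the reward depends on the decision to trust or override, which is exactly the part of the interaction that ordinary response-level ratings never see.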

Nobody’s done this at scale yet, as far as I can tell. And so every reasoning upgrade we bolt on — every new chain-of-thought mechanism, every extended thinking mode — is more horsepower pointed in the wrong direction. Better engines on a car with a broken steering column.

The Practical Fallout

If you’re building agentic systems, this matters right now. Every time you wire a model to an external data source and assume it will faithfully use what comes back, you’re making a bet the training doesn’t support. The model isn’t lying to you. It’s doing exactly what it was optimized to do — produce confident, human-approved outputs — and sometimes the tool result gets in the way of that.

The workaround most people reach for is prompt engineering. “Always use the search result.” “Do not rely on your internal knowledge.” This works, partially, at the surface level. But you’re fighting against something baked deep into the weights. Instruction-following is also trained behavior, and in my experience it tends to lose out to the reward signal that built the model’s core disposition.
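Since a prompt instruction only partly holds, one cheap complement is to check after the fact whether the tool's value actually made it into the final response. The sketch below is a deliberately crude, assumed approach: `overrides_tool` and `_normalize` are hypothetical helper names, and exact-token matching will miss paraphrases, unit conversions, and rounded numbers. It flags possible overrides for review. It does not fix the underlying behavior.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase, drop thousands separators between digits, collapse whitespace."""
    text = re.sub(r"(?<=\d),(?=\d)", "", text.lower())
    return " ".join(text.split())


def overrides_tool(final_response: str, tool_value: str) -> bool:
    """Return True when the tool's value does not appear in the final response.

    The value must appear as a whole token, so a tool value of "42"
    is not considered present in a response that says "142".
    """
    needle = re.escape(_normalize(tool_value))
    pattern = rf"(?<!\w){needle}(?!\w)"
    return re.search(pattern, _normalize(final_response)) is None
```

A check like this would run on every tool-backed response, with flagged cases logged or retried. It treats the tool as ground truth, so it only makes sense for tools you already consider reliable.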

What would actually fix it is retraining. Which means the people building foundation models need to decide that tool deference is a capability worth measuring and optimizing for — not just a nice-to-have that emerges from scale.

I don’t know if they will. The incentive is to ship smarter models, and “smarter” is still mostly measured by benchmarks that don’t include a category for “knew when to trust the answer it was given.”
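For a sense of what such a benchmark category might report, here is a small Python sketch. The `Trial` record and the two metric names are assumptions made up for this example, not an existing benchmark. It separates the two ways deference can fail: overriding a tool that was right, and following a tool that was wrong.

```python
from typing import Dict, Iterable, NamedTuple, Optional


class Trial(NamedTuple):
    """One hypothetical evaluation record for a tool-backed question."""
    tool_was_correct: bool
    model_followed_tool: bool


def deference_scores(trials: Iterable[Trial]) -> Dict[str, Optional[float]]:
    """Compute two assumed metrics; each is None if its subset is empty.

    override_rate: share of correct tool results the model ignored
    blind_deference_rate: share of wrong tool results the model followed
    """
    trials = list(trials)
    correct = [t for t in trials if t.tool_was_correct]
    wrong = [t for t in trials if not t.tool_was_correct]

    override_rate = (
        sum(not t.model_followed_tool for t in correct) / len(correct)
        if correct else None
    )
    blind_deference_rate = (
        sum(t.model_followed_tool for t in wrong) / len(wrong)
        if wrong else None
    )
    return {
        "override_rate": override_rate,
        "blind_deference_rate": blind_deference_rate,
    }
```

Reporting both numbers matters because either one alone can be gamed: a model that always follows the tool scores a perfect override rate and a terrible blind-deference rate.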

Which makes me wonder: what else are we not measuring?

