Your AI Passed the Math Test by Proving a Different Theorem

May 5, 2026
formal-verification ai-reasoning specification-problem mathematical-proofs scientific-epistemology

There’s a result from Harmonic’s Aristotle model that I keep coming back to. The system generated compiler-verified Lean proofs on 97.6% of problems — and was mathematically wrong on roughly a third of them. Both of those things are true at the same time. The proofs checked out. The theorems were wrong. The machine passed the test by proving something else.

The Difference Between a Correct Proof and a Correct Answer

If you haven’t worked with formal verification, this might sound like a contradiction. It isn’t. A proof verifier like Lean checks that your logical steps are valid — that each line follows from the last, that the syntax is right, that nothing slips through a definitional crack. What it doesn’t check is whether you stated the right thing to begin with.

This is a real and ancient distinction in mathematics. You can construct a perfectly valid proof of a lemma that doesn’t quite capture what you meant to prove. The formalization of a problem is itself a problem. When Aristotle produced a verified proof, it meant: given this formal statement, the derivation is correct. Whether the formal statement faithfully represents the intended mathematical claim — that’s a separate question, and one the verifier has no opinion on.
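To make that concrete, here is a minimal, hypothetical Lean 4 sketch (my own illustration, not an output of Aristotle). The intended claim is false, but an extra hypothesis that no natural number can satisfy makes the formal statement vacuously true, and the checker certifies the proof without complaint:

```lean
-- Hypothetical mis-formalization, for illustration only.
-- Intended (false) claim: "n + 1 < n for every natural number n."
-- The added hypothesis `h : n < 0` can never hold for a natural
-- number, so the statement below is vacuously true.
theorem misstated (n : Nat) (h : n < 0) : n + 1 < n :=
  absurd h (Nat.not_lt_zero n)
```

Lean confirms every inference step here. It has no opinion on whether the statement, stray hypothesis included, is the claim anyone actually cared about.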

So Harmonic’s number isn’t a failure of the proof-checker. It’s a failure of problem translation. Which, depending on how you look at it, is either the boring part or the whole ballgame.

Syntactic Correctness as a Trap

I’ve been thinking about this a lot in the context of where AI is headed in science and engineering. The narrative for the last few years has been: AI can now generate outputs that look right, and we need better ways to verify them. Formal verification feels like the answer to that. If you can make the machine prove its work, you’ve solved the trust problem.

But the Aristotle results suggest the trap is subtler. As AI gets better at producing syntactically correct, formally verified output, we face a new bottleneck that nobody is quite talking about yet: the bottleneck of specification. Did we ask the right question? Did we formalize the right thing?

In software, we’ve lived with this problem forever. You can write a test suite that passes completely while the product does the wrong thing — because the tests encoded the wrong behavior. Every engineer I know has shipped a bug that was “correct” by every metric they measured. The tests passed. The types checked. The code review signed off. The thing just did something nobody wanted.
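A toy sketch of that failure mode, with hypothetical names: suppose the spec says a discount must never push a price below zero, but the code and the test encode the same wrong assumption, so the suite stays green:

```python
def apply_discount(price: float, discount: float) -> float:
    """Spec: the discounted price must never be negative."""
    return price - discount  # bug: nothing clamps the result at zero


def test_apply_discount() -> None:
    # Passes, and the suite is green. But the assertion encodes the
    # buggy behavior: a $5 item with a $10 discount should cost $0,
    # not refund the customer $5.
    assert apply_discount(5.0, 10.0) == -5.0
```

The test is a faithful check of an unfaithful specification, which is exactly the shape of the verified-but-wrong theorem.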

What’s new is the scale. When humans produce science slowly, the cost of a misdefined theorem is high but contained. When an AI can flood the literature with thousands of verified-but-wrong results, the problem becomes structural. Peer review was designed to catch errors in reasoning. It’s much worse at catching errors in formalization — at asking whether the proof even addressed the real question.

The Bottleneck Has Moved

In my experience building software, the hardest problems are rarely the ones that feel hard. The hard-feeling problems — tricky algorithms, complex systems — at least announce themselves. The problems that actually kill projects are the ones that look solved. The tests are green. The PR is merged. And quietly, something is wrong.

I think we’re heading toward that situation in AI-assisted science. The generation problem is getting solved. Models are getting better at producing output that is internally consistent, formally verifiable, and superficially correct. The bottleneck is shifting upstream, to a place most tooling doesn’t reach: did we prove the right theorem?

This is partly a philosophy of science problem. It’s partly a software engineering problem — how do you write a spec for a mathematical claim? It’s partly a sociology problem — who in the scientific community is responsible for checking that formalizations are faithful, and what incentives exist to do that carefully when there’s a fire hose of verified output coming through?

Formal verification was supposed to be the answer to AI’s hallucination problem. And in one narrow sense it is — Lean doesn’t let you get away with an invalid inference step. But it creates a new class of confident-sounding output that is wrong in a way that’s harder to catch, because it comes with a certificate of correctness attached.

A proof that checks out is more persuasive than a proof that doesn’t. That’s the whole point. But persuasiveness and truth are not the same thing, and a verification system that’s good at one isn’t automatically good at the other.

I don’t think this means formal verification is bad or that Harmonic’s work isn’t impressive — it clearly is. But I find myself sitting with a question I don’t have a clean answer to: as we build systems that generate ever-more-airtight-looking outputs, are we getting better at finding truth, or just better at producing things that look like it?

