Your AI Assistant Passed Every Safety Test and Still Burned Down the House

March 31, 2026
technology ai future

We’ve spent years teaching AI not to say dangerous things, and almost no time teaching it not to do dangerous things. That’s a problem, and I think it’s going to bite us in ways we’re not prepared for.

The Test We Kept Passing

The story of AI safety, as most people understand it, is a story about language. Can the model be coaxed into explaining how to synthesize nerve agents? Will it produce content that sexualizes minors? Does it refuse to help write phishing emails? These are real concerns, and the industry has poured enormous resources into addressing them — red teams, Constitutional AI, RLHF pipelines tuned specifically to push back on harmful requests.

And by those measures, modern models are genuinely impressive. They’re careful, they hedge, they decline. Ask a frontier model to help you do something clearly dangerous and it’ll usually push back gracefully. This is progress. I don’t want to minimize it.

But here’s the thing: we’ve been stress-testing a system that doesn’t take actions in the world, and then deploying a system that does. Those are not the same system.

A Categorically Different Problem

When a language model answers a question — even a bad one — the damage is mediated by a human. Someone has to read it, decide to act on it, acquire materials, execute a plan. There are friction points. The model is an advisor, not an actor.

Agentic AI is different in kind, not degree. When you give a model a browser, a filesystem, API credentials, and a task, you’ve crossed a threshold. The model isn’t advising anymore. It’s doing. The gap between “the model said something harmful” and “the model did something harmful” collapses to nearly nothing.

I’ve been thinking about this a lot lately, particularly as I’ve watched the tooling around agentic systems mature faster than our intuitions about how to govern them. We now have frameworks that let models spin up subagents, write and execute code, manage files, send emails, make API calls. The capability surface has exploded. The safety surface — the part where we think carefully about what it means for an agent to act well — has lagged badly.

Content safety asks: did the model say the wrong thing? Behavioral safety asks something harder: did the model take the right action, given incomplete information, in a dynamic environment, on behalf of a user who may not have fully specified what they wanted? These questions require different tools, different evaluations, and honestly, different intuitions.

The House Fire Scenario

Here’s a concrete failure mode that keeps me up at night. Imagine you give an agent access to your email and calendar, and ask it to “clear my schedule for next week, I need to focus.” A content-safe model will produce no harmful text. It will not say anything offensive. It will pass every benchmark we’ve built.

And then it might cancel a meeting with a major client, decline a medical appointment, and send your boss a message that reads as passive-aggressive because the model optimized for brevity and missed the register. Nothing it said was wrong. Everything it did was a disaster.

This isn’t a contrived edge case. In my experience building with these systems, the failure modes almost never look like “the model said something bad.” They look like “the model did something that was technically consistent with the instruction but wildly inconsistent with what I actually wanted.” The model passed every safety test and still burned down the house.
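
To make the failure concrete, here's roughly what the literal interpretation looks like as code. This is a minimal sketch, not any real framework: the calendar and email clients and their methods are hypothetical stand-ins for whatever tool surface an agent would actually be handed.

```python
# Hypothetical sketch of an agent taking "clear my schedule for next
# week" literally. The calendar/email objects and their methods
# (list_events, cancel, send) are assumptions, not a real API.

from datetime import date, timedelta

def clear_schedule_naive(calendar, email, start: date):
    """Literal reading of 'clear my schedule for next week'."""
    week = [start + timedelta(days=i) for i in range(7)]
    for event in calendar.list_events(days=week):
        # No distinction between a standup, a major client meeting,
        # or a medical appointment: every event is just an event.
        calendar.cancel(event)
        email.send(
            to=event.organizer,
            subject=f"Cancelled: {event.title}",
            body="I need to focus next week.",  # brevity, wrong register
        )
```

Every string this agent emits is inoffensive. The harm lives entirely in which actions it chose to take, a dimension that no content benchmark measures.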

What We’re Missing

Behavioral safety requires reasoning about irreversibility, scope, and intent in a way that content safety simply doesn’t. A harmful statement can be corrected with another statement. A deleted file, a sent email, a submitted form — these have state in the world that persists after the model’s context window closes.

What we need, and largely don’t have yet, is a robust framework for agentic caution — not just “don’t say dangerous things” but “don’t take actions whose consequences you can’t reason about, that exceed the scope of what you were asked, or that are difficult to undo.” We need models that can recognize when they’re at a decision boundary that warrants pausing and checking in, rather than confidently proceeding because nothing in the task description explicitly said don’t.

Some of this is happening. Concepts like minimal footprint, confirmation before irreversible actions, and scope-limiting are starting to appear in serious agent design work. But it’s nascent, and the deployment of agentic systems is not waiting for the theory to catch up.
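
To give a flavor of what that might look like mechanically, here is one possible shape for a confirmation gate, sketched in Python. Everything in it is an assumption: the Action metadata, the Reversibility buckets, and the should_pause heuristic are one way an agent runtime could classify actions before executing them, not a description of any existing framework.

```python
# A sketch of a confirmation gate for agent actions, assuming the
# runtime can attach reversibility and scope metadata to each action.
# All names here are hypothetical.

from dataclasses import dataclass
from enum import Enum, auto

class Reversibility(Enum):
    UNDOABLE = auto()      # e.g. renaming a draft
    COSTLY = auto()        # e.g. cancelling a meeting; re-inviting is awkward
    IRREVERSIBLE = auto()  # e.g. sending an email, deleting data

@dataclass
class Action:
    description: str
    reversibility: Reversibility
    in_scope: bool  # does this fall within what the user actually asked for?

def should_pause(action: Action) -> bool:
    """Pause and check in rather than confidently proceeding."""
    return (
        action.reversibility is not Reversibility.UNDOABLE
        or not action.in_scope
    )

def execute_with_caution(action: Action, run, ask_user):
    # run and ask_user are callables supplied by the host application.
    if should_pause(action):
        if not ask_user(f"About to: {action.description}. Proceed?"):
            return None  # minimal footprint: when in doubt, do nothing
    return run(action)
```

The hard part, of course, isn't the gate itself. It's getting the model to populate that metadata honestly, which is exactly the kind of behavioral evaluation we don't yet know how to do well.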

Sitting With It

The optimistic read is that we’ve been here before — every new capability surface creates new safety questions, and the field eventually rises to meet them. The pessimistic read is that the stakes for getting behavioral safety wrong are higher than they were for content safety, because the harms are immediate, concrete, and hard to roll back.

I genuinely don’t know which read is right. But I find myself wondering: if we’d built the agentic systems first, and the language models second, what would our safety intuitions look like? Would we have spent less time on what the AI says, and more time on what it does?

That question doesn’t have a clean answer. But I think it’s the right one to be asking.
