Your KV-Cache Is Doing the Brain's Worst Job

Thu, 16 Apr 2026 02:33:39 +0000

The most expensive part of running a long-context transformer isn’t the attention computation itself. It’s that we built a memory system where the thing that computes and the thing that persists are the same mechanism, which is exactly backwards from how the brain, the only proven long-context memory system, works.

I’ve been thinking about this a lot lately, partly because the costs of running long-context inference are genuinely surprising until you look at what’s actually happening. The KV-cache holds the key and value projections for every token in the context window, at every layer. It grows linearly with context length, sits in high-bandwidth memory, and gets read in full on every decoding step. It’s not a bug; it’s the design. But the design has a hidden assumption baked into it that I think is worth pulling apart.
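To make the linear growth concrete, here is a rough back-of-the-envelope sketch. The function name and the model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) are illustrative assumptions of mine, not figures for any particular model, and real runtimes add overhead this ignores.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, batch_size=1):
    """Approximate KV-cache size in bytes (hypothetical helper).

    The leading 2 counts one key tensor and one value tensor per layer.
    bytes_per_elem=2 assumes 16-bit storage (fp16 or bf16).
    """
    return (2 * batch_size * seq_len * n_layers
            * n_kv_heads * head_dim * bytes_per_elem)


# Assumed, illustrative model shape; not taken from a specific model.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

per_token = kv_cache_bytes(1, LAYERS, KV_HEADS, HEAD_DIM)
print(per_token // 1024, "KiB per token")        # 128 KiB per token

for seq_len in (8_192, 32_768, 131_072):
    size = kv_cache_bytes(seq_len, LAYERS, KV_HEADS, HEAD_DIM)
    print(f"{seq_len:>7} tokens: {size / 2**30:.0f} GiB")
```

Under these assumptions each token costs 128 KiB, so a 131,072-token context needs 16 GiB of cache for a single sequence, and all of it has to be streamed out of memory to produce each new token.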

