The most expensive part of running a long-context transformer isn’t the attention — it’s that we built a memory system where the thing that computes and the thing that persists are the same mechanism, which is exactly backwards from how the only proven long-context memory system works.
I’ve been thinking about this a lot lately, partly because the costs of running long-context inference are genuinely surprising until you look at what’s actually happening. The KV-cache holds key-value projections for every token in the context window. It grows linearly with context length, sits in high-bandwidth memory, and gets read on every forward pass. It’s not a bug — it’s the design. But the design has a hidden assumption baked into it that I think is worth pulling apart.
The Assumption Nobody Talks About
When you extend a transformer’s context, you’re essentially asking the same circuit that does reasoning to also do persistence. The attention mechanism computes over the cache, but the cache itself is just… a flat list of slots. There’s no structure to it. No prioritization. No mechanism that says “this part is stable, stop touching it.” Every token is equally expensive to maintain and equally expensive to attend over.
Compare that to how biological memory actually works — and I mean the physical substrate, not the folk-psychology version. Synaptic weights change slowly, over hours and days, through protein synthesis. But working memory, the stuff that holds context across a conversation or a task, lives somewhere else entirely. Astrocytes regulate extracellular ion concentrations to modulate neuronal excitability. Dendritic spines change their geometry to temporarily strengthen specific pathways. There are maintenance processes running in parallel with computation that are specifically not the computation.
The brain decoupled persistence from processing. The KV-cache didn’t.
Why That Inversion Matters
In biological systems, the thing that persists context doesn’t have to be fast — it just has to be stable. The thing that computes has to be fast, but it doesn’t have to remember everything. These are different requirements, and evolution solved them with different hardware.
What we built instead is a single unified buffer that has to be both. It needs to be fast enough to serve the attention mechanism on every forward pass, which means it has to live in expensive memory. And because it’s the only persistence mechanism, it has to hold everything — you can’t evict tokens the way a real memory system might consolidate and compress. The result is that scaling context length means scaling cost in the worst possible way: linearly, uniformly, without any of the structure that would let you do something smarter.
There are people working on this. Sliding window attention, memory-augmented transformers, various retrieval-augmented approaches — these are all gestures toward decoupling. But most of them are still operating within the same basic frame: the context window is the memory, and the question is just how big to make it or how to select what goes in it.
The Inversion We Haven’t Tried
What I find myself wanting is something more structural. Not “retrieve relevant context before the forward pass” — that’s still treating retrieval as a preprocessing step for the same flat mechanism. What I’m describing is a memory system where persistence is handled by a different process entirely, one that runs asynchronously, that applies different retention logic to different parts of the context, and that serves the computation layer something more like a compressed, structured summary than a raw token sequence.
This is roughly what sleep does, by the way. Offline consolidation. The brain takes the day’s inputs and restructures them into something more durable and more queryable, and it does this while the expensive computation machinery is mostly offline. The KV-cache never gets to sleep. It just accumulates.
In my experience, the most expensive architectural mistakes aren’t the ones where you built the wrong thing — they’re the ones where you built the right thing with the wrong assumptions about which parts need to be the same. The KV-cache is fast and it works and it scales reasonably well to a point. But it’s bottlenecked by the assumption that memory and computation should be the same kind of thing, running at the same speed, on the same hardware, with the same access patterns.
The brain had millions of years to discover that this assumption was wrong. It’s not obvious that we need to rediscover it by scaling.
The Open Question
I don’t think the answer is just “add retrieval” or “compress the KV-cache better.” Those are optimizations within the existing frame. The more interesting question is whether there’s an architecture where something other than the attention mechanism is responsible for deciding what persists and how — something that runs on a different schedule, with different objectives, and doesn’t have to be consulted on every forward pass.
What would a transformer look like if memory consolidation were a first-class operation, not an afterthought bolted onto the context window?
I genuinely don’t know. But I think that’s the right question to be sitting with.
Sources
- Astrocytic resource diffusion stabilizes persistent activity in neural fields — arXiv · Quantitative Biology
- Overcoming Long-term Catastrophic Forgetting through Adversarial Neural Pruning and Synaptic Consolidation — arXiv
- Robust short-term memory without synaptic learning — arXiv
- Synaptic potentiation facilitates memory-like attractor dynamics in cultured in vitro hippocampal networks — arXiv
- On-device Synaptic Memory Consolidation using Fowler-Nordheim Quantum-tunneling — arXiv
- Metabolic constraints on synaptic learning and memory — arXiv
