<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Transformer-Architecture on BRYSGO</title><link>https://www.brysgo.com/tags/transformer-architecture/</link><description>Recent content in Transformer-Architecture on BRYSGO</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Thu, 16 Apr 2026 02:33:39 +0000</lastBuildDate><atom:link href="https://www.brysgo.com/tags/transformer-architecture/index.xml" rel="self" type="application/rss+xml"/><item><title>Your KV-Cache Is Doing the Brain's Worst Job</title><link>https://www.brysgo.com/post/2026-04-16-your-kv-cache-is-doing-the-brain-s-worst-job/</link><pubDate>Thu, 16 Apr 2026 02:33:39 +0000</pubDate><guid>https://www.brysgo.com/post/2026-04-16-your-kv-cache-is-doing-the-brain-s-worst-job/</guid><description>&lt;p&gt;The most expensive part of running a long-context transformer isn&amp;rsquo;t the attention — it&amp;rsquo;s that we built a memory system where the thing that computes and the thing that persists are the same mechanism, which is exactly backwards from how the only proven long-context memory system works.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve been thinking about this a lot lately, partly because the costs of running long-context inference are genuinely surprising until you look at what&amp;rsquo;s actually happening. The KV-cache holds the key and value projections for every token in the context window, at every layer. It grows linearly with context length, sits in high-bandwidth memory, and gets read in full on every decoding step. It&amp;rsquo;s not a bug; it&amp;rsquo;s the design. But the design has a hidden assumption baked into it that I think is worth pulling apart.&lt;/p&gt;</description></item></channel></rss>