The C64 Runs a Transformer and You're Still Paying $25/Million Tokens

June 4, 2026

quantization model-efficiency llm-inference edge-ai scaling-laws

A 1MHz computer from 1982 just generated a token using the same math as ChatGPT. It took 60 seconds. It has no FPU, no GPU, no cloud contract — just a MOS 6510 processor, 25KB of RAM, and a floppy disk. And somehow, that one slow, broken token is the most clarifying thing I’ve seen in AI this year.

The project is called Soul Player C64. It runs a real 2-layer decoder-only transformer on a Commodore 64 — multi-head causal self-attention, RMSNorm, feed-forward residuals, all of it, hand-written in 6502 assembly. The weights are INT8. The vocabulary is 128 tokens. The context window is 20 tokens. The output is, charitably, linguistically broken.
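For readers who want to see what those pieces compute, here is a minimal sketch of RMSNorm and single-head causal self-attention in plain Python. It is not the project's code: the C64 version is INT8 fixed-point assembly, while this uses floats for readability, omits the learned gain and the multi-head split, and the function names are my own.

```python
import math

def rms_norm(x, eps=1e-6):
    """Scale a vector so its root-mean-square is about 1 (no learned gain here)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(q, k, v):
    """Single-head causal self-attention.

    q, k, v are lists of per-position vectors. Position t only attends
    to positions 0..t, which is what makes the attention causal.
    """
    d = len(q[0])
    out = []
    for t in range(len(q)):
        # Scaled dot-product scores against the current and earlier positions only.
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        out.append([sum(w * v[s][i] for s, w in enumerate(weights))
                    for i in range(len(v[0]))])
    return out
```

The first position can only see itself, so its output is exactly its own value vector. Every later position mixes its own value with those before it.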

None of that matters. What matters is that the math fits. The transformer architecture, the same fundamental structure underlying every serious language model in production today, ran on hardware that predates the CD-ROM. The 6510 has no hardware multiplier, so the developer did every multiplication inside the matrix operations as shift-and-add. Division is restoring long division. The softmax required a fix to the bit-shift normalization just to give the attention mechanism enough dynamic range to see across positions.
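The two integer tricks named above are textbook algorithms, and a short Python sketch shows how little they need: shifts, adds, subtracts, and bit tests. This illustrates the general technique rather than the project's actual 6502 routines. It handles unsigned values only (sign handling for INT8 weights is left out), and the function names and the 16-bit default width are my assumptions.

```python
def shift_add_multiply(a, b):
    """Multiply two unsigned integers using only shifts, adds, and bit tests."""
    product = 0
    while b:
        if b & 1:          # lowest multiplier bit set?
            product += a   # add the currently shifted multiplicand
        a <<= 1            # multiplicand times two
        b >>= 1            # move on to the next multiplier bit
    return product

def restoring_divide(dividend, divisor, bits=16):
    """Unsigned restoring long division; returns (quotient, remainder).

    Assumes 0 <= dividend < 2**bits.
    """
    if divisor == 0:
        raise ZeroDivisionError("divisor must be nonzero")
    quotient = 0
    remainder = 0
    for i in range(bits - 1, -1, -1):
        # Bring down the next dividend bit, most significant first.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        remainder -= divisor          # trial subtraction
        if remainder < 0:
            remainder += divisor      # it did not fit: restore, quotient bit is 0
            quotient <<= 1
        else:
            quotient = (quotient << 1) | 1   # it fit: quotient bit is 1
    return quotient, remainder
```

For example, `shift_add_multiply(13, 11)` returns 143 and `restoring_divide(1000, 7)` returns `(142, 6)`. The multiply loop runs once per multiplier bit, which is part of why a matrix-vector product is slow at 1MHz.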

Sixty seconds per token. And it still proved the point.

We Were Optimizing the Wrong Variable

For five years, the prevailing logic was: more parameters, more intelligence. This wasn’t entirely wrong — scale unlocked real capabilities. But somewhere along the way, scale stopped being a means and became the thing itself. Bigger became synonymous with better, and the industry organized around that assumption. Infrastructure, pricing, benchmarks, hiring — all of it optimized around the parameter arms race.

The bill for that confusion is now coming due in three places simultaneously.

First: Alibaba’s Qwen3.6-27B, a dense 27-billion parameter model, just beat a 397-billion parameter Mixture-of-Experts model on rigorous agentic coding benchmarks. Not on some synthetic task, but on SWE-bench Verified, SWE-bench Pro, and Terminal-Bench. The 27B model matches Claude Opus on Terminal-Bench exactly (59.3%), while the 397B MoE trails at 52.5%. On SkillsBench, the gap is even more brutal: 48.2% versus 30.0%. A model nearly fifteen times smaller wins because it was built better: cleaner architecture, hybrid attention, better data. Not more of everything. More of the right things.

Second: Microsoft Research trained a 2-billion parameter model at 1.58 bits per weight (ternary values, just {-1, 0, +1}) that fits in under 700MB and matches full-precision peers on ARC-Challenge, GSM8K, and MMLU. At 70 billion parameters, that approach yields 8.9x higher throughput than FP16. Matrix multiplication, the most expensive operation in language model inference, reduces to additions and subtractions. The theoretical energy savings on that arithmetic are 71x. PrismML took this further with their Ternary Bonsai 8B: 1.75GB on disk, outperforming full-precision models that weigh 16GB. They introduced a metric called “Intelligence Density” to capture what’s actually happening: not raw score, but score per byte.
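To make the "multiplication disappears" claim concrete, here is a minimal sketch of a matrix-vector product with ternary weights. It illustrates the idea only and is not BitNet's kernel, which packs weights and runs on vectorized hardware. The function name and the toy matrix are made up for the example.

```python
import math

def ternary_matvec(weights, x):
    """Multiply a {-1, 0, +1} matrix by a vector using only add and subtract."""
    out = []
    for row in weights:
        acc = 0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # +1 weight: add the activation
            elif w == -1:
                acc -= xi      # -1 weight: subtract it
            # a 0 weight contributes nothing, so it is skipped
        out.append(acc)
    return out

# Three possible values carry log2(3) bits of information,
# which is where the "1.58 bits per weight" figure comes from.
BITS_PER_TERNARY_WEIGHT = math.log2(3)

W = [[1, 0, -1],
     [-1, 1, 1]]
x = [5, 7, 2]
y = ternary_matvec(W, x)   # [5 - 2, -5 + 7 + 2]
```

The inner loop never multiplies. Each weight only decides whether an activation is added, subtracted, or ignored.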

Third: the C64. Which is really just the logical endpoint of the same argument, taken to its absurd extreme.

What $25/Million Tokens Was Actually Buying You

I’ve been thinking about the pricing math. DeepSeek V4-Pro — a 1.6-trillion parameter model with a 1-million token context window — costs roughly $1.74 per million input tokens. Claude Opus 4.7 costs $25 per million output tokens. The capability gap on the absolute hardest agentic tasks is real; Opus leads on SWE-bench Pro 64.3% to 55.4%. But for the majority of production workloads — boilerplate generation, test scaffolding, bulk context analysis — you’ve been paying a 7x premium for headroom you weren’t using.
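Here is the pricing arithmetic as a short sketch, using only the two prices quoted above. The 50-million-token monthly volume is a made-up illustration, and the function and constant names are mine. One caveat: one quoted price is for input tokens and the other for output tokens, so the raw ratio is not a like-for-like comparison and need not match the 7x figure.

```python
def cost_usd(tokens, price_per_million):
    """Cost in dollars for a token count at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million

# Prices quoted in the text (USD per million tokens).
DEEPSEEK_V4_PRO_INPUT = 1.74
CLAUDE_OPUS_OUTPUT = 25.00

# Hypothetical monthly volume, for illustration only.
MONTHLY_TOKENS = 50_000_000

opus_bill = cost_usd(MONTHLY_TOKENS, CLAUDE_OPUS_OUTPUT)
deepseek_bill = cost_usd(MONTHLY_TOKENS, DEEPSEEK_V4_PRO_INPUT)

# Raw ratio of the two quoted prices (input vs output, so not like-for-like).
raw_ratio = CLAUDE_OPUS_OUTPUT / DEEPSEEK_V4_PRO_INPUT
```

At that volume the two bills are about $1,250 and $87, and the raw ratio is roughly 14x. A fair comparison would price input against input and output against output, using figures the text does not give.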

And that’s the cloud. A 27B dense model quantized to 4-bit runs on a single RTX 4090, using about 16.8GB of the card’s 24GB of VRAM. An 8B model quantized to 4-bit runs on a laptop with 8GB of RAM. A 2B-class model runs in a browser tab via WebGPU: Google’s Gemma 4 E2B variant already does this, with an open-source Chrome extension that runs a full autonomous agent loop client-side. No API keys, and no network requests beyond fetching the weights, which stream in layer by layer via HTTP Range requests.
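The hardware claims above follow from one line of arithmetic: parameters times bits per weight, divided by eight. The sketch below is a back-of-envelope estimate, not a measurement. It counts weights only. Treating the gap up to the quoted 16.8GB as runtime overhead (KV cache, activations, quantization scales) is my assumption, and the function names are hypothetical.

```python
def weight_footprint_gb(num_params, bits_per_weight):
    """Weights-only footprint in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

def fits(num_params, bits_per_weight, memory_gb, overhead_gb=0.0):
    """True if weights plus an assumed runtime overhead fit in memory_gb."""
    return weight_footprint_gb(num_params, bits_per_weight) + overhead_gb <= memory_gb

dense_27b_4bit = weight_footprint_gb(27e9, 4)   # weights alone for the 27B model
small_8b_4bit = weight_footprint_gb(8e9, 4)     # weights alone for the 8B model

# Assumption: the quoted 16.8GB includes runtime overhead on top of the weights.
implied_overhead_gb = 16.8 - dense_27b_4bit

on_rtx_4090 = fits(27e9, 4, memory_gb=24.0, overhead_gb=implied_overhead_gb)
fp16_on_rtx_4090 = fits(27e9, 16, memory_gb=24.0)
```

The 4-bit weights come to 13.5GB for the 27B model and 4GB for the 8B model. The same 27B model at FP16 would need 54GB for weights alone, so quantization is what puts it on a single consumer card.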

The infrastructure story was always going to collapse eventually. What’s surprising is that it’s collapsing from multiple directions at once — ternary quantization, dense architecture supremacy, edge deployment, and a hobbyist with a Commodore 64 all arriving at the same conclusion through different paths.

Intelligence Per Byte

In my experience, the clearest sign that an industry has been optimizing the wrong variable is when you find out something absurd was possible all along and nobody tried it. Nobody put a transformer on a C64 because the industry behaved as if parameters were cheap (they weren’t) and compute was abundant (it isn’t). The assumption was that you needed scale to get anywhere useful, so nobody asked how small you could actually go.

The answer, apparently, is 25KB and a floppy disk.

That doesn’t mean small models replace frontier ones. The capability ceiling on genuinely hard reasoning tasks is real, and a 27B model isn’t writing novel distributed systems architectures yet. But the frame has shifted permanently. The question isn’t “how big does the model need to be?” It’s “what’s the minimum intelligence density required for this specific task, and where can I run it?”

I don’t know what the right number turns out to be. But I suspect it’s much smaller than what you’re currently paying for.

You Could Run a Language Model on a Bucket of Water (And That Should Bother You)

April 19, 2026

llm-inference substrate-independence compute-efficiency ai-access hardware