This weekend: I did a local LLM setup and typed "hello" and watched the cursor blink. Sixty seconds later, one word came back.

This was supposed to be easy. I'd done the math everyone does first — count the parameters. A 13B model would push the base Mac Mini's 16GB of unified memory, so I dropped to qwen3.5:9b instead. 9.65B parameters, and it benchmarks close to Claude Haiku. Good enough.

(I was running it through Ollama. The download was 6.6GB — a Q4_K_M build I was sure was full precision. I didn't know what quantization was yet. More on that another day.)

The model fit. I felt clever. Then it answered "hello" at one token per second, with a full minute before the first one even showed up.

Here's what broke in my head.

For years the API had trained me to care about exactly one number: does the model fit in memory. Except it never even made me ask that. It just worked. Tokens streamed back fast, and somewhere along the way I'd quietly fused two ideas that have nothing to do with each other. Fits and fast. If the weights load, the speed is just the speed.

They are not the same thing. The API was solving two problems on my behalf that I never once had to look at.

The first is bandwidth: how quickly memory can feed the processor. A datacenter card moves data many times faster than a Mac Mini, and that gap is most of your tokens-per-second. Fitting is about space. Speed is about how fast you can read that space. Different question entirely.

The second one is what actually flattened me. The context window.

The model itself was only 6.6GB — it sat in 16GB of RAM with room to spare. But I'd cranked the context way up, because on the API a big context is a free lunch. You pay per token, sure, but you never feel it in your own memory. Locally it is the opposite of free. The context window claims memory of its own, and the moment that demand crossed what the Mac Mini had, macOS started spilling to the SSD. That spill — reading from disk instead of RAM — is the one token per second. The model never overran the machine. The context I'd reserved did.

That's the whole thing the API hid. It was always doing physical work — moving bytes, parking your context somewhere real — and you only ever saw the bill in dollars. Never in seconds, never in gigabytes. Go local and you inherit all of it at once.

When you run a model locally, watch two numbers, not one. Whether it loads (that's memory) and your tokens/sec (that's speed). They fail independently, and most people only check the first. And stop setting your context window to the maximum out of reflex. Set it to what the task needs — a model that fits can still bring the machine to its knees on context alone.

So a question for you: when you call an API, how much of what you're actually paying for could you even name? Most of us can't. I couldn't.

This is post 1 of a series pulling apart everything the API was quietly doing for you — one piece at a time, no prerequisites, in the order I wish I'd learned it. Next up: why your machine has two kinds of memory, and how that one fact explains almost every "why is this so slow."

Subscribe and I'll send each one as it lands.

The Mac Mini runs fine now. I just stopped asking it the wrong question.

Keep Reading