VRAM vs RAM: the number that actually runs your local LLM

A few days after I had the Mac Mini running models, I found the setting half the internet argues about: n_gpu_layers. On a gaming PC it's the make-or-break dial. I'd never once touched it.

That gap is where the whole thing finally clicked.

n_gpu_layers is exactly what it sounds like: a number you set when loading the model that says how many of its layers go on the GPU. On a normal PC, that's a rationing decision between two pools of memory. The fast one is VRAM: soldered onto the graphics card, where the GPU reaches it at full speed. The slow one is your system RAM, walled off behind the PCIe bus. Set the number too high and the model overflows the card. Too low and the GPU sits idle. People report throughput collapsing over a swing of ten layers.
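To make the rationing concrete, here's a back-of-envelope sketch of how you might pick a first value for the dial. It's my own rough arithmetic, not how any loader actually decides: the function name, the equal-sized-layers assumption, and the 1GB reserve are all assumptions for illustration.

```python
import math

def estimate_gpu_layers(model_size_gb, n_layers, vram_gb, reserve_gb=1.0):
    """Rough guess at how many layers fit on the card.

    Assumes every layer is the same size and that reserve_gb of VRAM
    is held back for the context and runtime overhead. Both are
    simplifications, so treat the result as a starting point only.
    """
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    # Layers that fit = usable memory / per-layer size, rounded down.
    fit = math.floor(usable_gb * n_layers / model_size_gb)
    # Never ask for more layers than the model has.
    return min(fit, n_layers)

# A hypothetical 16GB model with 40 layers on an 8GB card:
print(estimate_gpu_layers(16.0, 40, 8.0))  # 17
# A hypothetical 4GB model with 32 layers fits whole on the same card:
print(estimate_gpu_layers(4.0, 32, 8.0))   # 32
```

Whatever number comes out is only a first try for n_gpu_layers: nudge it down if the load fails, up if the card has room left.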

A model runs on the GPU. So it has to sit in memory the GPU can reach fast. On a PC that's VRAM, and VRAM is small: 8GB on a normal card, 24GB if you really paid up. Your 32GB of system RAM might as well be in another building. The model lives or dies on the GPU's number, not the one on the box.

So why had I never touched the dial?

Because the Mac doesn't have two pools. Unified memory means the CPU and GPU share one block of physical RAM — no PCIe bus to cross, no card to overflow. The rationing decision the PC crowd agonizes over doesn't exist here. There's nothing to split.

That's the part almost nobody coming from a spec sheet understands. A 16GB Mac Mini can hand most of that memory straight to the model. A 16GB gaming PC can't: the model's stuck with whatever the graphics card carries, maybe 8GB of VRAM. Same RAM number on the box. Completely different ceiling.

Though "most of" is doing real work in that sentence. macOS reserves a slice of unified memory for the system and caps what the GPU can take. On my 16GB machine that's roughly 12GB for the model, not the full 16. The number on the box lies even on the Mac — just less.
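Here's that comparison as a few lines of arithmetic. It's a sketch under stated assumptions: the function name is made up, and the 0.75 share is simply my 12-of-16 figure turned into a fraction. The real cap depends on the machine and the macOS version, and can come out lower.

```python
def model_budget_gb(ram_gb, unified, vram_gb=0.0, gpu_share=0.75):
    """Fast memory a model can realistically use, in GB.

    unified=True  -> Mac-style: one pool, the GPU gets a share of it.
    unified=False -> PC-style: only the card's VRAM counts.
    gpu_share is an assumed fraction, not a documented constant.
    """
    if unified:
        return ram_gb * gpu_share
    return vram_gb

mac = model_budget_gb(16, unified=True)
pc = model_budget_gb(16, unified=False, vram_gb=8)

print(f"16GB Mac: about {mac:.0f}GB for the model")
print(f"16GB PC with an 8GB card: {pc:.0f}GB for the model")
```

Same 16 on both boxes going in, two different budgets coming out.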

(And the context window? Its memory, the KV cache, has to live in that same fast pool, next to the weights. Which is exactly how I brought the machine to its knees last time. That cache earns its own post.)

People still shop for RAM. I get it. It's the number that's mattered for years, for everything else you run. For local models it's the wrong line on the spec sheet, and unified memory quietly changed what the right one even means.

Before you buy or rent anything to run a model, skip the RAM headline and find the VRAM. On a PC that's the GPU's number, and it's smaller than you'd hope. On a Mac, unified memory is your VRAM, minus the cut macOS keeps, so budget for roughly two-thirds to three-quarters of it depending on the machine. Size the model to that number, not the one on the box.
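If you want to sanity-check a model against that number before downloading it, a rough sketch looks like this. The assumptions are loud ones: weights are estimated as parameter count times bits per weight, the 1.5GB headroom for context and runtime overhead is a guess, and both function names are hypothetical. Real model files carry extra overhead, so read the result as optimistic.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate size of the weights alone, in GB.

    One billion parameters at 8 bits each is about 1GB, so scale from there.
    """
    return params_billion * bits_per_weight / 8

def fits(params_billion, bits_per_weight, budget_gb, headroom_gb=1.5):
    """True if the weights plus an assumed headroom fit in the fast-memory budget."""
    return weights_gb(params_billion, bits_per_weight) + headroom_gb <= budget_gb

budget_gb = 12.0  # e.g. the rough figure for a 16GB Mac from earlier in this post

for label, params, bits in [("7B at 4-bit", 7, 4), ("13B at 8-bit", 13, 8)]:
    verdict = "fits" if fits(params, bits, budget_gb) else "does not fit"
    print(f"{label}: ~{weights_gb(params, bits):.1f}GB of weights, {verdict} in {budget_gb:.0f}GB")
```

Swap in your own budget: the card's VRAM on a PC, or your share of unified memory on a Mac.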

So: do you know the VRAM number on the machine you're reading this on? Not the RAM. The VRAM. Most people have never looked. I hadn't.

This is post 2 of a series pulling apart everything the API quietly handled for you. Next: the KV cache — what your context window actually costs in gigabytes, and why it grows as you type.

Subscribe and I'll send each one as it lands.

I never touched that dial on the Mac. Turns out the reason was the whole lesson.

For running LLMs, your RAM number is a lie. Look at the VRAM.
