Last post I said n_gpu_layers decides how many of a model's layers go on the GPU. Fair thing to throw back at me: how do you put part of a model anywhere? You can't load half a brain.

Turns out you can. Because a model isn't one object. It's a stack.

A large language model is built from a pile of near-identical blocks, stacked floor on floor. Each block is called a layer. Your tokens enter at the bottom, pass up through every floor in order, and the answer comes out the top. One block at a time, always the same path.

The number of floors is fixed per model, and it's smaller than you'd guess. A 7B model in the Llama family stacks 32 of them. The 70B stacks 80. Bigger models mostly just add floors (and make each one wider). That single integer (call it the depth) is one of the few numbers that actually defines the architecture.

So what's on each floor?

Two things, every time. An attention step, where each token looks at the others and decides what's relevant right now. Then a feed-forward step: a small dense network that reworks each token on its own. Attention mixes information across the sentence; the feed-forward part does the private, per-token thinking. Same two-part shape on all 32 floors, or all 80, with different learned weights and identical structure.

That sameness is the part to hold onto. The model isn't a sprawling custom machine. It's one block design, photocopied dozens of times, each copy carrying its own numbers.
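If it helps to see the photocopying, here is a toy sketch in plain Python. It's not a real transformer: the `Block` class, the single `weight` number, and the averaging "attention" are all stand-ins I made up for illustration, and real blocks also carry normalization and residual connections that I've left out. What it does show is the shape: one block design, 32 copies, tokens passing through in order.

```python
from dataclasses import dataclass
from typing import List

Token = List[float]  # one token's vector of numbers


@dataclass
class Block:
    """One floor. Same structure in every copy; only `weight` differs."""

    weight: float  # hypothetical stand-in for the block's learned parameters

    def attention(self, tokens: List[Token]) -> List[Token]:
        # Toy mixing step: blend each token with the average of all tokens,
        # so information moves across the sentence.
        dim = len(tokens[0])
        mean = [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]
        return [[(x + m) / 2 for x, m in zip(t, mean)] for t in tokens]

    def feed_forward(self, token: Token) -> Token:
        # Toy per-token step: each token is reworked on its own.
        return [x * self.weight for x in token]

    def __call__(self, tokens: List[Token]) -> List[Token]:
        mixed = self.attention(tokens)
        return [self.feed_forward(t) for t in mixed]


def run_stack(blocks: List[Block], tokens: List[Token]) -> List[Token]:
    # Bottom floor first, every floor in order, same path every time.
    for block in blocks:
        tokens = block(tokens)
    return tokens


# 32 floors, as in a 7B Llama-family model: one design, different numbers.
stack = [Block(weight=1.0 + 0.01 * i) for i in range(32)]
out = run_stack(stack, [[1.0, 0.0], [0.0, 1.0]])
print(len(stack), "floors,", len(out), "tokens out")
```

The model here is nothing more than a Python list, which is the property the rest of this post leans on.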

And that's the answer to the question up top.

Because the stack is just an ordered list of blocks, you can park the first chunk of floors in one place and the rest somewhere else. Put floors 1 through 20 in fast GPU memory, leave 21 through 32 on the CPU. The model still runs: data climbs the GPU floors at full speed, then crosses to the slow ones for the remainder. That handoff is the offloading from last post, and n_gpu_layers is the number where you draw the line. Going past the floor count buys you nothing, because there's no floor 33 to offload. (llama.cpp-style tools generally accept a bigger number and read it as "all of them"; some also count the output layer as one extra, which is why a 32-floor model can report 33.)

That's why "half a model on the GPU" isn't loose talk. It's twenty floors here, twelve there.
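The split is simple enough to write down. This is a minimal sketch, assuming a helper I'm calling `split_layers` (a made-up name, not a function from any real library). Clamping values above the floor count and rejecting negative ones are choices I made for the sketch; real runtimes handle those cases in their own ways.

```python
from typing import List, Tuple


def split_layers(n_layers: int, n_gpu_layers: int) -> Tuple[List[int], List[int]]:
    """Return (gpu_floors, cpu_floors) as 1-based floor numbers.

    Hypothetical helper for illustration only.
    """
    if n_gpu_layers < 0:
        # Assumption: this sketch refuses negatives instead of giving
        # them a special meaning.
        raise ValueError("n_gpu_layers must be 0 or more in this sketch")
    on_gpu = min(n_gpu_layers, n_layers)  # no floor past the top one
    floors = list(range(1, n_layers + 1))
    return floors[:on_gpu], floors[on_gpu:]


gpu_floors, cpu_floors = split_layers(n_layers=32, n_gpu_layers=20)
print(len(gpu_floors), "floors on GPU,", len(cpu_floors), "on CPU")
# prints: 20 floors on GPU, 12 on CPU
```

Floors 1 through 20 land in the first list, 21 through 32 in the second, and asking for 40 on a 32-floor model just returns all 32 on the GPU side.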

Find your model's depth before you tune anything. It's in the model's config file (config.json on Hugging Face), usually as num_hidden_layers, and on most model cards. That integer is the ceiling for n_gpu_layers, and it's the unit a lot of memory and speed math quietly counts in. Once you can picture the stack, "offload some layers" stops being jargon and turns into something you can see.
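If you have the config file on disk, the check is a few lines. A minimal sketch, assuming a Hugging Face style config.json that uses the num_hidden_layers key; `read_depth` is my own name for the helper, and the temporary file is a stand-in so the snippet runs by itself. Some architectures use a different key, in which case this raises a KeyError instead of guessing.

```python
import json
import tempfile
from pathlib import Path


def read_depth(config_path: Path) -> int:
    """Return the layer count from a config.json-style file.

    Assumes the key is `num_hidden_layers`; other architectures may differ.
    """
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return int(config["num_hidden_layers"])


# Stand-in config so the example is self-contained. Point read_depth at
# your own model's config.json instead.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "config.json"
    sample.write_text(json.dumps({"num_hidden_layers": 32}), encoding="utf-8")
    depth = read_depth(sample)

print("floors:", depth)
```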

So: do you know how many floors the model you ran last is built from? Ten seconds to check now. I'd bet it's fewer than you think.

This is a sidebar to the series pulling apart what the API hid — the piece that makes "layers" mean something everywhere else it shows up. Back to the mainline next: the KV cache, and what your context window actually costs.

Subscribe and I'll send each one as it lands.

You can't load half a brain. You can load twenty floors of one.

What to Read Next

For running LLMs, your RAM number is a lie. Look at the VRAM.

A few days after I had the Mac Mini running models, I found the setting half the internet argues about: n_gpu_layers. On a gaming PC it's the make-or-break dial. I'd never once touched it.

The Agent Engineer • Yatharth Lakhera

What a layer in an LLM actually is
