Mixture-of-Experts (MoE), Explained: Why “Active Parameters” Decide What Runs on Your Machine

How does a 671B model run on a desktop while a 70B dense model chokes? The answer is Mixture-of-Experts, and the one number that actually predicts speed and memory. A plain-English guide with real benchmarks and the papers behind it.

Thomas Newkirk June 11, 2026 7 min read

Here is a puzzle that trips up almost everyone new to local AI: a 671-billion-parameter model can run at usable speeds on the right desktop, while a "smaller" 70B model feels sluggish on the same hardware. How? The answer is an architecture called Mixture-of-Experts (MoE), and once you understand the single number it hinges on, model names like "Qwen 35B-A3B" or "DeepSeek-V3 671B-A37B" suddenly tell you exactly what your machine can and can't do.

This is the plain-English version: what MoE is, the one number that predicts performance, the memory trap that catches buyers, and the research behind it. No assuming you've read the papers.

The old rule MoE broke

In a traditional "dense" model, every parameter fires for every token. A 70B dense model does 70 billion parameters' worth of math to produce each word. That makes quality and speed a straight trade-off: bigger means smarter and slower. For years that was the iron law of local LLMs.

Mixture-of-Experts breaks it. Instead of one giant network, an MoE model splits much of itself into many smaller sub-networks called experts. For each token, a small router network picks just a few experts to run; the rest sit idle. So the model can be enormous in total, but only a slice of it does work at any moment. You get the knowledge of a huge model at the compute cost of a small one.

The one number that matters: active vs total parameters

Every MoE model has two parameter counts, and confusing them is the single most common mistake:

Total parameters: every expert added up, plus the layers all tokens share (such as attention). This determines how much memory the model needs.
Active (or "activated") parameters: what actually runs per token. This determines speed and compute.

The naming convention you'll see encodes both. "A3B" means 3B active; the number before it is the total. Some real examples:

Model	Total params	Active params	Runs like a…
Mixtral 8×7B	47B	13B (top-2 of 8 experts)	13B for speed, 47B for smarts
"35B-A3B" class	~35B	~3B	3B-fast, 35B-smart
DeepSeek-V3	671B	37B	37B for speed, 671B for smarts

Mixtral is the model that made this mainstream: per its technical report, each token routes to 2 of 8 experts, so it touches just 13B of its 47B parameters, yet it matches or beats Llama 2 70B and GPT-3.5. DeepSeek-V3 pushes the idea to the frontier: 671B total, 37B active. It thinks like a 671B model but computes like a 37B one.
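To make the naming convention concrete, here is a minimal sketch of reading a label like "35B-A3B" in code. The function name `parse_moe_label` and the regular expression are illustrative assumptions, not part of any tool; the sketch only handles names that follow the "total, then A plus active" pattern described above.

```python
import re

# Matches labels like "35B-A3B" or "671B-A37B": total first, then "A" + active.
# ASSUMPTION: this pattern is our own; vendors are free to name models differently.
_LABEL = re.compile(r"(\d+(?:\.\d+)?)B-A(\d+(?:\.\d+)?)B", re.IGNORECASE)


def parse_moe_label(name: str) -> tuple[float, float]:
    """Return (total_billions, active_billions) from a label such as '35B-A3B'."""
    match = _LABEL.search(name)
    if match is None:
        raise ValueError(f"no 'NNB-ANB' label found in {name!r}")
    return float(match.group(1)), float(match.group(2))


if __name__ == "__main__":
    for name in ("Qwen 35B-A3B", "DeepSeek-V3 671B-A37B"):
        total, active = parse_moe_label(name)
        print(f"{name}: {total:g}B total (memory), {active:g}B active (speed), "
              f"{active / total:.1%} of the weights used per token")
```

A name without the pattern, such as "Mixtral 8×7B", raises `ValueError` here, which is a reminder that older MoE names don't encode the active count at all.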

Under the hood: what an “expert” is

It’s tempting to picture experts as little specialists: one for code, one for French, one for math. That’s a myth. An “expert” is simply a feed-forward block (the dense number-crunching layer that sits after attention) with its own weights. An MoE swaps the single feed-forward block in each layer for many of them, and the router learns which combination to use. The specialization that emerges is statistical and messy, not human-readable.

Two details matter for understanding the behavior:

Attention usually stays dense. Only the feed-forward layers are split into experts; the attention mechanism still runs in full for every token. That’s part of why MoE quality doesn’t collapse despite the sparsity: the part of the model that mixes context together is untouched.
The router is tiny but decisive. A small gating network scores the experts per token and picks the top few. Train it badly and you get “dead” experts that never fire or hot experts that overload. That is the load-balancing problem that Switch Transformers and later DeepSeek-V3 spent real effort solving.

Newer designs add a twist: shared experts that run for every token (capturing common knowledge) alongside the routed experts that specialize, plus “fine-grained” experts (more, smaller experts for finer routing). That’s the DeepSeek recipe, and it’s why its active count (37B) buys more than the raw number suggests.
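The router described above can be sketched in a few lines. This is a simplified illustration in plain Python, not any model's actual implementation: it takes one token's router scores, keeps the top few experts, and normalizes their weights with a softmax over just the selected scores. The function names and the random scores are hypothetical, and shared experts are left out to keep it short.

```python
import math
import random


def route_token(router_logits: list[float], top_k: int) -> list[tuple[int, float]]:
    """Pick the top_k experts for one token and return (expert_index, weight) pairs.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    peak = max(router_logits[i] for i in chosen)  # subtract for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def expert_load(all_logits: list[list[float]], top_k: int) -> list[int]:
    """Count how many tokens each expert receives; a zero marks a 'dead' expert."""
    counts = [0] * len(all_logits[0])
    for logits in all_logits:
        for index, _weight in route_token(logits, top_k):
            counts[index] += 1
    return counts


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical router scores: 1,000 tokens, 8 experts, top-2 routing.
    tokens = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(1000)]
    print(route_token(tokens[0], top_k=2))
    print(expert_load(tokens, top_k=2))
```

The `expert_load` counts are where the load-balancing problem shows up: with a badly trained router, a few entries would dominate and others would sit at zero.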

The memory trap: MoE saves compute, NOT memory

This is the part that catches buyers, so read it twice. Active parameters set your speed. Total parameters set your memory. The router only runs a few experts per token, but it could pick any of them next token, so all the experts must be loaded in memory (ideally fast memory), ready. You don't get to store only the active slice.

Concretely:

A ~35B-total MoE at 4-bit needs roughly 18–20 GB just to hold the weights. That is the same as a 35B dense model, even though only ~3B are active. The memory bill is set by the total.
DeepSeek-V3's 671B, even quantized to 4-bit, wants ~380 GB: server-and-cluster territory, despite "only" 37B active. Fast to run if you can hold it; almost nobody can.
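The arithmetic behind those two figures is short enough to sketch. The helper below is a weights-only estimate under one stated assumption: a "4-bit" quant costs about 4.5 effective bits per weight once the scaling metadata stored alongside the weights is counted. That 4.5 figure and the name `weights_gb` are ours, and real quant formats vary.

```python
def weights_gb(total_params_billions: float, bits_per_weight: float) -> float:
    """Approximate weights-only footprint in GB (1 GB = 1e9 bytes).

    Note what is missing from the signature: the active parameter count.
    Memory depends on the total alone.
    """
    total_bytes = total_params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # ASSUMPTION: ~4.5 effective bits per weight for a "4-bit" quant.
    bits = 4.5
    for label, total_b in (("35B-total MoE", 35), ("DeepSeek-V3 (671B)", 671)):
        print(f"{label}: ~{weights_gb(total_b, bits):.0f} GB of weights")
```

Under that assumption the two cases come out near 20 GB and 377 GB, in line with the rounded figures above. Context and KV cache need headroom on top.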

This is exactly why the large-unified-memory box became the local-LLM darling. A 128 GB Strix Halo, Framework Desktop, or Mac Studio isn't about raw compute; it's about having enough fast memory to hold every expert of a big MoE, so the model's tiny active footprint can then rip through tokens. MoE is the software trend that makes high-capacity-memory hardware worth buying. (More on the boxes in our Unified-Memory AI guides and how much VRAM you actually need.)

What it looks like in the real world

The speed side of the bargain is dramatic, and owners notice immediately. In a popular r/LocalLLaMA thread titled "Qwen3.5-35B-A3B is a gamechanger for agentic coding" (u/jslominski), owners reported running a ~35B-total model at speeds you'd normally only see from a 3B model, because, for compute, it is a 3B model (below, "pp" is prompt processing and "tg" is text generation, both in tokens per second):

"On RTX 5060 Ti 16 GB + 32 GB RAM, I got 800 t/s pp and 35 t/s tg."
(a commenter in that thread, running a 35B-class MoE on a mid-range GPU)

"On M2 Max 64 GB MBP, I got 350 t/s pp and 27 t/s tg."
(another commenter, same thread)

A 35B dense model on that 16 GB card would crawl, if it fit at all. The MoE flies because only ~3B parameters fire per token, while whatever doesn't fit in the card's 16 GB of VRAM waits in the 32 GB of system RAM. That's the whole magic trick in one benchmark: memory holds the total, speed follows the active.
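One way to see how a 16 GB card can host a model whose weights exceed 16 GB is to work out the split by hand. The sketch below is a deliberate simplification: real engines place whole layers or expert tensors rather than slicing by gigabyte, and the 3 GB reserved for context and KV cache is a hypothetical figure, not a measurement.

```python
def split_weights(total_weights_gb: float, vram_gb: float,
                  vram_reserved_gb: float) -> tuple[float, float]:
    """Return (gb_on_gpu, gb_in_system_ram) for a weights-only estimate.

    vram_reserved_gb is VRAM held back for KV cache, context, and the
    runtime itself, so it is not available for weights.
    """
    usable = max(vram_gb - vram_reserved_gb, 0.0)
    on_gpu = min(total_weights_gb, usable)
    return on_gpu, total_weights_gb - on_gpu


if __name__ == "__main__":
    # ASSUMPTIONS: ~19.7 GB of 4-bit weights for a 35B-total model,
    # a 16 GB card, and 3 GB of VRAM reserved (hypothetical).
    on_gpu, in_ram = split_weights(19.7, vram_gb=16, vram_reserved_gb=3)
    print(f"GPU: {on_gpu:.1f} GB, system RAM: {in_ram:.1f} GB")
```

With those numbers, about 13 GB of weights sit on the GPU and roughly 6.7 GB spill into system RAM, which 32 GB covers comfortably.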

A 60-second history (with the receipts)

MoE isn't new. In its modern, sparsely gated form it's a 2017 idea that finally met the right moment:

2017: the origin. Shazeer et al., "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer", introduced the trainable router + experts design, showing >1000× more model capacity at minor extra compute.
2021: made practical. Fedus, Zoph & Shazeer's Switch Transformers simplified routing to a single expert per token, hit up to 7× faster pre-training, and scaled to a trillion parameters.
2023–24: mainstream. Mixtral 8×7B (released in December 2023, with its report following in January 2024) proved an open MoE could beat a much larger dense model, and the local community ran with it.
2024–25: the frontier. DeepSeek-V3 refined the recipe (fine-grained experts plus always-on "shared" experts, and load balancing without an auxiliary loss) to reach 671B/37B at a fraction of the training cost.

The trade-offs (because there's no free lunch)

MoE isn't strictly better than dense; it's a different set of compromises:

Lower quality per total parameter. A 47B MoE is generally not as capable as a hypothetical 47B dense model trained the same way would be, though it is well ahead of a 13B dense one. You're trading parameter efficiency for speed. The win is that the memory is often affordable while the compute is cheap.
It's memory-bound, not compute-bound. Performance leans on memory bandwidth and capacity, not raw TFLOPS, which is why Apple Silicon and unified-memory boxes punch above their weight on MoE, and why offloading idle experts to slower RAM works at all.
Routing can be uneven, and some users find MoEs weaker at tasks needing tight, consistent global reasoning. One r/LocalLLaMA benchmark thread flatly noted "MoEs struggle with strict global rules." Treat the architecture as a speed/capacity win, not a guaranteed quality win.
Quantization still applies on top. The total parameter count sets the memory bill; you then pick a quant (see our GGUF vs GPTQ vs AWQ guide) to shrink it until it fits in your box.
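The "memory-bound" point can be turned into a back-of-envelope ceiling. If generating a token means reading every active weight from memory once, then bandwidth divided by the bytes of active weights caps tokens per second. The sketch below assumes exactly that and nothing more; the 400 GB/s bandwidth and 4.5 bits per weight are illustrative inputs, and the function name is ours. Real throughput lands well below this ceiling.

```python
def decode_ceiling_tps(active_params_billions: float, bits_per_weight: float,
                       bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s if each token reads every active weight once."""
    gb_read_per_token = active_params_billions * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


if __name__ == "__main__":
    # ASSUMPTIONS: 400 GB/s of memory bandwidth and ~4.5 effective bits per weight.
    bandwidth, bits = 400.0, 4.5
    dense = decode_ceiling_tps(35, bits, bandwidth)  # dense: all 35B are active
    moe = decode_ceiling_tps(3, bits, bandwidth)     # MoE: only ~3B are active
    print(f"35B dense ceiling: {dense:.0f} t/s")
    print(f"35B-A3B ceiling:   {moe:.0f} t/s ({moe / dense:.1f}x)")
```

The absolute numbers matter less than the ratio: at the same bandwidth and quant, the ceiling scales with total-over-active, here 35/3, which is the whole speed argument for MoE in one division.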

The cheat sheet

Question	Look at…
Will it fit in my machine?	Total parameters (× bytes-per-weight for your quant)
How fast will it generate?	Active parameters
What does "35B-A3B" mean?	35B total (memory), 3B active (speed)
Why buy a 128 GB box?	To hold every expert of a big MoE
Dense vs MoE at the same size?	Dense = smarter per GB; MoE = faster per GB

The one-line rule to remember: buy memory for the total, expect speed from the active. Once you read model names that way, the whole local-LLM hardware market stops being mysterious: you can look at "671B-A37B" and instantly know it's a cluster model, or "30B-A3B" and know it'll fly on a mid-range box with enough RAM.

Dense or MoE: which should you download?

Both architectures are everywhere now, and the right choice comes down to which resource you’re short on.

Reach for an MoE when:

You have plenty of memory but modest compute: the classic unified-memory box, or a mid-range GPU paired with lots of system RAM. MoE turns spare memory into speed.
You want big-model knowledge at interactive speeds: agentic coding, long chat sessions, anything where tokens-per-second matters as much as raw smarts.
You can offload idle experts to CPU RAM. Because only the active experts move per token, engines like llama.cpp can park the rest in slower system memory with a smaller speed penalty than you’d expect, which is how a 16 GB GPU runs a 35B-total model at all.

Reach for a dense model when:

You’re memory-constrained and want maximum quality per gigabyte. A 14B dense model can be a better use of 10 GB than a 30B MoE you can barely fit.
Your task needs tight, consistent reasoning, where MoE routing variability has been a sore point for some users.
You’re on a single fast GPU whose compute, not memory, is the bottleneck; dense models use that compute fully.

A useful mental model: dense models are limited by compute; MoE models are limited by memory. Match the architecture to whichever you have to spare. If you bought a 128 GB box precisely so you’d never run out of memory, MoE is how you cash in that investment.

Sources & how we researched this

This explainer draws on the primary MoE literature for the architecture and the active/total parameter figures, which come straight from those papers: Shazeer et al. (2017), Switch Transformers (2021), the Mixtral report (2024), and the DeepSeek-V3 technical report (2024). The real-world speed numbers and the "gamechanger" framing are owner reports from r/LocalLLaMA, linked so you can verify; we have not benchmarked these machines first-hand. Memory estimates are weights-only approximations rounded for clarity (add headroom for context and KV cache).
