
Can I run it? — Local LLM hardware calculator

Pick a model, quant, and context length — get the real memory math and the hardware that can actually run it.

One step earlier: not sure you should buy hardware at all? Our cost calculator compares buying vs renting cloud GPUs vs just paying for an API, with break-even math for your usage.

Two ways to use it: leave "Your machine" empty to shop across everything we track, or pick the hardware you already own (or enter its memory) to get a personal verdict — including, when it doesn't fit, the exact quant, context, or KV-cache change that would make it fit.

How the estimate works

The tool uses the same math as our guides, shown in the open because transparency is the point of this site. A model's memory cost has three parts:

  • Weights — parameters × bits-per-weight ÷ 8. A 70B model at Q4_K_M (~4.8 bits/weight) is about 42 GB. Quantization choices are covered in our plain-English quantization guide.
  • KV cache — grows with every token of context. We assume a GQA-typical attention shape (grouped-query attention, where several query heads share each key/value head, which keeps the cache smaller) and an FP16 cache; the KV-precision selector in the tool shows exactly what a Q8 or Q4 cache saves. Full math in The KV cache, explained.
  • Overhead — a flat ~1.5 GB buffer for the runtime and activations.
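Here is a minimal Python sketch of the three-part estimate. The attention-shape defaults (80 layers, 8 KV heads, head dimension 128, roughly a Llama-3-70B-class GQA model) and the nominal per-element cache sizes are illustrative assumptions, not necessarily the exact values the tool uses. GB means 10^9 bytes, matching the 42 GB figure above.

```python
# Assumed bytes per KV-cache element for each cache precision.
# Nominal values; real Q8/Q4 cache formats add a small per-block scale overhead.
KV_BYTES = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}

OVERHEAD_GB = 1.5  # flat runtime/activation buffer


def weights_gb(params: float, bits_per_weight: float) -> float:
    """Weights: parameters x bits-per-weight / 8, in GB (1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


def kv_cache_gb(context: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, kv_precision: str = "fp16") -> float:
    """KV cache: one K and one V vector per layer, per KV head, per token.

    The default shape is an assumed 70B-class GQA model, for illustration.
    """
    per_token = 2 * layers * kv_heads * head_dim * KV_BYTES[kv_precision]
    return context * per_token / 1e9


def total_gb(params: float, bits_per_weight: float, context: int,
             kv_precision: str = "fp16") -> float:
    """Weights + KV cache + flat overhead."""
    return (weights_gb(params, bits_per_weight)
            + kv_cache_gb(context, kv_precision=kv_precision)
            + OVERHEAD_GB)


if __name__ == "__main__":
    print(f"weights: {weights_gb(70e9, 4.8):.1f} GB")
    for prec in ("fp16", "q8", "q4"):
        print(f"32K context, {prec} cache: {total_gb(70e9, 4.8, 32_768, prec):.1f} GB")
```

Under these assumptions, a 32K-token context adds about 10.7 GB with an FP16 cache, about 5.4 GB at Q8, and about 2.7 GB at Q4, on top of the 42 GB of weights and the 1.5 GB buffer.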

For Mixture-of-Experts models, memory follows total parameters but speed follows active parameters — that's why a 120B MoE can be fast on a box that would crawl on a dense 70B. The one-line rule: buy memory for the total, expect speed from the active (MoE, explained).
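The sketch below illustrates the total-versus-active split. It uses a hypothetical MoE with 120B total and 5B active parameters (illustrative numbers, not a specific release) alongside a dense 70B, both at an assumed ~4.8 bits/weight. Memory is sized from total parameters, while the bytes read for each generated token come from active parameters.

```python
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    total_params: float   # every expert must be resident in memory
    active_params: float  # parameters actually read for each generated token


def weights_gb(params: float, bits_per_weight: float) -> float:
    """parameters x bits-per-weight / 8, in GB (1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


BPW = 4.8  # assumed Q4_K_M-like quantization for both models

models = [
    Model("dense 70B", total_params=70e9, active_params=70e9),
    # Hypothetical MoE: illustrative numbers, not a specific model.
    Model("MoE 120B (5B active)", total_params=120e9, active_params=5e9),
]

if __name__ == "__main__":
    for m in models:
        mem = weights_gb(m.total_params, BPW)         # what you must buy
        per_token = weights_gb(m.active_params, BPW)  # what each token streams
        print(f"{m.name}: {mem:.1f} GB resident, {per_token:.1f} GB read per token")
```

In this example the MoE needs about 72 GB of memory but reads only about 3 GB per token, roughly a fourteenth of the dense model's 42 GB. That is why it can generate much faster on the same memory bandwidth.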

The "gen ceiling" column is memory bandwidth ÷ bytes streamed per token, in tokens per second. It is a theoretical upper bound, because token generation is bandwidth-bound rather than compute-bound (why that is): every new token requires reading the active weights from memory. Real speeds come in below it.
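A minimal sketch of the ceiling calculation follows. It assumes that the bytes streamed per token are just the active weights, a simplification that ignores KV-cache reads, which grow with context. The 800 GB/s bandwidth is a hypothetical figure, not a specific product.

```python
def gen_ceiling_tps(bandwidth_gb_s: float, active_params: float,
                    bits_per_weight: float) -> float:
    """Theoretical max tokens/sec: bandwidth / bytes streamed per token.

    Assumes each token reads every active weight once and ignores
    KV-cache reads, so real throughput will come in lower.
    """
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    bw = 800.0  # hypothetical memory bandwidth in GB/s
    print(f"dense 70B: {gen_ceiling_tps(bw, 70e9, 4.8):.1f} tok/s ceiling")
    print(f"MoE, 5B active: {gen_ceiling_tps(bw, 5e9, 4.8):.1f} tok/s ceiling")
```

At 800 GB/s this gives roughly 19 tok/s for a dense 70B at ~4.8 bits/weight and roughly 267 tok/s for a model with 5B active parameters. Both numbers are ceilings, not predictions.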

Honest limits

These are estimates, not lab measurements. Real usage varies by runtime (llama.cpp vs vLLM vs MLX), KV-cache precision, batch settings, and model architecture. Unified-memory machines share RAM with the OS, so we subtract an 8 GB reserve; discrete GPUs lose ~1 GB to the desktop. When a result says "tight fit," believe it: landing within 10% of capacity means a longer context or a few background apps can push you over. Hardware listings follow our methodology; affiliate links never influence what appears or how it ranks.
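The verdict logic can be sketched using the reserves described above. This sketch assumes the 10% "tight fit" band is measured against usable memory after the reserve; the tool may draw that line differently, and the function names here are illustrative.

```python
UNIFIED_RESERVE_GB = 8.0   # OS share on unified-memory machines
DISCRETE_RESERVE_GB = 1.0  # desktop share on a discrete GPU
TIGHT_BAND = 0.10          # within 10% of usable capacity counts as a tight fit


def usable_gb(capacity_gb: float, unified: bool) -> float:
    """Capacity minus the OS/desktop reserve, never below zero."""
    reserve = UNIFIED_RESERVE_GB if unified else DISCRETE_RESERVE_GB
    return max(capacity_gb - reserve, 0.0)


def verdict(need_gb: float, capacity_gb: float, unified: bool) -> str:
    """Classify an estimated memory need against a machine's capacity."""
    usable = usable_gb(capacity_gb, unified)
    if need_gb > usable:
        return "doesn't fit"
    if need_gb > usable * (1 - TIGHT_BAND):
        return "tight fit"
    return "fits"


if __name__ == "__main__":
    need = 42.0 + 1.5 + 2.7  # hypothetical: weights + overhead + KV cache
    print(verdict(need, 64, unified=True))   # usable 56 GB
    print(verdict(need, 48, unified=False))  # usable 47 GB
    print(verdict(need, 24, unified=False))  # usable 23 GB
```

For a ~46 GB need, this prints "fits" on a 64 GB unified-memory machine, "tight fit" on a 48 GB discrete GPU (46.2 GB is within 10% of the 47 GB usable), and "doesn't fit" on a 24 GB card.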

