Quant Picker: Which GGUF File Should You Download?

Pick your model and your machine — get the exact quant to download, the file size, and how much context you'll have left.

The picker needs JavaScript. The short version: file size = parameters × bits-per-weight ÷ 8, and whatever memory is left after the file and overhead is your context budget. For most people Q4_K_M is the sweet spot; grab GGUFs from the community-trusted quantizers bartowski or unsloth. See the full method in our quantization guide, or use Can I run it? to find hardware that fits a model. The explainer below covers the rest.
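As a quick worked example of that formula, here is a minimal sketch. The function name and the 4.8 bits-per-weight figure are illustrative assumptions (real Q4_K_M files land near that value but vary by quantizer), and sizes are in decimal gigabytes.

```python
def file_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate GGUF file size in GB: parameters x bits-per-weight / 8."""
    # Billions of parameters times bits per weight gives gigabits;
    # dividing by 8 converts gigabits to gigabytes.
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    # Illustrative only: an 8B-parameter model at roughly 4.8 bits per weight.
    print(f"{file_size_gb(8, 4.8):.1f} GB")
```

The result is an estimate of the download size only; the memory the model needs at run time also includes overhead and the KV cache.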


How to read the table

The picker balances three things at once: quality (more bits = better), context (bigger files leave less room for the KV cache), and speed (bigger files stream slower). Tell it the context you need and it picks the highest-quality quant that fits, then shows the approximate tokens/sec each quant reaches on your machine, so you can trade quality for speed deliberately. It also points you to the community-trusted GGUF makers (bartowski, unsloth) and the smaller I-quants you'll see in their repos.
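To see why bigger files stream slower, a common rule of thumb (an assumption here, not necessarily the picker's own method) is that generating each token reads the whole weight file from memory once, so speed is capped at roughly memory bandwidth divided by file size. The function name and the sample numbers below are hypothetical.

```python
def approx_tokens_per_sec(bandwidth_gb_s: float, file_gb: float) -> float:
    """Rough upper bound on generation speed for a bandwidth-bound machine.

    Assumes every generated token streams the full weight file once,
    and ignores compute time, KV-cache reads, and prompt processing.
    """
    if file_gb <= 0:
        raise ValueError("file_gb must be positive")
    return bandwidth_gb_s / file_gb


if __name__ == "__main__":
    # Hypothetical machine with 100 GB/s of memory bandwidth.
    for name, size_gb in [("larger quant", 8.5), ("smaller quant", 4.8)]:
        print(f"{name}: ~{approx_tokens_per_sec(100, size_gb):.0f} tokens/sec")
```

Real throughput usually lands below this ceiling, but the ratio between two quants on the same machine tends to follow the ratio of their file sizes.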

Every GGUF model ships in multiple quantization levels: same model, different precision, different file size. The trade is simple: more bits = better quality = bigger file = less room left for context. This tool does the arithmetic for your exact machine: it computes the file size per quant, and whatever memory remains becomes your context budget (the KV cache consumes it token by token).
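The context-budget step can be sketched as follows. The overhead figure and the 131072 bytes of KV cache per token are illustrative assumptions that depend on the runtime and the model; the function name is hypothetical.

```python
def context_budget_tokens(
    memory_gb: float,
    file_gb: float,
    overhead_gb: float,
    kv_bytes_per_token: int,
) -> int:
    """Tokens of context that fit in the memory left after file and overhead."""
    free_bytes = (memory_gb - file_gb - overhead_gb) * 1e9
    if free_bytes <= 0:
        return 0  # the file plus overhead already exceeds available memory
    return int(free_bytes // kv_bytes_per_token)


if __name__ == "__main__":
    # Illustrative: 16 GB of memory, a 4.8 GB file, 1.5 GB of assumed overhead.
    tokens = context_budget_tokens(16, 4.8, 1.5, 131072)
    print(f"about {tokens:,} tokens of context")
```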

The recommendation logic is the community consensus from our quantization guide: take the highest quant that still leaves at least 8k tokens of context. Q6/Q5 are near-lossless, Q4_K_M is the sweet spot, and below Q3 quality falls off fast. If you're forced down there, you usually want a smaller model instead (a bigger model at Q4 beats a smaller one at Q8, but a Q2 of anything beats very little).
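A minimal sketch of that selection rule, assuming a hypothetical quant table with rough bits-per-weight values (real files differ by quantizer) and a caller-supplied KV-cache cost per token. It is not the picker's actual code.

```python
from typing import Optional

# Ordered from highest to lowest quality. Bits-per-weight values are
# rough illustrative figures, not measurements of any specific file.
QUANTS = [
    ("Q8_0", 8.5),
    ("Q6_K", 6.6),
    ("Q5_K_M", 5.7),
    ("Q4_K_M", 4.8),
    ("Q3_K_M", 3.9),
    ("Q2_K", 3.0),
]


def pick_quant(
    params_billions: float,
    memory_gb: float,
    overhead_gb: float,
    kv_bytes_per_token: int,
    min_context: int = 8192,
) -> Optional[str]:
    """Return the highest-quality quant that leaves at least min_context tokens."""
    for name, bits_per_weight in QUANTS:
        file_gb = params_billions * bits_per_weight / 8
        free_bytes = (memory_gb - file_gb - overhead_gb) * 1e9
        if free_bytes / kv_bytes_per_token >= min_context:
            return name
    return None  # nothing fits: consider a smaller model instead


if __name__ == "__main__":
    # Illustrative: 8B model, 8 GB of memory, 1 GB of assumed overhead.
    print(pick_quant(8, 8, 1.0, 131072))
```

Returning None rather than falling back to the lowest quant mirrors the advice above: when nothing leaves enough context, a smaller model is usually the better choice.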

Real limits

File sizes are computed from bits-per-weight, not scraped from Hugging Face; real files vary a little by quantizer and variant (K-quants vs I-quants, imatrix variants). The KV-cache math assumes a GQA-typical architecture; exotic models differ. And max context here is what fits in memory: models also have their own context limits, and quality at extreme context is its own story. Treat the numbers as a reliable guide, not a contract.
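For reference, the per-token KV-cache cost under grouped-query attention (GQA) is usually estimated as 2 (keys and values) × layers × KV heads × head dimension × bytes per element. The sketch below uses that standard estimate; the layer and head counts are illustrative values for a typical 8B-class GQA model, and whether the picker uses exactly this formula is an assumption.

```python
def kv_bytes_per_token(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_element: int = 2,  # 2 bytes assumes an fp16 KV cache
) -> int:
    """Bytes of KV cache stored per token of context."""
    # One key vector and one value vector per KV head, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_element


if __name__ == "__main__":
    # Illustrative GQA shape: 32 layers, 8 KV heads, head dimension 128.
    gqa = kv_bytes_per_token(32, 8, 128)
    # The same shape without GQA would keep all 32 heads in the cache.
    full = kv_bytes_per_token(32, 32, 128)
    print(gqa, full)
```

The gap between the two results shows why the GQA assumption matters: a model that caches every attention head spends several times more memory per token, so its real context budget is correspondingly smaller.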

The tool family

Shopping rather than downloading? Can I run it? finds hardware that fits a model. Wondering if you should buy hardware at all? The cost calculator compares buying vs renting vs the API.
