GPT-OSS-20B is OpenAI's smaller, efficient 21-billion-parameter MoE model (Apache-2.0) with a 128k native context, optimized for local deployment within 16 GB of memory. Because it is a Mixture-of-Experts model, its memory footprint is set by the full 21B parameters, but speed comes from the ~3.6B active per token, so it fits wherever its total size lands yet generates faster than a dense model of that size. For running it locally, the question is memory: whether it fits, and how fast it runs, comes down to your hardware and which quant you pick. Below is every machine we track, what fits, and the speeds owners report.
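The short version: you need about 16 GB of GPU memory to run GPT-OSS-20B comfortably at a 4-bit quant. A 16 GB unified-memory Mac does not make the cut in the matrix below, presumably because that pool is shared with the rest of the system.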
The full hardware-fit matrix
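Every machine we track, at 8k context with an f16 KV cache, showing the highest-quality GGUF quant that fits. The tok/s figure is a theoretical ceiling from memory bandwidth at the listed quant; real generation at that quant runs lower, which is why the owner-measured column matters. Owner figures may come from a different quant, context, or partial CPU offload than the matrix assumes, which is why a few exceed the ceiling or appear on machines marked as not fitting. 18 of the 23 machines here can run GPT-OSS-20B.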
| Machine | Memory | Best quant (8k ctx) | Weights size | ~tok/s (ceiling) | Owners measure | Verdict |
|---|---|---|---|---|---|---|
| NVIDIA RTX 3060 12GB | 12 GB | – | – | – | – | Does not fit |
| NVIDIA RTX 4070 12GB | 12 GB | – | – | – | ~10 tok/s | Does not fit |
| NVIDIA RTX 5070 12GB | 12 GB | – | – | – | – | Does not fit |
| Intel Arc B580 12GB | 12 GB | – | – | – | ~34 tok/s | Does not fit |
| NVIDIA RTX 4060 Ti 16GB | 16 GB | IQ4_XS | ~11.2 GB | ~82 | ~57.8–63.2 tok/s | Runs well |
| NVIDIA RTX 4080 16GB | 16 GB | IQ4_XS | ~11.2 GB | ~204 | ~106–136 tok/s | Runs well |
| NVIDIA RTX 5080 16GB | 16 GB | IQ4_XS | ~11.2 GB | ~272 | ~149–172 tok/s | Runs well |
| NVIDIA RTX 5070 Ti 16GB | 16 GB | IQ4_XS | ~11.2 GB | ~254 | ~115–156 tok/s | Runs well |
| NVIDIA RTX 5060 Ti 16GB | 16 GB | IQ4_XS | ~11.2 GB | ~127 | ~73–92 tok/s | Runs well |
| AMD RX 9060 XT 16GB | 16 GB | IQ4_XS | ~11.2 GB | ~91 | – | Runs well |
| Mac, 16GB unified (M-series) | 16 GB (unified) | – | – | – | – | Does not fit |
| NVIDIA RTX 3090 24GB | 24 GB | Q6_K | ~17.3 GB | ~204 | ~161 tok/s | Runs well |
| NVIDIA RTX 4090 24GB | 24 GB | Q6_K | ~17.3 GB | ~220 | ~225 tok/s | Runs well |
| AMD RX 7900 XTX 24GB | 24 GB | Q6_K | ~17.3 GB | ~210 | ~100 tok/s | Runs well |
| NVIDIA RTX 5090 32GB | 32 GB | Q8_0 | ~22.3 GB | ~330 | ~282 tok/s | Runs well |
| Mac, 32GB unified (Pro-class) | 32 GB (unified) | Q6_K | ~17.3 GB | ~60 | ~92 tok/s | Runs well |
| CPU-only PC, 32GB DDR5 | 32 GB (system RAM) | Q6_K | ~17.3 GB | ~20 | – | Usable |
| 2× RTX 3090 (48GB) | 48 GB | FP16 | ~42 GB | ~106 | – | Runs well |
| Mac, 64GB unified (Max-class) | 64 GB (unified) | FP16 | ~42 GB | ~62 | ~75 tok/s | Runs well |
| CPU-only PC, 64GB DDR5 | 64 GB (system RAM) | FP16 | ~42 GB | ~10 | – | Usable |
| Mac, 128GB unified (Max-class) | 128 GB (unified) | FP16 | ~42 GB | ~62 | – | Runs well |
| Strix Halo box, 128GB unified | 128 GB (unified) | FP16 | ~42 GB | ~29 | ~46.2–49.9 tok/s | Runs well |
| Mac Studio, 512GB unified (Ultra) | 512 GB (unified) | FP16 | ~42 GB | ~93 | ~115–116 tok/s | Runs well |
Weights size is the model file at the listed quant; the KV cache and about 1.5 GB of overhead are already factored into the fit. Change the context or KV precision in the calculator to see how it shifts.
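The fit rule described above (weights plus KV cache plus overhead must stay within memory) can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual engine: the quant sizes are copied from the table, the 1.5 GB overhead comes from the text, and the `kv_cache_gb` default of 0.5 GB is a placeholder assumption. It also ignores the memory that unified-memory systems reserve for the OS, so it will not reproduce the Mac rows.

```python
# Weights sizes in GB, copied from the table above, highest quality first.
QUANT_LADDER = [
    ("FP16", 42.0),
    ("Q8_0", 22.3),
    ("Q6_K", 17.3),
    ("IQ4_XS", 11.2),
]

OVERHEAD_GB = 1.5  # runtime overhead quoted in the text


def best_quant(memory_gb, kv_cache_gb=0.5):
    """Return the highest-quality quant whose total footprint fits, or None.

    kv_cache_gb is an assumed placeholder, not a measured value.
    """
    for name, weights_gb in QUANT_LADDER:
        if weights_gb + kv_cache_gb + OVERHEAD_GB <= memory_gb:
            return name
    return None


if __name__ == "__main__":
    for memory_gb in (12, 16, 24, 32, 48):
        print(f"{memory_gb} GB -> {best_quant(memory_gb)}")
```

Raising `kv_cache_gb` (longer context or a higher-precision cache) pushes a machine down the ladder, which is the shift the calculator exposes.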
The cheapest box that runs it
The cheapest catalogued, buyable machine that runs GPT-OSS-20B at Q4 is the AMD RX 7900 XTX 24GB (about $849). Check it against your needs, and weigh buying versus renting in the cost calculator.
What owners measure
Theoretical ceilings assume perfect efficiency; real generation runs lower. Reported numbers for this model class, each linked to its source:
- Intel Arc B580 12GB: ~34 tok/s (source)
- NVIDIA RTX 4080 16GB: ~106–136 tok/s (source)
- NVIDIA RTX 5080 16GB: ~149–172 tok/s (source)
- NVIDIA RTX 5070 Ti 16GB: ~115–156 tok/s (source)
- NVIDIA RTX 5060 Ti 16GB: ~73–92 tok/s (source)
- NVIDIA RTX 3090 24GB: ~161 tok/s (source)
- NVIDIA RTX 4090 24GB: ~225 tok/s (source)
- AMD RX 7900 XTX 24GB: ~100 tok/s (source)
- NVIDIA RTX 5090 32GB: ~282 tok/s (source)
- Mac, 32GB unified (Pro-class): ~92 tok/s (source)
- Mac, 64GB unified (Max-class): ~75 tok/s (source)
- Mac Studio, 512GB unified (Ultra): ~115–116 tok/s (source)
How this is calculated
Memory math, no magic. The model file is parameters × bits-per-weight / 8 bytes, so this model at Q4_K_M (about 4.8 bits per weight) is 21B × 4.8 / 8, or roughly 12.6 GB. On top of that you need the KV cache plus about 1.5 GB of overhead. Generation speed is bound by memory bandwidth: the card re-reads the active weights for every token, so tok/s is roughly bandwidth / bytes-per-token. See the quantization guide, the VRAM guide, and prompt processing vs generation.
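As a worked example of the two formulas above, here is a small Python sketch. The parameter counts and the 4.8 bits-per-weight figure come from this page; the 1000 GB/s bandwidth is a hypothetical round number, not any specific card. The speed function counts only the active weights, so treat its result as an optimistic bound; the ceilings in the matrix may account for additional per-token reads and need not match it.

```python
TOTAL_PARAMS = 21e9    # full parameter count (sets the file size)
ACTIVE_PARAMS = 3.6e9  # approximate parameters read per token (sets the speed)


def weights_gb(params, bits_per_weight):
    """File size in GB: parameters x bits-per-weight / 8 bytes."""
    return params * bits_per_weight / 8 / 1e9


def ceiling_tok_s(bandwidth_gb_s, bytes_per_token_gb):
    """Bandwidth-bound ceiling: bandwidth divided by bytes read per token."""
    return bandwidth_gb_s / bytes_per_token_gb


if __name__ == "__main__":
    file_gb = weights_gb(TOTAL_PARAMS, 4.8)     # whole model at Q4_K_M
    active_gb = weights_gb(ACTIVE_PARAMS, 4.8)  # active weights per token
    print(f"file size: {file_gb:.1f} GB")
    print(f"active weights per token: {active_gb:.2f} GB")
    # Hypothetical 1000 GB/s of memory bandwidth, for illustration only.
    print(f"ceiling: {ceiling_tok_s(1000, active_gb):.0f} tok/s")
```

The gap between the two sizes is the MoE advantage in miniature: memory has to hold the whole file, but each token only pays for the active slice.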
Check your exact setup
This matrix uses 8k context. Your numbers shift with longer context, a different KV precision, or CPU offload. Plug in your exact machine:
- Can I Run It? tests GPT-OSS-20B against any hardware, including your own machine.
- Quant picker shows the full quant ladder so you can trade quality for context.
- Cost calculator weighs buying against renting a cloud GPU.
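To see why context length and KV precision move the numbers, the standard KV cache estimate can be sketched as follows. The layer count, KV-head count, and head dimension used here are illustrative placeholders rather than confirmed figures for GPT-OSS-20B, and the formula assumes every layer caches the full context, so it overestimates for models that use sliding-window attention on some layers.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_element=2):
    """Approximate KV cache size in GB.

    The leading 2 covers one key tensor and one value tensor per layer.
    bytes_per_element is 2 for an f16 cache and 1 for an 8-bit cache.
    """
    elements = 2 * n_layers * n_kv_heads * head_dim * context_tokens
    return elements * bytes_per_element / 1e9


if __name__ == "__main__":
    # Placeholder architecture values, assumed for illustration only.
    arch = dict(n_layers=24, n_kv_heads=8, head_dim=64)
    for context in (8192, 32768, 131072):
        f16 = kv_cache_gb(context_tokens=context, **arch)
        q8 = kv_cache_gb(context_tokens=context, bytes_per_element=1, **arch)
        print(f"{context:>6} tokens: f16 {f16:.2f} GB, 8-bit {q8:.2f} GB")
```

The cache grows linearly with context and halves when the element size halves, which is why a longer context can force a smaller weight quant on the same card.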
Sources and method
- Fit and theoretical tok/s: computed from each machine's memory and bandwidth with the same engine as our calculator.
- Owner-measured tok/s: aggregated from the cited public benchmarks; where the measured quant differs it is noted at the source.
- This page is a fit-and-speed reference, not first-hand testing by Vetted Consumer.
