Every local-AI guide starts with “just get a 24 GB card.” Most of us won’t. Here’s what actually fits on the GPU you have, and what to expect from it.
The one rule that matters
Local LLM inference is limited by memory, not compute. The first question is not “is my GPU fast enough?” but “does the model fit in VRAM?” If it fits, it will run at usable speed on almost any modern card. If it doesn’t, extra GPU compute won’t make up the difference.
A quick rule of thumb for 4-bit quantized models (the standard for local use):
VRAM needed ≈ parameters (B) × 0.6 GB, plus 1–2 GB for context.
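
The rule of thumb above is easy to turn into a small helper. This is only a sketch of that arithmetic: the function name and the 1.5 GB default (the midpoint of the 1–2 GB range) are our own choices for illustration, and real usage varies by runtime and quantization format.

```python
def estimate_vram_gb(params_b: float, context_gb: float = 1.5) -> float:
    """Rule-of-thumb VRAM estimate for a 4-bit quantized model.

    params_b   -- parameter count in billions
    context_gb -- allowance for context (the text suggests 1-2 GB;
                  1.5 is an assumed midpoint, not a measured value)
    """
    gb_per_billion_params = 0.6  # factor from the rule of thumb
    return params_b * gb_per_billion_params + context_gb


# Print the estimate for a few common model sizes.
for size in (3, 8, 14, 32):
    print(f"{size}B -> ~{estimate_vram_gb(size):.1f} GB")
```

By this estimate an 8B model lands a little over 6 GB, which is why it sits in the 8 GB tier rather than the 4 GB one.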
What fits where
| Your VRAM | Models that fit (Q4) | What it feels like |
|---|---|---|
| 4 GB | 1B–3B (Llama 3.2 3B; Gemma 4B is a tight fit) | Fast replies, simple tasks, weak reasoning |
| 8 GB | 7B–8B (Qwen 7B, Llama 8B) | The sweet spot for chat and light coding |
| 12 GB | up to 13B–14B | Noticeably better instruction-following |
| 16 GB | up to ~20B, or 14B with long context | Comfortable daily-driver territory |
| 24 GB | 27B–32B (Qwen coder-class models) | Local models that rival cloud for many tasks |
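
To sanity-check a row for your own card, you can run the rule of thumb backwards: subtract headroom from your VRAM and divide by 0.6. A rough sketch follows. The function name and the 2 GB default headroom are assumptions for illustration, and the result is a ceiling, not a promise.

```python
def max_params_b(vram_gb: float, headroom_gb: float = 2.0) -> float:
    """Largest Q4 model size (billions of parameters) the rule of thumb allows.

    headroom_gb is the VRAM reserved for context and overhead; 2.0 is the
    conservative end of the 1-2 GB range and is an assumption.
    """
    gb_per_billion_params = 0.6  # same factor as the rule of thumb
    usable_gb = max(vram_gb - headroom_gb, 0.0)  # never go negative
    return usable_gb / gb_per_billion_params


# Print the ceiling for each VRAM tier in the table.
for vram in (4, 8, 12, 16, 24):
    print(f"{vram} GB -> up to ~{max_params_b(vram):.0f}B at Q4")
```

Lower `headroom_gb` toward 1.0 if you only use short contexts, and raise it if you plan on long ones.
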
Two practical notes from our own runs:
- Context eats VRAM too. A 7B model that fits fine at 4K context can overflow an 8 GB card at 32K. If you hit out-of-memory errors with a model that “should fit,” shrink the context window first.
- Partial CPU offload works, but it’s slow. Most runtimes can spill layers to system RAM. Expect a hard drop in tokens/second the moment you cross the VRAM line — fine for batch jobs, painful for chat.
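
To see why context matters so much, it helps to estimate the KV cache, the per-token memory a transformer keeps for attention. The sketch below uses the standard size formula (keys plus values, per layer, per KV head). The model shapes are hypothetical 7B-class numbers chosen for illustration, and an fp16 cache is assumed; check your model's config for the real values.

```python
def kv_cache_gib(tokens: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_value: int = 2) -> float:
    """Approximate KV-cache size in GiB for a single sequence.

    Each token stores one key and one value vector (the factor 2)
    per layer and per KV head. bytes_per_value=2 assumes an fp16 cache.
    """
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return tokens * bytes_per_token / 1024**3


# Hypothetical 7B-class shapes (assumed): 32 layers, head_dim 128,
# with full multi-head attention (32 KV heads) or grouped-query attention (8).
for label, kv_heads in (("32 KV heads", 32), ("8 KV heads (GQA)", 8)):
    for ctx in (4096, 32768):
        size = kv_cache_gib(ctx, n_layers=32, n_kv_heads=kv_heads, head_dim=128)
        print(f"{label}, {ctx} tokens: {size:.1f} GiB")
```

Under these assumptions the cache grows eightfold between 4K and 32K context. Even the grouped-query variant reaches about 4 GiB at 32K, which on top of 4–5 GB of weights is enough to overflow an 8 GB card. Some runtimes can store the cache at lower precision, which shrinks these numbers.
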
If you’re buying used
In VRAM per euro, used cards beat new ones by a wide margin. Our standing recommendation: anything with 12 GB or more from the last two generations, prioritizing VRAM over GPU tier. A used 12 GB mid-range card will serve local LLMs better than a new 8 GB card from a higher tier.
No GPU at all? You can still run 7B-class models on CPU with enough RAM, or use a free cloud T4 (16 GB) to experiment before spending anything — that’s how half the guides on this site were originally tested.
Bottom line
Don’t start from the GPU you wish you had. Pick the largest Q4 model that fits your VRAM with 1–2 GB of headroom, and you’ll get most of the local-AI experience on hardware you already own.