How Much VRAM Do You Actually Need to Run LLMs Locally?

Every local-AI guide starts with “just get a 24 GB card.” Most of us won’t. Here’s what actually fits on the GPU you have, and what to expect from it.

The one rule that matters

Local LLM inference is memory-bound, not compute-bound. The question is never “is my GPU fast enough?” — it’s “does the model fit in VRAM?” If it fits, it will run at usable speed on almost any modern card. If it doesn’t, no amount of GPU power saves you.

A quick rule of thumb for 4-bit quantized models (the standard for local use):

VRAM needed ≈ parameters (B) × 0.6 GB, plus 1–2 GB for context.

What fits where

Your VRAM	Models that fit (Q4)	What it feels like
4 GB	1B–3B (Llama 3.2 3B, Gemma 4B)	Fast replies, simple tasks, weak reasoning
8 GB	7B–8B (Qwen 7B, Llama 8B)	The sweet spot for chat and light coding
12 GB	up to 13B–14B	Noticeably better instruction-following
16 GB	up to ~20B, or 14B with long context	Comfortable daily-driver territory
24 GB	27B–32B (Qwen coder-class models)	Local models that rival cloud for many tasks

Two practical notes from our own runs:

Context eats VRAM too. A 7B model that fits fine at 4K context can overflow an 8 GB card at 32K. If you hit out-of-memory errors with a model that “should fit,” shrink the context window first.
Partial CPU offload works, but it’s slow. Most runtimes can spill layers to system RAM. Expect a hard drop in tokens/second the moment you cross the VRAM line — fine for batch jobs, painful for chat.

If you’re buying used

Per euro of VRAM, used cards beat new ones by a wide margin. The cards we keep recommending: anything with 12 GB+ from the last two generations, prioritizing VRAM over GPU tier. A used 12 GB mid-range card will serve local LLMs better than a new 8 GB flagship.

No GPU at all? You can still run 7B-class models on CPU with enough RAM, or use a free cloud T4 (16 GB) to experiment before spending anything — that’s how half the guides on this site were originally tested.

Bottom line

Don’t start from the GPU you wish you had. Pick the largest Q4 model that fits your VRAM with 1–2 GB of headroom, and you’ll get 90% of the local-AI experience on hardware you already own.