Run Your First Local LLM with Ollama in 10 Minutes

This is the fastest path from nothing to a working local LLM. You don't need to set up a Python environment or wrestle with CUDA to start chatting: Ollama handles all of it.

1. Install Ollama

Download the installer from ollama.com (Windows, macOS, Linux). On Linux it’s one line:

curl -fsSL https://ollama.com/install.sh | sh

The installer detects your GPU automatically: NVIDIA via CUDA, AMD via ROCm, and Apple Silicon via Metal. On NVIDIA and AMD cards this assumes a working GPU driver is already installed.

2. Pick a model that fits

The biggest beginner mistake is pulling a model too large for your VRAM. Match the model to your card:

VRAM        Start with
4 GB        llama3.2:3b
8 GB        qwen2.5:7b
12–16 GB    qwen2.5:14b
24 GB       qwen2.5:32b
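
If you want to make this choice in a script, the table above can be sketched as a small lookup. The helper name pick_starter_model and the rule "take the largest tier that fits" are illustrative assumptions, not something Ollama provides:

```python
from typing import Optional

# Tiers copied from the table above: (minimum VRAM in GB, model tag).
# Ordered from largest to smallest so the first match is the biggest fit.
STARTER_MODELS = [
    (24, "qwen2.5:32b"),
    (12, "qwen2.5:14b"),
    (8, "qwen2.5:7b"),
    (4, "llama3.2:3b"),
]


def pick_starter_model(vram_gb: float) -> Optional[str]:
    """Return the largest starter model whose tier fits, or None below 4 GB."""
    for min_vram_gb, tag in STARTER_MODELS:
        if vram_gb >= min_vram_gb:
            return tag
    return None
```

With these tiers, a 6 GB card falls back to llama3.2:3b, and anything under 4 GB returns None rather than guessing.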

Then pull and run it:

ollama run qwen2.5:7b

The first run downloads the model (a 7B Q4 is roughly 4–5 GB). After that, it loads from disk in seconds and you’re chatting in your terminal.
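
The download size can be sanity-checked with simple arithmetic: a quantized weight file is roughly parameters times bits per weight, divided by 8. The sketch below is only an estimate. The helper name is hypothetical, and the 4.5 to 5 bits per weight used for a Q4-style quantization is an assumed range, since the exact value depends on the quantization scheme.

```python
def estimate_model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight-file size in GB (10^9 bytes).

    size = parameters * bits_per_weight / 8 bits per byte.
    Ignores file metadata and the extra memory used at runtime
    (for example the context cache), so treat it as a lower bound.
    """
    return params_billions * bits_per_weight / 8


# Assumed effective bits per weight for a 4-bit quantization (illustrative).
low = estimate_model_size_gb(7, 4.5)
high = estimate_model_size_gb(7, 5.0)
print(f"7B at ~4.5 to 5 bits per weight: about {low:.1f} to {high:.1f} GB")
```

Under these assumptions a model with exactly 7 billion parameters comes out at about 3.9 to 4.4 GB. Models labeled "7B" often have somewhat more parameters than that, which is consistent with the 4–5 GB figure above.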

3. Verify it’s on the GPU

If replies feel slow, check whether the model actually loaded into VRAM:

ollama ps

Look at the PROCESSOR column: 100% GPU is what you want. If you see a CPU percentage, the model didn’t fully fit in VRAM and part of it is running on the much slower CPU. Switch to a smaller model or a tighter quantization (tags in the style of :7b-instruct-q4_K_M let you pick one).
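
The same check can be done from a script. The sketch below assumes the local server's /api/ps endpoint returns a "models" list whose entries carry "size" and "size_vram" byte counts, and it derives the GPU share from their ratio. The function names are hypothetical, and it uses only the standard library so it runs without extra installs.

```python
import json
import urllib.request


def gpu_percent(model_entry: dict) -> float:
    """Percentage of a loaded model that is resident in VRAM.

    Assumes the entry has "size" (total bytes) and "size_vram" (bytes on GPU).
    Returns 0.0 if the size is missing or zero.
    """
    size = model_entry.get("size", 0)
    if not size:
        return 0.0
    return 100.0 * model_entry.get("size_vram", 0) / size


def report_loaded_models(base_url: str = "http://localhost:11434") -> None:
    """Print the GPU share for every model the server currently has loaded."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=10) as resp:
        data = json.load(resp)
    for entry in data.get("models", []):
        print(f"{entry.get('name', '?')}: {gpu_percent(entry):.0f}% GPU")
```

Call report_loaded_models() while a model is loaded. Anything below 100% means part of the model is sitting in system RAM.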

4. Use it from code

Ollama exposes a local API on port 11434, so any script can talk to it:

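import requests

# Send one non-streaming generation request to the local Ollama server.
r = requests.post("http://localhost:11434/api/generate", json={
    "model": "qwen2.5:7b",
    "prompt": "Explain VRAM in one paragraph.",
    "stream": False,  # return one JSON object instead of a stream of chunks
}, timeout=300)
r.raise_for_status()  # fail loudly on HTTP errors, e.g. a model that isn't pulled
print(r.json()["response"])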
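
For interactive use you will usually want the text as it is generated. With "stream" set to True, the endpoint sends newline-delimited JSON chunks, each carrying a "response" fragment and a "done" flag. The following is a minimal sketch under that assumption; the function names are hypothetical, and it uses the standard library's urllib instead of requests so it has no extra dependencies.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_tokens(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from newline-delimited JSON chunks.

    Stops after the chunk whose "done" field is true.
    """
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        yield chunk.get("response", "")
        if chunk.get("done"):
            break


def stream_generate(prompt: str, model: str = "qwen2.5:7b",
                    base_url: str = "http://localhost:11434") -> None:
    """Print a completion piece by piece as the server produces it."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # The HTTP response object iterates line by line, one chunk per line.
        for piece in iter_tokens(resp):
            print(piece, end="", flush=True)
    print()
```

Calling stream_generate("Explain VRAM in one paragraph.") prints the answer incrementally. Keeping the parsing in iter_tokens separate from the network call makes it easy to test without a running server.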

That’s the foundation for everything else on this site: once a local model answers on localhost, you can wire it into editors, scripts, voice assistants — anything.

Where to go next

The natural next step is giving your model documents to work with (local RAG) or a web interface. Both build directly on the setup you just finished.