Question 1

How much VRAM do I need to run an LLM?

Accepted Answer

As a rule of thumb, the VRAM needed in GB equals the number of parameters (in billions) times the bytes per weight, plus overhead for the KV cache and activations. A 7B model at 4-bit (0.5 bytes per weight) needs roughly 4–5 GB; the same model at full FP16 precision (2 bytes per weight) needs about 16 GB. A 70B model at 4-bit needs around 40–48 GB. Use the calculator above for your exact model and use case.
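
The rule of thumb can be sketched in a few lines of Python. This is a rough illustration, not the page's calculator: the function name `estimate_vram_gb` and the 20% default overhead are assumptions chosen so the results land inside the ranges quoted in this answer.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: int,
                     overhead: float = 0.2) -> float:
    """Rule-of-thumb VRAM estimate in GB.

    overhead is an ASSUMED fraction added on top of the weights
    (KV cache, activations, runtime buffers). Real overhead varies
    with context length and the inference engine.
    """
    bytes_per_weight = bits_per_weight / 8           # 4-bit -> 0.5, FP16 -> 2.0
    weights_gb = params_billion * bytes_per_weight   # billions of params * bytes = GB
    return weights_gb * (1 + overhead)


if __name__ == "__main__":
    for name, params, bits in [("7B @ 4-bit", 7, 4),
                               ("7B @ FP16", 7, 16),
                               ("70B @ 4-bit", 70, 4)]:
        print(f"{name}: ~{estimate_vram_gb(params, bits):.1f} GB")
```

Raising `overhead` toward 0.35 moves the estimates to the upper end of the quoted ranges, which is a reasonable adjustment for long contexts.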

Question 2

What GPU do I need for AI?

Accepted Answer

It depends on the model size and how you quantize it. Small models (3B–8B) at 4-bit run on a mainstream consumer GPU like an RTX 4060 8GB. Mid-size models (13B–32B) want a 24GB card such as an RTX 4090. Large models (70B+) need a workstation or data-center GPU like an RTX 6000 Ada or an A100/H100.

Question 3

Can I run an LLM on a laptop?

Accepted Answer

Yes, for small models. A laptop with 16GB of unified memory or an 8GB+ discrete GPU can run a quantized 7B–8B model for chat and light tasks. Apple Silicon laptops with 32GB+ of unified memory can run larger models because the GPU shares system memory. Bigger models still need a desktop or server GPU.

Question 4

What is the best GPU for running LLMs locally?

Accepted Answer

For most people running local LLMs, an NVIDIA RTX 4090 (24GB) is the sweet spot for price versus capability — it runs models up to about 32B at 4-bit. On a budget, an RTX 4060 Ti 16GB handles 7B–13B models well. For business on-premise use with large models, workstation cards (RTX 6000 Ada 48GB) or data-center GPUs (A100/H100 80GB) are the standard.

Question 5

Do I need an NVIDIA GPU to run AI locally?

Accepted Answer

No, but NVIDIA is the smoothest path because most tooling targets its CUDA platform. Apple Silicon (M-series) runs local models well through its unified memory and Metal, and is popular for on-device use. AMD GPUs work through ROCm but with more setup friction. For a business deploying at scale, NVIDIA remains the default.

Question 6

How much VRAM does a 70B model need?

Accepted Answer

A 70B-parameter model needs roughly 140 GB at full FP16 precision and about 70 GB at 8-bit for the weights alone, and around 40–48 GB at 4-bit quantization including overhead. That means a single 48GB workstation GPU or an 80GB data-center GPU for the 4-bit version, and multiple GPUs for full precision. Quantization is what makes large models practical to self-host.
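
The arithmetic behind these figures, and the jump to multiple GPUs, can be sketched as follows. The helper names `weights_gb` and `gpus_needed` are hypothetical, and the GPU count is a simple capacity division that assumes memory splits evenly across cards and ignores multi-GPU communication overhead.

```python
import math


def weights_gb(params_billion: float, bits_per_weight: int) -> float:
    """Size of the weights alone in GB (no KV cache or runtime overhead)."""
    return params_billion * bits_per_weight / 8


def gpus_needed(required_gb: float, vram_per_gpu_gb: float) -> int:
    """Minimum number of identical GPUs whose combined VRAM covers required_gb.

    ASSUMPTION: memory divides evenly across GPUs; real deployments
    lose some capacity to per-GPU buffers and communication.
    """
    return math.ceil(required_gb / vram_per_gpu_gb)


if __name__ == "__main__":
    for bits in (16, 8, 4):
        size = weights_gb(70, bits)
        print(f"70B at {bits}-bit: {size:.0f} GB of weights, "
              f"{gpus_needed(size, 80)} x 80GB GPU(s)")
```

For the 4-bit case, pass the 40–48 GB figure from the answer rather than the bare weight size if you want the estimate to include overhead.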

VRAM by Model Size (4-bit Quantized)

Model size	VRAM (4-bit)	Hardware that fits
3B–8B	~4–6 GB	Mainstream laptop or RTX 4060 8GB
13B	~8–10 GB	RTX 4060 Ti 16GB
32B	~18–22 GB	RTX 4090 24GB
70B	~40–48 GB	RTX 6000 Ada 48GB or A100 80GB
120B+	70 GB+	Multi-GPU server (2×+ H100)

Best GPU for Local LLMs

Budget	GPU	Runs (4-bit)
Entry	RTX 4060 Ti 16GB	Up to ~13B
Enthusiast	RTX 4090 24GB	Up to ~32B
Workstation	RTX 6000 Ada 48GB	Up to ~70B
Data center	A100 / H100 80GB	70B comfortably; larger with multi-GPU
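
As a quick illustration, the budget table can be encoded as a lookup that returns the smallest tier able to run a given model at 4-bit. The names `GPU_TIERS` and `pick_gpu` are hypothetical, and the size limits are the approximate "up to" values from the table, so treat results near a boundary as a starting point rather than a guarantee.

```python
# (budget tier, GPU, largest model in billions of parameters at 4-bit),
# copied from the table above and ordered from smallest to largest.
GPU_TIERS = [
    ("Entry", "RTX 4060 Ti 16GB", 13),
    ("Enthusiast", "RTX 4090 24GB", 32),
    ("Workstation", "RTX 6000 Ada 48GB", 70),
]


def pick_gpu(params_billion: float) -> tuple[str, str]:
    """Return (tier, GPU) for the smallest tier that fits the model at 4-bit."""
    for tier, gpu, max_params in GPU_TIERS:
        if params_billion <= max_params:
            return tier, gpu
    # Anything beyond ~70B falls to data-center cards, typically several of them.
    return "Data center", "A100 / H100 80GB (multi-GPU)"


if __name__ == "__main__":
    for size in (8, 32, 70, 120):
        tier, gpu = pick_gpu(size)
        print(f"{size}B -> {tier}: {gpu}")
```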

