Local AI Hardware Calculator
See exactly what GPU, VRAM, and RAM you need to run an LLM locally — pick a model and use case, no sign-up required.
The question "what hardware do I need to run AI locally?" has one honest answer: it depends on the model size and how you quantize it. This calculator turns that into a concrete number — the VRAM, GPU tier, and system RAM you need to run a given model on your own hardware, whether that is a laptop, a workstation, or an on-premise server.
Estimate Your Local AI Hardware
How the Estimate Works
The estimate uses the standard rule of thumb for model memory. It is directional, not a guarantee, but it is close enough to plan a purchase or a cloud instance.
- Weights memory: Parameters (in billions) multiplied by bytes per weight. Full FP16 precision is 2 bytes per weight, 8-bit is 1 byte, and 4-bit quantization is about 0.5 bytes.
- Overhead: A multiplier for the KV cache, activations, and adapters. Plain inference adds roughly 20 percent; long-context RAG and LoRA fine-tuning add more.
- System RAM: A safe target is about 1.5 times your VRAM, with a 16 GB floor, so the model can load and the operating system stays responsive.
- GPU tier: The total VRAM maps to a class of card — consumer, prosumer, workstation, or data-center.
VRAM by Model Size (4-bit Quantized)
Most people run local LLMs at 4-bit quantization, which is the practical sweet spot. Here is roughly what each model size needs and the hardware that fits.
| Model size | VRAM (4-bit) | Hardware that fits |
|---|---|---|
| 3B–8B | ~4–6 GB | Mainstream laptop or RTX 4060 12GB |
| 13B | ~8–10 GB | RTX 4060 Ti 16GB |
| 32B | ~18–22 GB | RTX 4090 24GB |
| 70B | ~40–48 GB | RTX 6000 Ada 48GB or A100 80GB |
| 120B+ | 70 GB+ | Multi-GPU server (2×+ H100) |
How to Run an LLM Locally
Running an LLM locally takes three things: hardware that fits the model, a runtime to serve it, and an open-weights model to load. Once the hardware is sized with the calculator above, the software side is straightforward.
- Pick a runtime: Tools like Ollama, LM Studio, or llama.cpp run models on a laptop or desktop with little setup. For a server, vLLM and TGI serve models to many users at once.
- Choose a model: Download a quantized open-weights model that fits your VRAM budget — a small Mistral, Llama, or Qwen model is a good start.
- Serve it: Point your app at the local endpoint. Most runtimes expose an OpenAI-compatible API, so existing code often works with a one-line URL change.
For the full deployment picture — on-premise versus private cloud, security, and cost — read our how to run open-weights models guide, or our private AI for business decision guide.
Best GPU for Local LLMs
The best GPU for running LLMs locally is the one with the most VRAM you can justify, because memory — not raw speed — is what decides which models you can run at all. NVIDIA cards are the smoothest path thanks to broad tooling support.
| Budget | GPU | Runs (4-bit) |
|---|---|---|
| Entry | RTX 4060 Ti 16GB | Up to ~13B |
| Enthusiast | RTX 4090 24GB | Up to ~32B |
| Workstation | RTX 6000 Ada 48GB | Up to ~70B |
| Data center | A100 / H100 80GB | 70B comfortably; larger with multi-GPU |
Apple Silicon is a strong alternative for on-device use: because the GPU shares unified memory, a Mac with 64GB can run models that would need a 48GB discrete GPU. For business deployments serving many users, NVIDIA server GPUs remain the standard.
Frequently Asked Questions
- As a rule of thumb, VRAM needed equals the number of parameters (in billions) times the bytes per weight, plus overhead. A 7B model at 4-bit needs roughly 4–5 GB; the same model at full FP16 precision needs about 16 GB. A 70B model at 4-bit needs around 40–48 GB. Use the calculator above for your exact model and use case.
- It depends on the model size and how you quantize it. Small models (3B–8B) at 4-bit run on a mainstream consumer GPU like an RTX 4060 12GB. Mid-size models (13B–32B) want a 24GB card such as an RTX 4090. Large models (70B+) need a workstation or data-center GPU like an RTX 6000 Ada or an A100/H100.
- Yes, for small models. A laptop with 16GB of unified memory or a 8GB+ discrete GPU can run a quantized 7B–8B model for chat and light tasks. Apple Silicon laptops with 32GB+ of unified memory can run larger models because the GPU shares system memory. Bigger models still need a desktop or server GPU.
- For most people running local LLMs, an NVIDIA RTX 4090 (24GB) is the sweet spot for price versus capability — it runs models up to about 32B at 4-bit. On a budget, an RTX 4060 Ti 16GB handles 7B–13B models well. For business on-premise use with large models, workstation cards (RTX 6000 Ada 48GB) or data-center GPUs (A100/H100 80GB) are the standard.
- No, but NVIDIA is the smoothest path because most tooling targets its CUDA platform. Apple Silicon (M-series) runs local models well through its unified memory and Metal, and is popular for on-device use. AMD GPUs work through ROCm but with more setup friction. For a business deploying at scale, NVIDIA remains the default.
- A 70B-parameter model needs roughly 140 GB at full FP16 precision, about 70 GB at 8-bit, and around 40–48 GB at 4-bit quantization. That means a single 48GB workstation GPU or an 80GB data-center GPU for the quantized version, and multiple GPUs for full precision. Quantization is what makes large models practical to self-host.
Deciding whether to run AI in-house?
This calculator sizes the hardware. Our free assessment maps the whole decision — private cloud versus on-premise, model choice, cost versus a closed API — to your data, volume, and compliance needs.
Book a Consultation