Local AI Hardware Calculator

See exactly what GPU, VRAM, and RAM you need to run an LLM locally — pick a model and use case, no sign-up required.

The question "what hardware do I need to run AI locally?" has one honest answer: it depends on the model size and how you quantize it. This calculator turns that into a concrete number — the VRAM, GPU tier, and system RAM you need to run a given model on your own hardware, whether that is a laptop, a workstation, or an on-premise server.

Estimate Your Local AI Hardware

How the Estimate Works

The estimate uses the standard rule of thumb for model memory. It is directional, not a guarantee, but it is close enough to plan a purchase or a cloud instance.

  1. Weights memory: Parameters (in billions) multiplied by bytes per weight. Full FP16 precision is 2 bytes per weight, 8-bit is 1 byte, and 4-bit quantization is about 0.5 bytes.
  2. Overhead: A multiplier for the KV cache, activations, and adapters. Plain inference adds roughly 20 percent; long-context RAG and LoRA fine-tuning add more.
  3. System RAM: A safe target is about 1.5 times your VRAM, with a 16 GB floor, so the model can load and the operating system stays responsive.
  4. GPU tier: The total VRAM maps to a class of card — consumer, prosumer, workstation, or data-center.
Quantization is the lever. Running a model at 4-bit instead of full precision cuts its memory footprint by about 4×, often with little quality loss. It is the single biggest factor in whether a model fits on hardware you can afford.

VRAM by Model Size (4-bit Quantized)

Most people run local LLMs at 4-bit quantization, which is the practical sweet spot. Here is roughly what each model size needs and the hardware that fits.

Model sizeVRAM (4-bit)Hardware that fits
3B–8B~4–6 GBMainstream laptop or RTX 4060 12GB
13B~8–10 GBRTX 4060 Ti 16GB
32B~18–22 GBRTX 4090 24GB
70B~40–48 GBRTX 6000 Ada 48GB or A100 80GB
120B+70 GB+Multi-GPU server (2×+ H100)

How to Run an LLM Locally

Running an LLM locally takes three things: hardware that fits the model, a runtime to serve it, and an open-weights model to load. Once the hardware is sized with the calculator above, the software side is straightforward.

  1. Pick a runtime: Tools like Ollama, LM Studio, or llama.cpp run models on a laptop or desktop with little setup. For a server, vLLM and TGI serve models to many users at once.
  2. Choose a model: Download a quantized open-weights model that fits your VRAM budget — a small Mistral, Llama, or Qwen model is a good start.
  3. Serve it: Point your app at the local endpoint. Most runtimes expose an OpenAI-compatible API, so existing code often works with a one-line URL change.

For the full deployment picture — on-premise versus private cloud, security, and cost — read our how to run open-weights models guide, or our private AI for business decision guide.

Best GPU for Local LLMs

The best GPU for running LLMs locally is the one with the most VRAM you can justify, because memory — not raw speed — is what decides which models you can run at all. NVIDIA cards are the smoothest path thanks to broad tooling support.

BudgetGPURuns (4-bit)
EntryRTX 4060 Ti 16GBUp to ~13B
EnthusiastRTX 4090 24GBUp to ~32B
WorkstationRTX 6000 Ada 48GBUp to ~70B
Data centerA100 / H100 80GB70B comfortably; larger with multi-GPU

Apple Silicon is a strong alternative for on-device use: because the GPU shares unified memory, a Mac with 64GB can run models that would need a 48GB discrete GPU. For business deployments serving many users, NVIDIA server GPUs remain the standard.


Frequently Asked Questions

  • As a rule of thumb, VRAM needed equals the number of parameters (in billions) times the bytes per weight, plus overhead. A 7B model at 4-bit needs roughly 4–5 GB; the same model at full FP16 precision needs about 16 GB. A 70B model at 4-bit needs around 40–48 GB. Use the calculator above for your exact model and use case.
  • It depends on the model size and how you quantize it. Small models (3B–8B) at 4-bit run on a mainstream consumer GPU like an RTX 4060 12GB. Mid-size models (13B–32B) want a 24GB card such as an RTX 4090. Large models (70B+) need a workstation or data-center GPU like an RTX 6000 Ada or an A100/H100.
  • Yes, for small models. A laptop with 16GB of unified memory or a 8GB+ discrete GPU can run a quantized 7B–8B model for chat and light tasks. Apple Silicon laptops with 32GB+ of unified memory can run larger models because the GPU shares system memory. Bigger models still need a desktop or server GPU.
  • For most people running local LLMs, an NVIDIA RTX 4090 (24GB) is the sweet spot for price versus capability — it runs models up to about 32B at 4-bit. On a budget, an RTX 4060 Ti 16GB handles 7B–13B models well. For business on-premise use with large models, workstation cards (RTX 6000 Ada 48GB) or data-center GPUs (A100/H100 80GB) are the standard.
  • No, but NVIDIA is the smoothest path because most tooling targets its CUDA platform. Apple Silicon (M-series) runs local models well through its unified memory and Metal, and is popular for on-device use. AMD GPUs work through ROCm but with more setup friction. For a business deploying at scale, NVIDIA remains the default.
  • A 70B-parameter model needs roughly 140 GB at full FP16 precision, about 70 GB at 8-bit, and around 40–48 GB at 4-bit quantization. That means a single 48GB workstation GPU or an 80GB data-center GPU for the quantized version, and multiple GPUs for full precision. Quantization is what makes large models practical to self-host.

Deciding whether to run AI in-house?

This calculator sizes the hardware. Our free assessment maps the whole decision — private cloud versus on-premise, model choice, cost versus a closed API — to your data, volume, and compliance needs.

Book a Consultation