How to Run Open-Weights Models: Local, Self-Hosted, and Hosted Options

A practical map of the tools that let your business run open-weights AI — from a laptop trial to a production API — and how to choose between them.

If you've decided to use an open-weights model, the next question is the practical one: how do you actually run it? Learning how to run open-weights models is less intimidating than it sounds. The ecosystem has matured into a clean spectrum, from a free desktop app you install in five minutes to a GPU-backed server that powers a production app behind your own internal API.

This guide is written for business buyers and operators, not infrastructure engineers. We won't walk through command-line tutorials. Instead, we'll explain the four ways to run open-weights AI locally or self-hosted, what each is good for, what it costs you in effort and hardware, and how to pick based on a single question: what are you actually trying to do?

The short version: if you just want to try a model, a desktop app gets you there today. If your team writes code, a local coding assistant gives you a private, free alternative to cloud tools. If you're building a real product, a self-hosted inference server is the production answer. And if you want zero infrastructure, hosted open-weights APIs let someone else run the hardware while you keep the flexibility of open models.


The four ways to run open-weights models

Every option for running open-weights AI falls into one of four buckets. They line up neatly from easiest to most scalable, and most companies move along this path over time — trying a model on a laptop first, then putting it to work, then hardening it for production.

You don't have to pick the most advanced option. The right choice is the simplest one that meets your goal. Here's the map before we go deeper on each.

  • Desktop apps (easiest): Install a free app, download a model, chat with it on your own machine. Best for evaluating models and private one-off use.
  • Local coding assistants (for teams that build): Plug a local model into your code editor for a private, free coding helper that never sends your code to the cloud.
  • Self-hosted inference servers (production): Run a model on your own GPU server and expose it behind an internal API your apps can call. Best for real products and high volume.
  • Hosted open-weights APIs (no infra): Call an open model through a provider's API — open-model flexibility without running any hardware yourself.
Rule of thumb: match the tool to the goal, not to its sophistication. Most teams overspend by jumping straight to a GPU server when a desktop app would have answered their question for free.

Run open-weights AI locally with a desktop app

The fastest way to run open-weights AI locally is a desktop app. These are free installers — like any other program — that download a model file to your computer and let you chat with it through a normal window. Nothing leaves your machine, and you don't need a developer to set it up.

Two tools dominate this category. Ollama is a lightweight runner that's become the default for quick experimentation; it's free, open-source, and works on Mac, Windows, and Linux. LM Studio is a polished graphical app with a built-in model browser, so a non-technical user can search for a model, click download, and start chatting without ever touching a terminal.

This is the right starting point for almost everyone. Use it to evaluate whether a given open model is good enough for your use case before you invest in anything bigger, or to give a small team private access to AI on hardware they already own.

  • Best for: evaluating models, private personal use, small-team trials with no budget.
  • Effort: minutes. Download, install, pick a model, go.
  • Cost: free software; you only pay for the computer you already have.
  • Limit: one user at a time per machine, and model size is capped by your laptop's memory.

A private, free coding assistant for teams that build

If your team writes software, there's a high-value way to run open-weights models that pays for itself quickly: a local coding assistant. These plug an open model directly into your code editor, giving developers autocomplete and chat-based help — the same experience as a commercial AI coding tool, but running on a model you control.

The appeal for a business is twofold. First, privacy: your proprietary source code never leaves your network, which matters in regulated industries or when you've signed strict client confidentiality terms. Second, cost: once the model is running, the assistance is effectively free, with no per-seat subscription.

Continue.dev is the most common entry point here — an open-source extension for VS Code and JetBrains that connects to a local model (for example, one served by Ollama) so your editor's AI features run privately. Cursor, a popular AI-first editor, can also be pointed at custom or self-hosted models for teams that want that route. Both let you keep the modern coding experience without the cloud dependency.

  • Best for: development teams that want AI coding help without sending code to a third party.
  • How it works: a code-editor extension talks to a model running locally or on your own server.
  • Payoff: a private coding assistant with no per-seat license fees once it's running.

Self-hosted inference servers for production

When you're putting an open model into a real product — a customer-facing chatbot, a document-processing pipeline, an internal tool used by many people — a desktop app no longer fits. You need a self-hosted inference server: software that runs the model on a dedicated machine and exposes it behind an internal API your applications can call, the same way they'd call a cloud service.

vLLM is the standard choice for serious production serving on GPUs. It's built for many simultaneous users and high throughput, which is what you need when an app makes thousands of requests. llama.cpp (and its built-in llama-server) is the lighter-weight option, valued for running efficiently on a wide range of hardware, including setups without expensive GPUs.

This tier requires more setup and someone with technical ownership, but it's what gives you full control: your data stays on your infrastructure, costs are predictable, and you're not subject to a vendor's rate limits or pricing changes. This is usually the point where a consulting partner earns their keep — sizing the hardware and standing up the server correctly is where most teams get stuck.

  • Best for: production apps, high request volume, strict data-residency requirements.
  • vLLM: high-throughput serving for many concurrent users, GPU-focused.
  • llama.cpp / llama-server: efficient, flexible, runs on more modest hardware.
  • Trade-off: more setup and ongoing ownership in exchange for control and predictable cost.

Hosted open-weights APIs: skip the infrastructure

There's a middle path that many teams miss: you can use open-weights models without running any hardware at all. A number of providers host popular open models and offer them through a simple API. You get the flexibility and lower per-token cost of open models, while the provider handles the GPUs, scaling, and uptime.

This is the pragmatic choice when you want open-model economics and the freedom to switch models, but you don't want to own infrastructure — or you're not yet sure your volume justifies it. It's also a low-risk way to prototype a production app: build against a hosted open-weights API now, and move to your own self-hosted server later if data control or scale demands it, often with minimal code changes because the APIs look similar.

The catch is that your data does pass through a third party, so this isn't the right fit when strict privacy or data-residency is the whole reason you chose open models. For everyone else, it's frequently the best balance of speed, cost, and effort.

  • Best for: teams that want open-model flexibility without managing servers.
  • Upside: no hardware, pay-per-use, easy to switch models.
  • Watch-out: your data leaves your environment, so weigh it against your privacy requirements.

The hardware reality: what runs where

The single biggest factor in whether you can run a model locally is its size, measured in parameters (billions of them). Bigger models are smarter but need far more memory and computing power. This is the practical line that decides which of the four paths is even open to you.

Small and mid-size models run comfortably on a good laptop, and a modern Mac with plenty of unified memory is particularly well-suited because the chip and memory are shared. These are enough for a surprising amount of real work — drafting, summarizing, classification, and coding help.

The largest open models are a different story. They need dedicated GPUs, and the most capable ones may need multiple high-end GPUs working together. That's not laptop territory; it's the self-hosted-server or hosted-API conversation. When someone quotes a model as needing serious hardware, this is what they mean.

  • Small models: run on a good laptop or a Mac with ample memory — fine for most everyday tasks.
  • Large models: need one or more dedicated GPUs — that means a self-hosted server or a hosted API.
  • Quick gut check: if a model's size makes your laptop struggle, that's your signal to move up the spectrum.
Don't buy a GPU server to answer 'is this model good enough for us?' Test the model first on a laptop or a hosted API. Only commit to hardware once a real workload proves it's worth it.

How to pick, and the fastest way to start

Cut through the options with one question — what are you trying to do right now? — and the choice usually makes itself.

The fastest way to start, for almost any team, is to install a desktop app like LM Studio or Ollama and try a model on a machine you already own. It costs nothing, takes minutes, and answers the most important question — is this model good enough for our use case? — before you spend a dollar on infrastructure. You can always graduate to a server or a hosted API once a real workload proves the value.

  • Just want to try it? Use a desktop app (Ollama, LM Studio). Free, minutes to set up.
  • Want private coding help for your team? Use a local coding assistant (Continue.dev in VS Code).
  • Building a production app? Stand up a self-hosted inference server (vLLM or llama.cpp) behind an internal API.
  • Want open models with zero infrastructure? Use a hosted open-weights API.

Conclusion: start small, scale when it's earned

Learning how to run open-weights models comes down to matching the tool to your goal, not chasing the most advanced setup. The spectrum runs from a free desktop app for trying a model, to a private coding assistant for teams that build, to a self-hosted inference server for production, to a hosted API when you'd rather skip the infrastructure entirely.

Start at the easy end. Run open-weights AI locally on a laptop, confirm the model does the job, and only move up to a GPU server or hosted API once a real workload justifies it. That sequence keeps your spending tied to proven value — and it's exactly the path we use when helping clients adopt open models without overbuilding.

Frequently Asked Questions

  • A free desktop app. Ollama or LM Studio installs like any other program, downloads a model to your computer, and lets you chat with it through a normal window — no developer or terminal required. It runs entirely on your machine, so nothing leaves your network. This is the best way to evaluate a model before investing in anything bigger.
  • Yes, for small and mid-size models. A good laptop — especially a modern Mac with ample unified memory — can run models capable of drafting, summarizing, classification, and coding help. The largest open models need dedicated GPUs and belong on a self-hosted server or a hosted API, not a laptop.
  • Use a local coding assistant such as Continue.dev, an open-source extension for VS Code and JetBrains that connects your editor to an open model running locally (for example via Ollama). Your source code never leaves your network, and there are no per-seat subscription fees once it's running. Cursor can also be pointed at custom or self-hosted models.
  • A self-hosted inference server. vLLM is the standard for high-throughput serving to many concurrent users on GPUs, while llama.cpp (with llama-server) is a lighter option that runs on more modest hardware. Both expose the model behind an internal API your applications call like any cloud service, keeping data on your own infrastructure.
  • No. Hosted open-weights APIs let providers run the GPUs while you call popular open models over a simple API. You get open-model flexibility and pay-per-use pricing without managing servers. The trade-off is that your data passes through a third party, so it's not the right fit if strict privacy is your main reason for choosing open models.
  • Install a desktop app like Ollama or LM Studio and try a model on a machine you already own. It's free, takes minutes, and answers the key question — is this model good enough for us? — before you spend anything on infrastructure. Move to a server or hosted API only once a real workload proves the value.

Not sure which path fits your team?

Layer3 Labs helps small and mid-size businesses choose, deploy, and run open-weights AI — from a first laptop trial to a production inference server behind your own API. We size the hardware, stand up the tooling, and keep your data where you want it.

Book a free open-weights consult