
Running LLMs Locally: Ollama and the Open Source Model Ecosystem

I started using Ollama about six months ago because I wanted to experiment with language models without paying for API calls. The setup was easier than I expected, the model quality surprised me, and the privacy benefits are real. But it is not a drop-in replacement for cloud APIs, and the hardware requirements are something you need to plan for.

Getting Started

Ollama is a tool that downloads, manages, and runs language models locally. On macOS you install it by downloading the app from ollama.com; on Linux, installation is a single command:

curl -fsSL https://ollama.com/install.sh | sh

After that, pulling and running a model is as simple as:

ollama pull llama3:8b
ollama run llama3:8b

This downloads the Llama 3 8B model (about 4.7GB) and drops you into an interactive chat session. The first run takes a minute or two for the download, but subsequent runs start in a few seconds.

The model library has grown considerably. Some of the models I use regularly:

  • Llama 3 8B: Good general-purpose model, runs well on 8GB RAM
  • Llama 3 70B: Much more capable, needs 40GB+ RAM or a beefy GPU
  • Mistral 7B: Fast and surprisingly capable for its size
  • Phi-3: Microsoft's small model, great for coding tasks
  • Gemma 2: Google's open model, good at instruction following
  • CodeGemma: Specifically tuned for code generation

GGUF and Quantization

Running a 70 billion parameter model in full precision would require over 140GB of RAM. That is not practical for most machines. Quantization compresses the model weights from 16-bit or 32-bit floating point down to 4-bit or 8-bit integers, dramatically reducing memory requirements with a modest quality tradeoff.

The GGUF format (created by the llama.cpp project) is the standard for quantized models. The naming convention tells you the quantization level:

  • Q4_K_M: 4-bit quantization, medium quality. Best balance of size and quality for most use cases.
  • Q5_K_M: 5-bit, slightly better quality, slightly larger.
  • Q8_0: 8-bit, near-original quality, roughly double the size of Q4.

For the 8B Llama 3 model, Q4_K_M brings it down to about 4.7GB. Q8 would be about 8.5GB. The quality difference between Q4 and Q8 is noticeable if you are looking for it, but for most practical tasks (summarization, code generation, Q&A), Q4_K_M is perfectly fine.
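The arithmetic behind these numbers is straightforward: file size is roughly parameter count times effective bits per weight, divided by eight. A quick sketch (the bits-per-weight figures are approximations; K-quants store scale metadata, so Q4_K_M lands near 4.8 effective bits rather than exactly 4):

```javascript
// Rough size estimate: parameters * effective bits per weight / 8.
// The bpw values below are approximations, not exact format specs.
function estimateSizeGB(params, bitsPerWeight) {
  return (params * bitsPerWeight) / 8 / 1e9;
}

const LLAMA3_8B = 8.03e9; // parameter count

console.log(estimateSizeGB(LLAMA3_8B, 4.8).toFixed(1)); // roughly Q4_K_M
console.log(estimateSizeGB(LLAMA3_8B, 8.5).toFixed(1)); // roughly Q8_0
console.log(estimateSizeGB(70e9, 16).toFixed(0));       // 70B at fp16
```

The same formula explains the 140GB figure above: 70 billion parameters at 16 bits per weight is 140GB before any quantization.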

Hardware Requirements

The RAM/VRAM requirement is roughly: model file size + 1 to 2 GB overhead. So a 4.7GB model needs about 6GB of available memory. If you have a GPU, Ollama will use it automatically, and generation is significantly faster.
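That rule of thumb is simple enough to codify. A minimal sketch, assuming a 1.5GB midpoint for the overhead:

```javascript
// Will a model fit? Rule of thumb from above: file size plus roughly
// 1 to 2 GB of runtime overhead (KV cache, context buffers) must fit
// in available RAM/VRAM. 1.5 GB is a midpoint assumption.
const OVERHEAD_GB = 1.5;

function fitsInMemory(modelFileGB, availableGB) {
  return modelFileGB + OVERHEAD_GB <= availableGB;
}

console.log(fitsInMemory(4.7, 8));  // true: 8B at Q4 on an 8GB machine
console.log(fitsInMemory(40, 16));  // false: 70B at Q4 needs far more
```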

On my MacBook Pro with an M2 Pro (16GB unified memory), the 8B Llama 3 at Q4 runs at about 30 to 40 tokens per second. That is fast enough for interactive use. The 70B model at Q4 needs more memory than I have, so I run it on a desktop with 64GB RAM at about 8 tokens per second on CPU only. Usable, but you notice the wait.

If you have an NVIDIA GPU with 8GB+ VRAM, even the larger models become practical. A 3090 with 24GB VRAM can run 70B models at Q4 with respectable speed.

The API

Ollama serves a REST API on localhost:11434, including OpenAI-compatible endpoints under /v1. This means you can point any tool that supports the OpenAI API format at your local Ollama instance.

const response = await fetch('http://localhost:11434/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'llama3:8b',
    messages: [
      { role: 'system', content: 'You are a helpful assistant.' },
      { role: 'user', content: 'Explain WebAssembly in two sentences.' }
    ],
    temperature: 0.7,
  }),
});

const data = await response.json();
console.log(data.choices[0].message.content);

The OpenAI SDK also works with a base URL override:

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama', // required but not validated
});

const completion = await client.chat.completions.create({
  model: 'llama3:8b',
  messages: [{ role: 'user', content: 'Hello!' }],
});

This compatibility means you can develop against a local model and switch to a cloud API for production by changing the base URL and API key. The code stays the same.
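The same endpoint supports streaming, which matters more locally because generation is slower than in the cloud. A sketch using plain fetch and manual parsing of the server-sent events (assumes Node 18+ for built-in fetch, and a running Ollama instance with llama3:8b pulled):

```javascript
// Stream a chat completion from the local Ollama server token by token.
// With stream: true, the /v1 endpoint emits server-sent events: one
// "data: {...}" line per chunk, terminated by "data: [DONE]".
async function streamChat(prompt, onToken) {
  const response = await fetch('http://localhost:11434/v1/chat/completions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'llama3:8b',
      messages: [{ role: 'user', content: prompt }],
      stream: true,
    }),
  });

  const decoder = new TextDecoder();
  let buffer = '';
  for await (const chunk of response.body) {
    buffer += decoder.decode(chunk, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop(); // keep any partial line for the next chunk
    for (const line of lines) {
      const data = line.replace(/^data: /, '').trim();
      if (!data || data === '[DONE]') continue;
      const token = JSON.parse(data).choices[0]?.delta?.content;
      if (token) onToken(token);
    }
  }
}
```

Calling streamChat('Why is the sky blue?', t => process.stdout.write(t)) prints tokens as they arrive instead of waiting for the full completion.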

Modelfile Customization

Ollama lets you create custom model configurations with a Modelfile (similar to a Dockerfile):

FROM llama3:8b

PARAMETER temperature 0.3
PARAMETER top_p 0.9

SYSTEM "You are a senior software engineer. Provide concise, practical answers about code. Include code examples when relevant."

Build and run it:

ollama create code-assistant -f Modelfile
ollama run code-assistant

This is useful for creating purpose-specific assistants without fine-tuning. The system prompt and parameters are baked into the model configuration.

When Local Makes Sense

Privacy is the clearest use case. If you are working with proprietary code, customer data, or anything sensitive, running the model locally means the data never leaves your machine. No third-party API, no data retention policies to worry about.

Cost is another factor. If you make hundreds or thousands of API calls per day for development and testing, the cloud costs add up. A local model is free to run after the initial hardware investment.

Offline access is a nice bonus. I can use my local model on a plane, in a coffee shop with bad wifi, or anywhere else without internet access.

When Cloud Is Better

Quality. The best cloud models (GPT-4, the large variants from major providers) are still significantly better than the best open source models you can run locally, especially for complex reasoning, nuanced writing, and multi-step tasks.

Speed at scale. If you need to process thousands of requests per minute, cloud APIs with their GPU clusters will outperform anything you can run on a single machine.

Convenience. Cloud APIs require zero hardware management, no model downloads, and no memory optimization. You make a request and get a response.

My current workflow uses local models for development, experimentation, and private tasks. For production applications and tasks that require the highest quality output, I use cloud APIs. The two complement each other well.

Tags: ollama, llama, local-ai, gguf, quantization, open-source