Frequently Asked Questions

What hardware do I need to run LLMs locally?

For 7B models: **8 GB RAM minimum, 16 GB comfortable**. A GPU is not required: Apple Silicon (M1/M2/M3) Macs run 7B models well via Metal. On Linux or Windows, an NVIDIA GPU (RTX 3060 or newer) makes inference significantly faster than CPU-only. For 13B+ models, plan on 32 GB RAM or a GPU with 12 GB+ of VRAM.
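As a rough sanity check on those numbers, the weights of a quantized model take about parameters × bits per weight ÷ 8 bytes. The sketch below assumes roughly 4.5 bits per weight for a 4-bit quantization and ignores the context cache and runtime overhead. Both the figure and the function name `estimate_weights_gb` are assumptions for illustration, not measured values.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Compare a few common model sizes under the assumed quantization.
    for label, size in [("7B", 7), ("8B", 8), ("13B", 13)]:
        print(f"{label}: ~{estimate_weights_gb(size):.1f} GB of weights")
```

Under these assumptions an 8B model needs about 4.5 GB for weights alone, which is why 8 GB of RAM works but leaves little headroom for the operating system and longer contexts.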

Is Ollama production-ready for serving multiple users?

For a single developer or small team — yes. For production serving at scale, use **vLLM** or **llama.cpp server** with proper load balancing. Ollama is excellent for local development, prototyping, and small internal tools. It's not designed for high-concurrency public-facing APIs.

Which model should I start with?

Start with **Llama 3.1 8B** for general tasks — it's fast, capable, and runs on most hardware. For coding specifically, **Qwen2.5-Coder 7B** outperforms larger general models. For document analysis and long context, **Mistral Nemo 12B** handles 128K context efficiently.

Running LLMs Locally: The Engineer's Practical Guide to Ollama

No API keys, no cloud costs, no data leaving your machine. Ollama makes running LLMs locally practical for engineers who want real AI integration.

Level: Beginner

Originally published on Company Blog on 6/18/2026

Rajath Kumar, Edge AI Engineer & Founder, Analog Data

2026-06-19 · 8 min read

Cloud LLM APIs are convenient until they're not. Rate limits hit at the wrong moment. API costs compound. And for any work involving proprietary code, customer data, or internal documents — you shouldn't be sending that to a third-party server anyway.

Ollama solves this. It's the cleanest way to run open-source LLMs locally, with an API that's compatible with the OpenAI SDK.

Install and First Run

```bash
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Llama 3.1 8B (4.7 GB download)
ollama run llama3.1

# Or pull without immediately running
ollama pull qwen2.5-coder:7b
ollama pull mistral-nemo
```

That's it. Ollama now runs as a local server at http://localhost:11434 and serves every model you have pulled.
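A quick way to confirm the server is up is to ask it which models it has. The sketch below uses only the standard library and the `GET /api/tags` endpoint; the helper names `model_names` and `list_local_models` are made up for this example.

```python
import json
from urllib.request import urlopen

OLLAMA_URL = "http://localhost:11434"  # default local address


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from the JSON body returned by GET /api/tags."""
    return [m["name"] for m in tags_payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL) -> list[str]:
    """Ask the running Ollama server which models are available locally."""
    with urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))


# Usage (needs a running server):
#   print(list_local_models())
```

If the call raises a connection error, the server is most likely not running yet.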

Using the REST API

Ollama exposes a native HTTP API. Its chat endpoint takes the same `role`/`content` message list as OpenAI's, although the response is shaped differently:

```python
import requests


def chat(prompt: str, model: str = "llama3.1") -> str:
    """Send a single-turn prompt to the local Ollama server and return the reply."""
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,  # return one complete JSON object
        },
        timeout=120,
    )
    response.raise_for_status()  # surface HTTP errors such as an unknown model
    return response.json()["message"]["content"]


# Use it
result = chat("Explain FreeRTOS task priorities in one paragraph.")
print(result)
```

Or use the OpenAI-compatible endpoint with the official SDK. If you're migrating from OpenAI, only the client's `base_url` and the model name change:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",   # required by SDK but ignored
)

response = client.chat.completions.create(
    model="llama3.1",
    messages=[
        {"role": "system", "content": "You are an embedded systems expert."},
        {"role": "user", "content": "What is the difference between IRAM and DRAM on ESP32?"},
    ]
)
print(response.choices[0].message.content)
```
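Both chat endpoints are stateless: the model only sees the messages included in each request, so a multi-turn conversation has to resend its history. A minimal sketch of that bookkeeping follows. The `Conversation` class and its `send` callback are hypothetical names for this example, and `send` stands in for whichever client call you use.

```python
from typing import Callable

Message = dict[str, str]


class Conversation:
    """Keeps the running message list so each request carries the full history."""

    def __init__(self, send: Callable[[list[Message]], str], system: str = "") -> None:
        self.send = send  # function that takes messages and returns the reply text
        self.messages: list[Message] = []
        if system:
            self.messages.append({"role": "system", "content": system})

    def ask(self, prompt: str) -> str:
        """Append the user turn, call the model, and record its reply."""
        self.messages.append({"role": "user", "content": prompt})
        reply = self.send(self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply
```

With the OpenAI SDK, `send` could be a small function that calls `client.chat.completions.create(model="llama3.1", messages=msgs)` and returns `choices[0].message.content`.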

Streaming Responses

For interactive applications, stream tokens as they're generated:

```python
import ollama

def stream_response(prompt: str):
    stream = ollama.chat(
        model="llama3.1",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        print(chunk["message"]["content"], end="", flush=True)
    print()   # newline after completion

stream_response("Write a FreeRTOS task that blinks an LED every 500ms.")
```
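The `ollama` library hides the wire format. If you call the REST API directly with `"stream": true`, the server sends one JSON object per line and marks the last one with `"done": true`. The sketch below parses such a stream; `iter_tokens` is a hypothetical helper name, and the input can be any iterable of byte lines.

```python
import json
from typing import Iterable, Iterator


def iter_tokens(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield content fragments from a newline-delimited JSON chat stream."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines, if any
        chunk = json.loads(raw)
        yield chunk.get("message", {}).get("content", "")
        if chunk.get("done"):
            break  # the final object closes the stream


# Usage with requests (needs a running server):
#   resp = requests.post(url, json={..., "stream": True}, stream=True)
#   for token in iter_tokens(resp.iter_lines()):
#       print(token, end="", flush=True)
```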

Building an Offline Code Review Tool

A practical use case — local code review with no cloud dependency:

```python
import sys
from pathlib import Path

import ollama

SYSTEM_PROMPT = """You are an expert embedded systems code reviewer.
Focus on: memory safety, stack overflow risks, ISR correctness,
FreeRTOS API usage, and production reliability.
Be specific about line numbers and exact issues."""

def review_file(file_path: str) -> None:
    """Send one source file to the local model and print its review."""
    path = Path(file_path)
    code = path.read_text(encoding="utf-8")
    language = path.suffix.lstrip(".")  # e.g. "c" for sensor_task.c

    response = ollama.chat(
        model="qwen2.5-coder:7b",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": f"Review this {language} file:\n\n```{language}\n{code}\n```"
            },
        ],
    )
    print(response["message"]["content"])

if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: python review.py <source-file>")
    review_file(sys.argv[1])
```

Run it: python review.py src/sensor_task.c

Completely offline. Your code never leaves the machine.
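The system prompt asks the model to cite line numbers, but the model only sees raw text and has to count lines itself. One possible refinement, not part of the tool above, is to prefix each line with its number before sending the file. `number_lines` is a hypothetical helper for this sketch.

```python
def number_lines(code: str) -> str:
    """Prefix each line with a right-aligned, 1-based line number."""
    lines = code.splitlines()
    width = len(str(len(lines)))  # pad so the separators line up
    return "\n".join(
        f"{i:>{width}} | {line}" for i, line in enumerate(lines, start=1)
    )


if __name__ == "__main__":
    print(number_lines("int main(void) {\n    return 0;\n}"))
```

In `review_file`, this would mean passing `number_lines(code)` into the prompt instead of `code`.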

Useful Models by Use Case

| Task | Model | Size | Why |
|---|---|---|---|
| General Q&A | `llama3.1` | 4.7 GB | Fast, balanced, good instruction following |
| Code generation | `qwen2.5-coder:7b` | 4.7 GB | Best open-source coder at this size |
| Long documents | `mistral-nemo` | 7.1 GB | 128K context window |
| Embeddings | `nomic-embed-text` | 274 MB | Fast local embeddings for RAG |
| Fast chat | `gemma2:2b` | 1.6 GB | Runs on anything, surprisingly capable |
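The embeddings row works differently from the chat models: instead of text, the model returns a vector, and RAG retrieval ranks documents by how close their vectors are to the query's. The sketch below assumes the `POST /api/embed` endpoint of recent Ollama releases and uses only the standard library. `embed` and `cosine_similarity` are names chosen for this example.

```python
import json
import math
from urllib.request import Request, urlopen


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embed(text: str, model: str = "nomic-embed-text",
          base_url: str = "http://localhost:11434") -> list[float]:
    """Request one embedding vector from the local server."""
    body = json.dumps({"model": model, "input": text}).encode()
    req = Request(f"{base_url}/api/embed", data=body,
                  headers={"Content-Type": "application/json"})
    with urlopen(req, timeout=30) as resp:
        return json.load(resp)["embeddings"][0]


# Usage (needs a running server and `ollama pull nomic-embed-text`):
#   score = cosine_similarity(embed("stack overflow in ISR"), embed("interrupt stack size"))
```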

Modelfile: Custom System Prompts

Bake a system prompt into a custom model variant so you don't repeat it every call:

```dockerfile
# embedded-expert.Modelfile
FROM qwen2.5-coder:7b

SYSTEM """
You are an expert in embedded systems, ESP32, FreeRTOS, ESP-IDF, and Edge AI.
You answer in the style of a senior firmware engineer with 10+ years of production experience.
Prefer concise, production-focused answers with real code examples in C.
"""

PARAMETER temperature 0.3
PARAMETER num_ctx 8192
```

```bash
ollama create embedded-expert -f embedded-expert.Modelfile
ollama run embedded-expert "How do I handle heap fragmentation in ESP-IDF?"
```
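The custom model is then available to the API under its new name, and the Modelfile parameters act as defaults. The chat endpoint also accepts an `options` object for per-request overrides. A minimal sketch of building such a request body follows; `build_chat_payload` is a hypothetical helper, not part of any Ollama library.

```python
def build_chat_payload(prompt: str, model: str = "embedded-expert", **options) -> dict:
    """Build a non-streaming /api/chat request body, with optional overrides."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    if options:
        payload["options"] = options  # e.g. temperature, num_ctx
    return payload


# Usage (needs a running server):
#   requests.post("http://localhost:11434/api/chat",
#                 json=build_chat_payload("Explain OTA rollback.", temperature=0.7))
```

Leaving `options` out keeps the values baked into the Modelfile.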

Key Takeaways

- Ollama + OpenAI SDK = drop-in local replacement: change the base URL and model name, keep the rest of your code
- Qwen2.5-Coder beats larger general models for code: specialisation matters
- Modelfiles let you bake in system prompts: no more boilerplate in every API call
- No GPU required for 7B models: Apple Silicon runs them well via Metal
- Your data stays local: the right choice for proprietary code, internal documents, and anything sensitive

Written by

Rajath Kumar

Edge AI Engineer & Founder, Analog Data

Bengaluru, India

I build things that run on chips and the software that talks to them. ESP32, STM32, FreeRTOS, FastAPI, TinyML — from bare-metal firmware to cloud backends to on-device inference. Based in Bengaluru. Founder of Analog Data.
