Running LLMs Locally: The Engineer's Practical Guide to Ollama
No API keys, no cloud costs, no data leaving your machine. Ollama makes running LLMs locally practical for engineers who want real AI integration.
Cloud LLM APIs are convenient until they're not. Rate limits hit at the wrong moment. API costs compound. And for work involving proprietary code, customer data, or internal documents, you shouldn't be sending that data to a third-party server anyway.
Ollama solves this. It's the cleanest way to run open-source LLMs locally, with an API that's compatible with the OpenAI SDK.
Install and First Run
```shell
# Linux (macOS users typically install the app from ollama.com/download
# or run `brew install ollama`)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Llama 3.1 8B (4.7 GB download)
ollama run llama3.1

# Or pull without immediately running
ollama pull qwen2.5-coder:7b
ollama pull mistral-nemo
```

That's it. Ollama now runs as a local server on http://localhost:11434.
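Before wiring the server into code, it can help to confirm it is reachable and see which models are already pulled. Here is a minimal sketch using only the standard library, assuming the default port and Ollama's `/api/tags` endpoint, which lists local models. `model_names` and `list_local_models` are hypothetical helper names, not part of any Ollama package:

```python
import json
import urllib.request
from typing import Optional

OLLAMA_URL = "http://localhost:11434"  # assumed default address


def model_names(payload: dict) -> list[str]:
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL,
                      timeout: float = 3.0) -> Optional[list[str]]:
    """Return the names of pulled models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return model_names(json.load(resp))
    except OSError:  # covers connection refused, timeouts, and urllib's URLError
        return None
```

Calling `list_local_models()` returns `None` when nothing is listening on the port, which is a cheap way to fail early with a clear message instead of a stack trace from deeper in your application.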
Using the REST API
Ollama's native HTTP API takes the same model-plus-messages request shape as OpenAI's chat endpoint, though the response format differs:
```python
import requests


def chat(prompt: str, model: str = "llama3.1") -> str:
    """Send a single-turn chat request to the local Ollama server."""
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,  # return one JSON object instead of a chunk stream
        },
        timeout=120,  # generation can be slow while the model loads
    )
    response.raise_for_status()  # surface HTTP errors such as an unknown model
    return response.json()["message"]["content"]


# Use it
result = chat("Explain FreeRTOS task priorities in one paragraph.")
print(result)
```

Or use the OpenAI-compatible endpoint with the official SDK. If you're migrating from OpenAI, only the base URL, API key, and model name change:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by SDK but ignored
)

response = client.chat.completions.create(
    model="llama3.1",
    messages=[
        {"role": "system", "content": "You are an embedded systems expert."},
        {"role": "user", "content": "What is the difference between IRAM and DRAM on ESP32?"},
    ]
)
print(response.choices[0].message.content)
```

Streaming Responses
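On the wire, a streaming request to the native `/api/chat` endpoint returns newline-delimited JSON: one object per chunk, each carrying a fragment of the reply, with `"done": true` on the final object. If you are talking to the REST API directly rather than through a client library, reassembling the text could look roughly like this. `iter_stream_text` is a hypothetical helper name:

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from newline-delimited JSON chat chunks."""
    for raw in lines:
        if not raw.strip():
            continue  # skip empty lines, if any
        chunk = json.loads(raw)
        yield chunk.get("message", {}).get("content", "")
        if chunk.get("done"):
            break  # the final chunk marks the end of the reply
```

With `requests`, the lines can come from `requests.post(url, json={..., "stream": True}, stream=True).iter_lines()`, and each yielded fragment can be printed immediately.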
For interactive applications, stream tokens as they're generated:
```python
import ollama


def stream_response(prompt: str):
    stream = ollama.chat(
        model="llama3.1",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        print(chunk["message"]["content"], end="", flush=True)
    print()  # newline after completion


stream_response("Write a FreeRTOS task that blinks an LED every 500ms.")
```

Building an Offline Code Review Tool
A practical use case — local code review with no cloud dependency:
```python
import sys
from pathlib import Path

import ollama

SYSTEM_PROMPT = """You are an expert embedded systems code reviewer.
Focus on: memory safety, stack overflow risks, ISR correctness,
FreeRTOS API usage, and production reliability.
Be specific about line numbers and exact issues."""


def review_file(file_path: str) -> None:
    """Send one source file to a local model and print its review."""
    path = Path(file_path)
    code = path.read_text(encoding="utf-8")
    language = path.suffix.lstrip(".")  # e.g. "c" for sensor_task.c

    response = ollama.chat(
        model="qwen2.5-coder:7b",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": f"Review this {language} file:\n\n```{language}\n{code}\n```"
            },
        ],
    )
    print(response["message"]["content"])


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: python review.py <source-file>")
    review_file(sys.argv[1])
```

Run it:

```shell
python review.py src/sensor_task.c
```
Completely offline. Your code never leaves the machine.
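The system prompt asks for line numbers, but the model only sees raw text and has to count lines itself, which it can get wrong. One possible refinement is to prefix each line with its number before sending the file. This is a sketch, not part of the tool above: `number_lines` is a hypothetical helper, and the `N | ` prefix format is an arbitrary choice:

```python
def number_lines(code: str) -> str:
    """Prefix each line with a right-aligned, 1-based line number."""
    lines = code.splitlines()
    width = len(str(len(lines)))  # pad so the numbers line up
    return "\n".join(
        f"{i:>{width}} | {line}" for i, line in enumerate(lines, start=1)
    )
```

In `review_file`, passing `number_lines(code)` into the prompt instead of `code` gives the model explicit numbers to cite.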
Useful Models by Use Case
| Task | Model | Size | Why |
|---|---|---|---|
| General Q&A | llama3.1 | 4.7 GB | Fast, balanced, good instruction following |
| Code generation | qwen2.5-coder:7b | 4.7 GB | Best open-source coder at this size |
| Long documents | mistral-nemo | 7.1 GB | 128K context window |
| Embeddings | nomic-embed-text | 274 MB | Fast local embeddings for RAG |
| Fast chat | gemma2:2b | 1.6 GB | Runs on anything, surprisingly capable |
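The embeddings row works differently from the chat models: instead of text, the model returns a vector per input, and retrieval for RAG comes down to ranking stored vectors by similarity to the query vector. Here is a rough sketch using only the standard library, assuming a recent Ollama version that exposes the `/api/embed` endpoint. `embed`, `cosine_similarity`, and `rank` are hypothetical helper names:

```python
import json
import math
import urllib.request


def embed(texts: list[str], model: str = "nomic-embed-text",
          base_url: str = "http://localhost:11434") -> list[list[float]]:
    """Request one embedding vector per input text from a local Ollama server."""
    body = json.dumps({"model": model, "input": texts}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/embed",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["embeddings"]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec: list[float], doc_vecs: list[list[float]]) -> list[int]:
    """Return document indices ordered from most to least similar."""
    return sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
```

A typical flow would embed the documents once, store the vectors, then call `embed([question])[0]` per query and pass the top-ranked documents to a chat model as context.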
Modelfile: Custom System Prompts
Bake a system prompt into a custom model variant so you don't have to repeat it on every call:
```
# embedded-expert.Modelfile
FROM qwen2.5-coder:7b

SYSTEM """
You are an expert in embedded systems, ESP32, FreeRTOS, ESP-IDF, and Edge AI.
You answer in the style of a senior firmware engineer with 10+ years of production experience.
Prefer concise, production-focused answers with real code examples in C.
"""

PARAMETER temperature 0.3
PARAMETER num_ctx 8192
```

```shell
ollama create embedded-expert -f embedded-expert.Modelfile
ollama run embedded-expert "How do I handle heap fragmentation in ESP-IDF?"
```

Key Takeaways
- Ollama + OpenAI SDK = drop-in local replacement: change the base URL and model name, keep the rest of your code
- Qwen2.5-Coder can outperform larger general-purpose models on code: specialisation matters
- Modelfiles let you bake in system prompts: remove boilerplate from every API call
- No discrete GPU required for 7B models: Apple Silicon runs them well via Metal
- Your data stays local: the right choice for proprietary code, internal documents, and anything sensitive
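To make the first takeaway concrete, the endpoint can be read from the environment so the same code runs against Ollama or a hosted provider. This is a sketch under assumptions: the variable names `LLM_BASE_URL` and `LLM_API_KEY` and the helper `client_settings` are hypothetical, not conventions of Ollama or the OpenAI SDK:

```python
import os
from typing import Mapping


def client_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Build OpenAI-client keyword arguments, defaulting to a local Ollama server."""
    return {
        "base_url": env.get("LLM_BASE_URL", "http://localhost:11434/v1"),
        "api_key": env.get("LLM_API_KEY", "ollama"),  # Ollama ignores the key
    }
```

The client is then created with `OpenAI(**client_settings())`, and switching backends becomes a deployment setting rather than a code change.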
I build things that run on chips and the software that talks to them. ESP32, STM32, FreeRTOS, FastAPI, TinyML — from bare-metal firmware to cloud backends to on-device inference. Based in Bengaluru. Founder of Analog Data.