Running LLMs Locally: The Engineer's Practical Guide to Ollama

No API keys, no cloud costs, no data leaving your machine. Ollama makes running LLMs locally practical for engineers who want real AI integration.

Level: Beginner
Originally published on Company Blog on 6/18/2026
Rajath Kumar, Edge AI Engineer & Founder, Analog Data
2026-06-19·8 min read

Cloud LLM APIs are convenient until they're not. Rate limits hit at the wrong moment. API costs compound. And for any work involving proprietary code, customer data, or internal documents — you shouldn't be sending that to a third-party server anyway.

Ollama solves this. It's one of the simplest ways to run open-source LLMs locally, and it exposes an endpoint that's compatible with the OpenAI SDK.


Install and First Run

bash
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Llama 3.1 8B (4.7 GB download)
ollama run llama3.1

# Or pull without immediately running
ollama pull qwen2.5-coder:7b
ollama pull mistral-nemo

That's it. Ollama now runs as a local server on http://localhost:11434 and serves the models you've pulled.
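To confirm the server is up before wiring it into code, one option is to ask it which models it has. The sketch below assumes the default port and uses Ollama's `GET /api/tags` endpoint, which lists locally pulled models. The helper names `model_names` and `list_local_models` are illustrative, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address; adjust if yours differs


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in tags_payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> list[str]:
    """Return the names of locally pulled models; raises if the server is unreachable."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return model_names(json.load(resp))
```

After the pulls above, `list_local_models()` should include entries such as `llama3.1:latest`. A `URLError` usually means the server isn't running.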


Using the REST API

Ollama exposes a native HTTP API whose chat endpoint takes the same role/content message list as OpenAI's:

python
import requests

def chat(prompt: str, model: str = "llama3.1") -> str:
    """Send a single user message and return the full reply text."""
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,  # one JSON object instead of a token stream
        },
        timeout=120,  # local generation can be slow on CPU
    )
    response.raise_for_status()  # surface HTTP errors such as an unknown model
    return response.json()["message"]["content"]

# Use it
result = chat("Explain FreeRTOS task priorities in one paragraph.")
print(result)

Or use the OpenAI-compatible endpoint with the official SDK. If you're migrating from OpenAI, only the base URL, API key, and model name change:

python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",   # required by SDK but ignored
)

response = client.chat.completions.create(
    model="llama3.1",
    messages=[
        {"role": "system", "content": "You are an embedded systems expert."},
        {"role": "user", "content": "What is the difference between IRAM and DRAM on ESP32?"},
    ]
)
print(response.choices[0].message.content)
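Since the local and hosted setups differ only in connection settings and model name, a small config helper can switch one code path between them. This is a sketch under assumptions: `LLM_BACKEND` and `LLM_MODEL` are hypothetical environment variable names invented for this example, and `client_settings` is not part of any SDK.

```python
import os


def client_settings(env=None) -> dict:
    """Return base_url, api_key and model for the selected backend.

    LLM_BACKEND and LLM_MODEL are made-up variable names for this sketch.
    """
    env = os.environ if env is None else env
    if env.get("LLM_BACKEND", "ollama") == "ollama":
        return {
            "base_url": "http://localhost:11434/v1",
            "api_key": "ollama",  # placeholder; Ollama ignores it
            "model": env.get("LLM_MODEL", "llama3.1"),
        }
    # Hosted backend: leave base_url unset so the SDK uses its default.
    return {
        "base_url": None,
        "api_key": env["OPENAI_API_KEY"],
        "model": env["LLM_MODEL"],
    }
```

With this in place, `OpenAI(base_url=s["base_url"], api_key=s["api_key"])` and `model=s["model"]` (where `s = client_settings()`) replace the hard-coded values, and the rest of the calling code stays the same.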

Streaming Responses

For interactive applications, stream tokens as they're generated:

python
import ollama

def stream_response(prompt: str):
    stream = ollama.chat(
        model="llama3.1",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        print(chunk["message"]["content"], end="", flush=True)
    print()   # newline after completion

stream_response("Write a FreeRTOS task that blinks an LED every 500ms.")
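The same streaming works over plain HTTP without the `ollama` package. With `"stream": true`, the native `/api/chat` endpoint returns newline-delimited JSON objects, each carrying a fragment of the message, with `done` set to true on the last one. A minimal standard-library sketch follows; the function names are illustrative.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_tokens(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield content fragments from an NDJSON chat stream until 'done' is seen."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        chunk = json.loads(raw)
        yield chunk.get("message", {}).get("content", "")
        if chunk.get("done"):
            break


def stream_chat(prompt: str, model: str = "llama3.1") -> Iterator[str]:
    """Send one prompt to a local Ollama server and yield fragments as they arrive."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        yield from iter_tokens(resp)  # the response object iterates line by line
```

Consuming it looks like `for tok in stream_chat("..."): print(tok, end="", flush=True)`. Keeping the parsing in `iter_tokens` makes it easy to test without a running server.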

Building an Offline Code Review Tool

A practical use case — local code review with no cloud dependency:

python
import sys
from pathlib import Path

import ollama

SYSTEM_PROMPT = """You are an expert embedded systems code reviewer.
Focus on: memory safety, stack overflow risks, ISR correctness,
FreeRTOS API usage, and production reliability.
Be specific about line numbers and exact issues."""

def review_file(file_path: str) -> None:
    """Send one source file to the local model and print its review."""
    path = Path(file_path)
    code = path.read_text(encoding="utf-8")
    language = path.suffix.lstrip(".")  # e.g. "c" for sensor_task.c

    response = ollama.chat(
        model="qwen2.5-coder:7b",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": f"Review this {language} file:\n\n```{language}\n{code}\n```"
            },
        ],
    )
    print(response["message"]["content"])

if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: python review.py <source-file>")
    review_file(sys.argv[1])

Run it: python review.py src/sensor_task.c

Completely offline. Your code never leaves the machine.
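The system prompt asks for line numbers, but the model only sees raw text and has to count lines itself, which it may do unreliably. One possible refinement is to prefix each line with its number before sending the file. `number_lines` is a hypothetical helper, not part of the script above.

```python
def number_lines(code: str, start: int = 1) -> str:
    """Prefix every line with a right-aligned line number so a review can cite it."""
    lines = code.splitlines()
    # Width of the largest line number, so the separators line up.
    width = len(str(start + len(lines) - 1)) if lines else 1
    return "\n".join(f"{n:>{width}} | {line}" for n, line in enumerate(lines, start))
```

In `review_file`, passing `number_lines(code)` instead of `code` into the prompt would be the only change needed.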


Useful Models by Use Case

| Task | Model | Size | Why |
| --- | --- | --- | --- |
| General Q&A | llama3.1 | 4.7 GB | Fast, balanced, good instruction following |
| Code generation | qwen2.5-coder:7b | 4.7 GB | Best open-source coder at this size |
| Long documents | mistral-nemo | 7.1 GB | 128K context window |
| Embeddings | nomic-embed-text | 274 MB | Fast local embeddings for RAG |
| Fast chat | gemma2:2b | 1.6 GB | Runs on anything, surprisingly capable |
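For the embeddings use case, retrieval comes down to embedding text and ranking by cosine similarity. The sketch below assumes `nomic-embed-text` has been pulled and uses Ollama's `POST /api/embed` endpoint. `embed`, `cosine`, and `rank` are illustrative helper names.

```python
import json
import math
import urllib.request


def embed(texts: list[str], model: str = "nomic-embed-text") -> list[list[float]]:
    """Return one embedding vector per input text from a local Ollama server."""
    body = json.dumps({"model": model, "input": texts}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/embed",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embeddings"]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec: list[float], doc_vecs: list[list[float]]) -> list[int]:
    """Return document indices ordered from most to least similar to the query."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
```

A typical flow would embed the documents once with `embed(docs)`, embed each query, and pass both to `rank` to pick the passages to feed into the chat model.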

Modelfile: Custom System Prompts

Bake a system prompt into a custom model variant so you don't have to repeat it in every call:

dockerfile
# embedded-expert.Modelfile
FROM qwen2.5-coder:7b

SYSTEM """
You are an expert in embedded systems, ESP32, FreeRTOS, ESP-IDF, and Edge AI.
You answer in the style of a senior firmware engineer with 10+ years of production experience.
Prefer concise, production-focused answers with real code examples in C.
"""

PARAMETER temperature 0.3
PARAMETER num_ctx 8192

bash
ollama create embedded-expert -f embedded-expert.Modelfile
ollama run embedded-expert "How do I handle heap fragmentation in ESP-IDF?"
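Once created, the custom model is addressable by name from code like any other. Values set with `PARAMETER` can still be overridden per request through the `options` field of `/api/chat`. A sketch, with `build_chat_request` and `ask` as hypothetical helpers:

```python
import json
import urllib.request


def build_chat_request(prompt: str, model: str = "embedded-expert", **options) -> dict:
    """Build an /api/chat body; keyword arguments become per-request options."""
    body = {
        "model": model,
        # No system message: the Modelfile already supplies it.
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    if options:
        body["options"] = options  # e.g. temperature=0.0 overrides the Modelfile value
    return body


def ask(prompt: str, **options) -> str:
    """Send the request to a local Ollama server and return the reply text."""
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=json.dumps(build_chat_request(prompt, **options)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]
```

For example, `ask("Explain ESP-IDF heap capabilities.", temperature=0.0)` keeps the baked-in system prompt while overriding the temperature for that one call.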

Key Takeaways

  1. Ollama + OpenAI SDK = near drop-in local replacement: change the base URL and model name, keep the rest of your code
  2. Qwen2.5-Coder can outperform larger general-purpose models on code: specialisation matters
  3. Modelfiles let you bake in system prompts: remove boilerplate from every API call
  4. No discrete GPU required for 7B models: Apple Silicon runs them well via Metal
  5. Your data stays local: the right choice for proprietary code, internal documents, and anything sensitive

Written by Rajath Kumar, Edge AI Engineer & Founder, Analog Data

I build things that run on chips and the software that talks to them. ESP32, STM32, FreeRTOS, FastAPI, TinyML — from bare-metal firmware to cloud backends to on-device inference. Based in Bengaluru. Founder of Analog Data.
