Frequently Asked Questions

What's the difference between RAG and fine-tuning?

**Fine-tuning** updates the model's weights. It teaches the model new patterns permanently, but it is expensive (compute, data, and time), and the knowledge is frozen at training time. **RAG** keeps the model frozen and retrieves relevant documents at query time. It is cheaper, stays as current as your knowledge base, and lets you update that knowledge base without retraining. For most engineering use cases (Q&A over docs, code search), RAG is the right answer.

Which embedding model should I use?

For local/free use: **nomic-embed-text** via Ollama (274 MB, excellent quality). For production with a budget: **text-embedding-3-small** from OpenAI (cheapest, fast, great quality). For multilingual documents: **multilingual-e5-large**. The embedding model matters more than people think: better embeddings mean better retrieval, and better retrieval means better answers.
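One way to compare embedding models on your own documents is a small hit-rate check: write a handful of questions, note which chunk should answer each one, and count how often that chunk lands in the top k. The sketch below is a minimal version under stated assumptions. `hit_rate_at_k`, `keyword_embed`, and `length_embed` are hypothetical names, the sample chunks are made up for illustration, and the two toy embedders only stand in for real models such as nomic-embed-text.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def hit_rate_at_k(embed, chunks, labeled_questions, k=3):
    """Fraction of questions whose expected chunk is among the top-k matches."""
    chunk_vecs = [embed(chunk) for chunk in chunks]
    hits = 0
    for question, expected_idx in labeled_questions:
        q_vec = embed(question)
        ranked = sorted(
            range(len(chunks)),
            key=lambda i: cosine(q_vec, chunk_vecs[i]),
            reverse=True,
        )
        if expected_idx in ranked[:k]:
            hits += 1
    return hits / len(labeled_questions)


# Toy stand-ins for real embedding models (assumptions for illustration only).
VOCAB = ["watchdog", "ota", "mqtt"]


def keyword_embed(text):
    """Counts a few known keywords; plays the role of the better model."""
    lowered = text.lower()
    return [lowered.count(word) for word in VOCAB]


def length_embed(text):
    """Encodes only the text length; plays the role of a poor model."""
    return [len(text)]


# Made-up sample chunks and hand-labeled questions (index of the right chunk).
chunks = [
    "The watchdog timer resets the chip if a task stops responding.",
    "OTA updates write new firmware to a spare partition.",
    "MQTT clients publish messages to topics on a broker.",
]
labeled_questions = [
    ("How do I feed the watchdog?", 0),
    ("Where does an OTA image get written?", 1),
    ("Which MQTT topic should I publish to?", 2),
]

for name, embed in [("keyword", keyword_embed), ("length", length_embed)]:
    score = hit_rate_at_k(embed, chunks, labeled_questions, k=1)
    print(f"{name}: hit rate@1 = {score:.2f}")
```

To compare real models, swap each toy function for a call that returns that model's embedding vector and reuse the same labeled questions.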

How should I chunk my documents?

Chunk size is empirical, so test it. A good starting point is **512 tokens per chunk with a 50-token overlap**. If chunks are too small, individual chunks lack context. If they are too large, each chunk mixes unrelated topics and retrieval quality drops. For code documentation, chunk by function or class. For prose, chunk by paragraph or by a fixed token count.
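Fixed-size chunking with overlap can be sketched in a few lines. This is a minimal illustration, not how any particular library splits text: `chunk_tokens` is a hypothetical helper, and it treats list items as tokens, so a real pipeline would first run the model's tokenizer.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far each new chunk starts after the last
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reached the end of the document
    return chunks


# Stand-in "document" of 1200 tokens (assumption: one list item per token).
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # [512, 512, 276]
```

The last 50 tokens of each chunk repeat as the first 50 of the next, so a sentence cut at a boundary still appears whole in one of them.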

Building a RAG Pipeline for Engineers: From PDF to Answers in 50 Lines

RAG sounds complicated. It's not. Here's how to build a working pipeline that answers questions from your own documents in under 50 lines of Python.

Intermediate

Rajath Kumar, Edge AI Engineer & Founder, Analog Data

2026-06-17 · 7 min read

Retrieval-Augmented Generation is the pattern that makes LLMs actually useful for your specific domain. Instead of asking a general model about your ESP-IDF docs, a chip datasheet, or internal API docs, you give it the relevant pages at query time. It answers from your content, not its training data.

Here's the minimal working implementation.

How RAG Works (in 30 seconds)

```text
Your documents (PDFs, docs, code)
        ↓
   [Chunking]           Split into ~512 token pieces
        ↓
   [Embedding]          Turn each chunk into a vector (numbers)
        ↓
   [Vector store]       Store vectors + original text

At query time:
   User question → embed → find similar chunks → send chunks + question to LLM → answer
```

The LLM never sees your full document library. It only sees the 3–5 most relevant chunks for each question.
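To make the query-time flow concrete, here is a toy version that runs with the standard library alone. It is a sketch under loud assumptions: `embed` is a word-count stand-in for a real embedding model, the "vector store" is a plain list, the sample chunks are made up, and `build_prompt` is a hypothetical helper that only shows what would be sent to the LLM.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a sparse word-count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, store, top_k=3):
    """Return the top_k (score, text) pairs most similar to the question."""
    q_vec = embed(question)
    scored = [(cosine(q_vec, vec), text) for vec, text in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def build_prompt(question, hits):
    """Combine the retrieved chunks and the question into one LLM prompt."""
    context = "\n\n".join(text for _, text in hits)
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"


# Index time: embed each chunk once and keep the vector next to the text.
sample_chunks = [
    "MQTT clients publish messages to topics on a broker.",
    "OTA updates write the new firmware to a spare partition.",
    "The watchdog timer resets the chip if a task stops responding.",
]
store = [(embed(chunk), chunk) for chunk in sample_chunks]

# Query time: embed the question, find similar chunks, build the prompt.
question = "How does the watchdog timer work?"
hits = retrieve(question, store, top_k=2)
print(build_prompt(question, hits))
```

A real pipeline replaces `embed` with an embedding model and the list with a vector store, but the shape of the query path stays the same.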

The Minimal Implementation

```bash
pip install llama-index-core llama-index-llms-ollama llama-index-embeddings-ollama
```

```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding

# 1. Configure local LLM and embeddings (fully offline;
#    assumes both models are already pulled in Ollama)
Settings.llm = Ollama(model="llama3.1", request_timeout=60.0)
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")

# 2. Load documents from a folder
documents = SimpleDirectoryReader("./docs").load_data()
# Supports: PDF, DOCX, TXT, Markdown, HTML, CSV, code files
# (some formats may need extra reader packages)

# 3. Build the index (chunks + embeddings)
index = VectorStoreIndex.from_documents(documents)

# 4. Create a query engine
query_engine = index.as_query_engine(similarity_top_k=4)

# 5. Ask questions
response = query_engine.query(
    "What is the maximum stack size recommended for the MQTT task?"
)
print(response)
```

That's it. 5 lines of real logic (excluding config). The rest is wiring.

Persisting the Index

Re-embedding your documents on every run is wasteful. Persist to disk:

```python
import os
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
    Settings,
)

# Configure Settings.llm and Settings.embed_model as in the previous snippet
# before calling this, so loading and querying use the same embedding model.

INDEX_DIR = "./index_storage"

def get_or_build_index(docs_dir: str) -> VectorStoreIndex:
    if os.path.exists(INDEX_DIR):
        print("Loading existing index...")
        storage_context = StorageContext.from_defaults(persist_dir=INDEX_DIR)
        return load_index_from_storage(storage_context)
    else:
        print("Building index from documents...")
        documents = SimpleDirectoryReader(docs_dir).load_data()
        index = VectorStoreIndex.from_documents(documents)
        index.storage_context.persist(persist_dir=INDEX_DIR)
        return index

index = get_or_build_index("./docs")
query_engine = index.as_query_engine(similarity_top_k=4)
```

First run: embeds and saves. Every subsequent run: loads in seconds.
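One gap in loading whatever is on disk: if the documents change, the saved index goes stale without any warning. A possible guard, sketched here with the standard library only, is to store a fingerprint of the docs folder and rebuild when it no longer matches. `docs_fingerprint`, `index_is_stale`, `save_fingerprint`, and the marker file are hypothetical names for this illustration, not part of LlamaIndex.

```python
import hashlib
import json
from pathlib import Path


def docs_fingerprint(docs_dir):
    """Hash the relative path and bytes of every file under docs_dir."""
    root = Path(docs_dir)
    digest = hashlib.sha256()
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest.update(path.relative_to(root).as_posix().encode("utf-8"))
            digest.update(path.read_bytes())
    return digest.hexdigest()


def index_is_stale(docs_dir, marker_path):
    """True if no fingerprint was saved or the docs changed since it was."""
    marker = Path(marker_path)
    if not marker.exists():
        return True
    saved = json.loads(marker.read_text(encoding="utf-8"))
    return saved.get("fingerprint") != docs_fingerprint(docs_dir)


def save_fingerprint(docs_dir, marker_path):
    """Record the current docs fingerprint after a successful index build."""
    marker = Path(marker_path)
    marker.parent.mkdir(parents=True, exist_ok=True)
    payload = {"fingerprint": docs_fingerprint(docs_dir)}
    marker.write_text(json.dumps(payload), encoding="utf-8")
```

In `get_or_build_index`, you could rebuild whenever `index_is_stale(docs_dir, marker_path)` returns true and call `save_fingerprint` right after persisting. Keep the marker file outside the docs folder so it does not change the fingerprint itself.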

Streaming + Source References

For production tools, stream the response and show which documents were used:

```python
# Assumes `index` was built or loaded as shown earlier.
query_engine = index.as_query_engine(
    similarity_top_k=4,
    streaming=True,
    response_mode="compact",   # or "tree_summarize" for long answers
)

response = query_engine.query("Explain the device shadow update flow in AWS IoT Core.")

# Stream tokens as they arrive
for token in response.response_gen:
    print(token, end="", flush=True)
print()

# Show source nodes (which chunks were used)
print("\n--- Sources ---")
for node in response.source_nodes:
    # score can be missing, so avoid formatting None as a float
    score = f"{node.score:.2f}" if node.score is not None else "n/a"
    print(f"  [{score}] {node.metadata.get('file_name', 'unknown')}")
    print(f"       ...{node.text[:100]}...")
```
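When several retrieved chunks come from the same file, the source list gets noisy. The sketch below collapses sources to one line per file and keeps the best score. It is an illustration only: `summarize_sources` is a hypothetical helper, and it works on plain dicts that you would fill from each node's score and `file_name` metadata; the sample values are made up.

```python
def summarize_sources(sources):
    """Collapse retrieved chunks into one entry per file, keeping the best score."""
    best = {}
    for src in sources:
        name = src.get("file_name", "unknown")
        score = src.get("score")
        if name not in best:
            best[name] = score
        elif score is not None and (best[name] is None or score > best[name]):
            best[name] = score

    # Highest score first; files without a score go last.
    ordered = sorted(
        best.items(),
        key=lambda item: (item[1] is None, -(item[1] or 0.0)),
    )

    lines = []
    for name, score in ordered:
        label = "n/a" if score is None else f"{score:.2f}"
        lines.append(f"[{label}] {name}")
    return lines


# Made-up retrieval results standing in for response.source_nodes.
sources = [
    {"file_name": "mqtt.pdf", "score": 0.71},
    {"file_name": "ota.pdf", "score": 0.64},
    {"file_name": "mqtt.pdf", "score": 0.83},
    {"file_name": "notes.txt", "score": None},
]
for line in summarize_sources(sources):
    print(line)
```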

Practical Use Cases for Engineers

```python
# Index your entire project documentation
docs_dir = "./esp-idf-docs"       # ESP-IDF API reference PDFs
                                   # or your company's internal docs
# Build `query_engine` from docs_dir as in the earlier snippets.

# Ask implementation questions
queries = [
    "How do I configure the watchdog timer in ESP-IDF?",
    "What's the correct way to use xQueueSendFromISR?",
    "List all the partition table options for OTA updates",
    "How do I enable PSRAM on ESP32-S3?",
]

for q in queries:
    print(f"Q: {q}")
    print(f"A: {query_engine.query(q)}\n")
```

This pattern works for:

- ESP-IDF docs: query the full API reference without scrolling
- Datasheets: ask register-level questions directly
- Internal runbooks: instant answers from your team's documentation
- Codebase Q&A: index your C/Python source files and ask about implementation

Key Takeaways

- RAG beats fine-tuning for most engineering use cases: it is cheaper, stays as current as your documents, and needs no training GPU
- nomic-embed-text via Ollama: a strong fully local embedding model at 274 MB
- Persist your index: never re-embed the same documents twice
- Show source nodes: engineers don't trust black-box answers, and citations build trust
- Chunk size matters: 512 tokens with a 50-token overlap is a safe starting point
