Building a RAG Pipeline for Engineers: From PDF to Answers in 50 Lines
RAG sounds complicated. It's not. Here's how to build a working pipeline that answers questions from your own documents in under 50 lines of Python.
Retrieval-Augmented Generation is the pattern that makes LLMs actually useful for your specific domain. Instead of asking a general model about your ESP-IDF datasheet or internal API docs, you give it the relevant pages at query time. It answers from your content, not its training data.
Here's the minimal working implementation.
How RAG Works (in 30 seconds)
    Your documents (PDFs, docs, code)
      ↓
    [Chunking]      Split into ~512 token pieces
      ↓
    [Embedding]     Turn each chunk into a vector (numbers)
      ↓
    [Vector store]  Store vectors + original text

    At query time:
      User question → embed → find similar chunks → send chunks + question to LLM → answer

The LLM never sees your full document library. It only sees the 3–5 most relevant chunks for each question.
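The query-time half of that flow can be sketched with the standard library alone. This is a toy, not what the real pipeline does: the "embedding" here is just a word count, where a real embedding model returns dense float vectors, and the three chunks are made-up placeholders. The names `embed`, `cosine`, and `top_k` are hypothetical helpers for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts. A real model returns dense floats."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(question: str, chunks: list[str], k: int = 4) -> list[tuple[float, str]]:
    """Score every chunk against the question and keep the k best."""
    q_vec = embed(question)
    scored = [(cosine(q_vec, embed(chunk)), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Hypothetical chunks standing in for pieces of real documentation.
chunks = [
    "The MQTT task stack size should be increased when TLS is enabled.",
    "The watchdog timer resets the chip if a task stops feeding it.",
    "OTA updates require two application partitions in the partition table.",
]

results = top_k("What stack size does the MQTT task need?", chunks, k=2)
for score, chunk in results:
    print(f"[{score:.2f}] {chunk}")
```

The shape is the same as in the real thing: embed the question, score it against every stored chunk, keep the best few. A vector store does the scoring step faster and at scale.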
The Minimal Implementation
Install the packages and pull the models:

    pip install llama-index-core llama-index-llms-ollama llama-index-embeddings-ollama
    # PDF and DOCX parsing may also need: pip install llama-index-readers-file
    # One-time model downloads; everything runs offline afterwards
    ollama pull llama3.1
    ollama pull nomic-embed-text

Then the script:

    from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
    from llama_index.llms.ollama import Ollama
    from llama_index.embeddings.ollama import OllamaEmbedding

    # 1. Configure local LLM and embeddings (fully offline)
    Settings.llm = Ollama(model="llama3.1", request_timeout=60.0)
    Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")

    # 2. Load documents from a folder
    documents = SimpleDirectoryReader("./docs").load_data()
    # Supports: PDF, DOCX, TXT, Markdown, HTML, CSV, code files

    # 3. Build the index (chunks + embeddings)
    index = VectorStoreIndex.from_documents(documents)

    # 4. Create a query engine
    query_engine = index.as_query_engine(similarity_top_k=4)

    # 5. Ask questions
    response = query_engine.query(
        "What is the maximum stack size recommended for the MQTT task?"
    )
    print(response)

That's it. 5 lines of real logic (excluding config). The rest is wiring.
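Part of that wiring is prompt assembly: the retrieved chunks and the question get combined into one prompt before the LLM is called. Here is a minimal sketch of the idea. The wording of the template and the `build_prompt` helper are assumptions for illustration, not LlamaIndex's actual prompt.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Combine retrieved chunks and the question into one grounded prompt."""
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is not sufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical retrieved chunks; in the real pipeline these come from the index.
retrieved = [
    "Chunk one of the retrieved documentation.",
    "Chunk two of the retrieved documentation.",
]
prompt = build_prompt("What does chunk two say?", retrieved)
print(prompt)
```

Numbering the chunks makes it easy to ask the model to cite which one it used, which pairs well with showing sources to the user.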
Persisting the Index
Re-embedding your documents on every run is wasteful. Persist to disk:
    import os
    from llama_index.core import (
        VectorStoreIndex,
        SimpleDirectoryReader,
        StorageContext,
        load_index_from_storage,
        Settings,  # set Settings.llm / Settings.embed_model as above before loading
    )

    INDEX_DIR = "./index_storage"

    def get_or_build_index(docs_dir: str) -> VectorStoreIndex:
        if os.path.exists(INDEX_DIR):
            print("Loading existing index...")
            storage_context = StorageContext.from_defaults(persist_dir=INDEX_DIR)
            return load_index_from_storage(storage_context)
        else:
            print("Building index from documents...")
            documents = SimpleDirectoryReader(docs_dir).load_data()
            index = VectorStoreIndex.from_documents(documents)
            index.storage_context.persist(persist_dir=INDEX_DIR)
            return index

    index = get_or_build_index("./docs")
    query_engine = index.as_query_engine(similarity_top_k=4)

First run: embeds and saves. Every subsequent run: loads in seconds.
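One gap in the load-if-exists check is that it never notices when the documents themselves change. A possible fix, sketched here with the standard library only, is to store a fingerprint of the docs folder next to the index and rebuild when it no longer matches. The helper names and the `docs_fingerprint.json` file name are hypothetical, and the sketch assumes the index directory lives outside the docs directory.

```python
import hashlib
import json
from pathlib import Path


def docs_fingerprint(docs_dir: str) -> str:
    """Hash relative paths, sizes, and modification times of all files."""
    digest = hashlib.sha256()
    root = Path(docs_dir)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        stat = path.stat()
        entry = f"{path.relative_to(root).as_posix()}|{stat.st_size}|{stat.st_mtime_ns}"
        digest.update(entry.encode("utf-8"))
    return digest.hexdigest()


def index_is_stale(docs_dir: str, index_dir: str) -> bool:
    """True when no fingerprint is stored or the documents changed since."""
    marker = Path(index_dir) / "docs_fingerprint.json"  # hypothetical file name
    if not marker.exists():
        return True
    stored = json.loads(marker.read_text(encoding="utf-8"))
    return stored.get("fingerprint") != docs_fingerprint(docs_dir)


def save_fingerprint(docs_dir: str, index_dir: str) -> None:
    """Record the current fingerprint next to the persisted index."""
    Path(index_dir).mkdir(parents=True, exist_ok=True)
    marker = Path(index_dir) / "docs_fingerprint.json"
    payload = {"fingerprint": docs_fingerprint(docs_dir)}
    marker.write_text(json.dumps(payload), encoding="utf-8")
```

In `get_or_build_index`, the existence check could become "index exists and `index_is_stale(...)` is false", with `save_fingerprint(...)` called right after `persist(...)`. This rebuilds everything on any change, which is crude but keeps the logic small.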
Streaming + Source References
For production tools, stream the response and show which documents were used:
    query_engine = index.as_query_engine(
        similarity_top_k=4,
        streaming=True,
        response_mode="compact",  # or "tree_summarize" for long answers
    )

    response = query_engine.query("Explain the device shadow update flow in AWS IoT Core.")

    # Stream tokens
    for token in response.response_gen:
        print(token, end="", flush=True)
    print()

    # Show source nodes (which chunks were used)
    print("\n--- Sources ---")
    for node in response.source_nodes:
        print(f"  [{node.score:.2f}] {node.metadata.get('file_name', 'unknown')}")
        print(f"  ...{node.text[:100]}...")

Practical Use Cases for Engineers
    # Index your entire project documentation
    docs_dir = "./esp-idf-docs"  # ESP-IDF API reference PDFs,
                                 # or your company's internal docs
    documents = SimpleDirectoryReader(docs_dir).load_data()
    index = VectorStoreIndex.from_documents(documents)
    query_engine = index.as_query_engine(similarity_top_k=4)

    # Ask implementation questions
    queries = [
        "How do I configure the watchdog timer in ESP-IDF?",
        "What's the correct way to use xQueueSendFromISR?",
        "List all the partition table options for OTA updates",
        "How do I enable PSRAM on ESP32-S3?",
    ]

    for q in queries:
        print(f"Q: {q}")
        print(f"A: {query_engine.query(q)}\n")

This pattern works for:
- ESP-IDF docs — query the full API reference without scrolling
- Datasheets — ask register-level questions directly
- Internal runbooks — instant answers from your team's documentation
- Codebase Q&A — index your C/Python source files and ask about implementation
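For the codebase Q&A case, it usually helps to control which files get indexed so that build output and vendored code don't pollute retrieval. A small sketch, assuming a typical firmware repo layout: the extension set and the skipped directory names are guesses to adjust per project, and `collect_source_files` is a hypothetical helper.

```python
from pathlib import Path

# Assumed defaults for illustration; adjust to the project at hand.
SOURCE_EXTS = {".c", ".h", ".py", ".md"}
SKIP_DIRS = {"build", ".git", "__pycache__", "managed_components"}


def collect_source_files(root: str) -> list[str]:
    """Return sorted paths of source files, skipping build and VCS folders."""
    root_path = Path(root)
    files = []
    for path in root_path.rglob("*"):
        if not path.is_file() or path.suffix.lower() not in SOURCE_EXTS:
            continue
        # Skip anything that lives under an excluded directory.
        if SKIP_DIRS.intersection(path.relative_to(root_path).parts[:-1]):
            continue
        files.append(str(path))
    return sorted(files)
```

The resulting list can be handed to `SimpleDirectoryReader(input_files=...)` instead of a directory path, so only the files you chose end up in the index.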
Key Takeaways
- RAG > fine-tuning for most engineering use cases: cheaper, no GPU needed for training, and updating the knowledge is just a re-index
- nomic-embed-text via Ollama: a strong fully-local embedding model at about 274 MB
- Persist your index: never re-embed the same documents twice
- Show source nodes: engineers don't trust black-box answers; citations build trust
- Chunk size matters: 512 tokens with 50-token overlap is a safe starting point, but the snippets above use LlamaIndex's defaults unless you set Settings.chunk_size and Settings.chunk_overlap
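To make the chunk size and overlap point concrete, here is a minimal sliding-window splitter. It counts whitespace-separated words as a rough stand-in for tokens, which is an assumption: real splitters use a tokenizer and try to respect sentence boundaries. `chunk_words` is a hypothetical helper, not a library function.

```python
def chunk_words(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into overlapping windows, counting words as a stand-in for tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Small numbers so the overlap is easy to see.
sample = " ".join(f"w{i}" for i in range(10))
pieces = chunk_words(sample, chunk_size=4, overlap=1)
for piece in pieces:
    print(piece)
```

Each window starts `chunk_size - overlap` words after the previous one, so the last word of one chunk is repeated at the start of the next. That repetition is what keeps a sentence split across a boundary retrievable from either side.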
I build things that run on chips and the software that talks to them. ESP32, STM32, FreeRTOS, FastAPI, TinyML — from bare-metal firmware to cloud backends to on-device inference. Based in Bengaluru. Founder of Analog Data.