Advanced RAG: Hybrid Search, Reranking, and Citation for Production
Basic RAG retrieves and generates. Production RAG uses hybrid search, cross-encoder reranking, and grounded citations. Here's how to build the latter.
Why Basic RAG Fails in Production
You followed the tutorial: embed documents, store in a vector DB, retrieve top-k, generate. It works on the demo. Then you put it in production and:
- Retrieval misses obvious matches because the user used different terminology
- Results are irrelevant because cosine similarity ≠ relevance
- The LLM hallucinates citations that don't exist in the source
- Latency is terrible because you're reranking 100 documents
Production RAG needs three things basic RAG doesn't have:
- Hybrid search — combine keyword + semantic retrieval
- Cross-encoder reranking — actually measure relevance, not just similarity
- Grounded citations — force the LLM to cite real chunks
1. Hybrid Search: BM25 + Vector Search
Keyword search (BM25) catches exact term matches. Vector search catches semantic matches. Together they cover queries that either one alone would miss.
```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np


def min_max(scores):
    """Scale scores to [0, 1] so BM25 and cosine values are comparable."""
    return (scores - scores.min()) / (scores.max() - scores.min() + 1e-9)


class HybridRetriever:
    def __init__(self, documents, embed_model="all-MiniLM-L6-v2"):
        self.documents = documents
        # Lowercase so "BM25" and "bm25" count as the same keyword
        self.bm25 = BM25Okapi([doc.lower().split() for doc in documents])
        self.encoder = SentenceTransformer(embed_model)
        self.embeddings = self.encoder.encode(documents, normalize_embeddings=True)

    def search(self, query, top_k=20, alpha=0.5):
        # BM25 scores, scaled to [0, 1]
        bm25_scores = min_max(self.bm25.get_scores(query.lower().split()))

        # Vector scores: embeddings are unit-length, so the dot product is
        # cosine similarity; scale it to [0, 1] to match the BM25 range
        query_emb = self.encoder.encode([query], normalize_embeddings=True)
        vector_scores = min_max((self.embeddings @ query_emb.T).flatten())

        # Combine
        hybrid_scores = alpha * bm25_scores + (1 - alpha) * vector_scores
        top_indices = np.argsort(hybrid_scores)[::-1][:top_k]

        return [(self.documents[i], float(hybrid_scores[i])) for i in top_indices]
```
The alpha parameter controls the blend:
- alpha=1.0 → pure BM25 (keyword matching)
- alpha=0.0 → pure vector search (semantic matching)
- alpha=0.5 → balanced (default for most use cases)
Tune it per use case. Code documentation? Lean toward BM25 (alpha=0.7). Natural language Q&A? Lean toward vector (alpha=0.3).
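One way to pick alpha is to sweep it over a small labeled set and keep the value with the best recall@k. The sketch below assumes you already have per-query BM25 and vector scores, both scaled to [0, 1], plus the index of the one relevant document for each query. The names `recall_at_k` and `sweep_alpha` and the toy scores are illustrative, not part of any library.

```python
import numpy as np


def recall_at_k(bm25, vec, relevant, alpha, k):
    """Fraction of queries whose relevant doc lands in the top k.

    bm25, vec: arrays of shape (n_queries, n_docs), each scaled to [0, 1].
    relevant:  array of shape (n_queries,) holding the relevant doc index.
    """
    hybrid = alpha * bm25 + (1 - alpha) * vec
    # Indices of the k highest-scoring docs for each query
    top_k = np.argsort(-hybrid, axis=1)[:, :k]
    hits = (top_k == relevant[:, None]).any(axis=1)
    return float(hits.mean())


def sweep_alpha(bm25, vec, relevant, k=1, alphas=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Return (best_alpha, {alpha: recall}) over a small grid."""
    results = {a: recall_at_k(bm25, vec, relevant, a, k) for a in alphas}
    best = max(results, key=results.get)
    return best, results


# Toy data (made up): 2 queries, 3 docs
bm25 = np.array([[1.0, 0.0, 0.2],
                 [0.0, 1.0, 0.5]])
vec = np.array([[0.2, 0.9, 0.1],
                [0.1, 0.3, 1.0]])
relevant = np.array([0, 2])

best_alpha, table = sweep_alpha(bm25, vec, relevant, k=1)
```

On this toy data the first query needs the keyword signal and the second needs the semantic one, so only the balanced setting (alpha=0.5) retrieves both. A real sweep needs enough labeled queries that the winner is not an accident of a few examples.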
2. Cross-Encoder Reranking
Retrieval gives you candidates. Reranking gives you the right candidates.
A bi-encoder (what you used for retrieval) encodes query and document separately — fast but coarse. A cross-encoder encodes them together — slow but precise.
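The retrieve-then-rerank split can be sketched without any model at all: a cheap scorer ranks the whole corpus, and an expensive scorer is called only on the survivors. In this sketch `two_stage`, `cheap_score`, and `expensive_score` are hypothetical stand-ins (word overlap and a phrase-match bonus), not the bi-encoder or cross-encoder themselves.

```python
def two_stage(query, docs, cheap_score, expensive_score, k_retrieve=3, k_final=1):
    """Rank all docs with cheap_score, then re-score only the top k_retrieve."""
    # Stage 1: coarse ranking over the whole corpus
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:k_retrieve]
    # Stage 2: precise ranking over the shortlist only
    reranked = sorted(candidates, key=lambda d: expensive_score(query, d), reverse=True)
    return reranked[:k_final]


def cheap_score(query, doc):
    """Stand-in for a bi-encoder: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


calls = {"expensive": 0}


def expensive_score(query, doc):
    """Stand-in for a cross-encoder: rewards the exact query phrase."""
    calls["expensive"] += 1
    return 10 * (query.lower() in doc.lower()) + cheap_score(query, doc)


docs = [
    "reset the device password from the settings menu",
    "the reset email for your password can take a few minutes",
    "device firmware updates are released monthly",
    "how to do a password reset on a locked device",
    "billing questions and refunds",
]

best = two_stage("password reset", docs, cheap_score, expensive_score)
```

Three documents tie on word overlap, and only the second stage separates the one that contains the phrase itself. The expensive scorer runs 3 times rather than 5; with a real cross-encoder, that shortlist size is the main lever on latency.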
```python
from sentence_transformers import CrossEncoder


class Reranker:
    def __init__(self, model="cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.model = CrossEncoder(model)

    def rerank(self, query, documents, top_k=5):
        # documents is a list of (text, retrieval_score) pairs from the retriever
        texts = [doc for doc, _ in documents]
        scores = self.model.predict([(query, text) for text in texts])
        # Return (text, rerank_score) pairs so downstream code can unpack them directly
        ranked = sorted(zip(texts, map(float, scores)), key=lambda x: x[1], reverse=True)
        return ranked[:top_k]
```
The full pipeline:
```python
retriever = HybridRetriever(documents)
reranker = Reranker()

# 1. Retrieve top 20 candidates (fast, coarse)
candidates = retriever.search(user_query, top_k=20)

# 2. Rerank to get top 5 (slow, precise)
top_results = reranker.rerank(user_query, candidates, top_k=5)

# 3. Generate with only the best chunks
# (llm and prompt_with_context are placeholders for your own client and prompt template)
context = "\n\n".join(doc for doc, score in top_results)
response = llm.generate(prompt_with_context(user_query, context))
```
3. Grounded Citations
The problem: LLMs hallucinate. If you ask for citations, they'll make up page numbers, URLs, and quotes.
The solution: prefix each chunk with a citation index before generation, then parse the indices from the output and discard any that don't map to a real chunk.
```python
import re


def build_prompt_with_citations(query, retrieved_chunks):
    # Prefix each chunk with a 1-based index the model can cite as [N]
    context_parts = []
    for i, (chunk, score) in enumerate(retrieved_chunks, start=1):
        context_parts.append(f"[{i}] {chunk}")
    context = "\n\n".join(context_parts)

    return f"""Answer the question based on the following sources.
Cite sources using [N] notation. Only cite sources that exist below.

Sources:
{context}

Question: {query}

Answer:"""


def extract_citations(text, max_citations):
    # Keep only indices that map to a chunk we actually supplied
    citations = re.findall(r'\[(\d+)\]', text)
    valid = {int(c) for c in citations if 1 <= int(c) <= max_citations}
    return sorted(valid)
```
Why this works: the prompt tells the LLM to cite only indices [1] through [N], and the parser enforces that rule afterward. If the model generates [7] but you only gave it 5 chunks, extract_citations leaves it out of the returned list. The [7] marker itself still appears in the answer text, so remove or flag it before showing the answer to a user.
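Because extract_citations only filters the list it returns, an out-of-range marker such as [7] remains in the generated answer. A minimal sketch of cleaning the text before display follows; `strip_invalid_citations` is a hypothetical helper, not part of the code above, and it assumes the same [N] notation.

```python
import re


def strip_invalid_citations(text, max_citations):
    """Remove [N] markers whose N falls outside 1..max_citations."""
    def keep_or_drop(match):
        n = int(match.group(1))
        return match.group(0) if 1 <= n <= max_citations else ""

    cleaned = re.sub(r"\[(\d+)\]", keep_or_drop, text)
    # Collapse doubled spaces and any space left before punctuation
    cleaned = re.sub(r" {2,}", " ", cleaned)
    return re.sub(r" ([.,;:])", r"\1", cleaned).strip()


answer = "BM25 handles exact terms [1]. Rerankers add precision [7]. Both help [2][9]."
cleaned = strip_invalid_citations(answer, max_citations=5)
```

Dropping the marker keeps the sentence but leaves its claim unsupported. Depending on the application, it may be safer to flag such sentences for review than to display them as if they were grounded.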
4. Putting It All Together
```python
class ProductionRAG:
    def __init__(self, documents):
        self.retriever = HybridRetriever(documents)
        self.reranker = Reranker()
        self.llm = YourLLM()

    def answer(self, query):
        # 1. Hybrid retrieval
        candidates = self.retriever.search(query, top_k=20)

        # 2. Rerank
        top_chunks = self.reranker.rerank(query, candidates, top_k=5)

        # 3. Generate with citations
        prompt = build_prompt_with_citations(query, top_chunks)
        response = self.llm.generate(prompt)

        # 4. Validate citations
        valid_citations = extract_citations(response, len(top_chunks))

        return {
            "answer": response,
            "citations": [top_chunks[i - 1] for i in valid_citations],
            "sources": [{"index": i + 1, "text": chunk, "score": score}
                        for i, (chunk, score) in enumerate(top_chunks)],
        }
```
Summary
| Component | Purpose | Tool |
|---|---|---|
| Hybrid Search | Retrieve candidates using both keywords and semantics | BM25 + sentence-transformers |
| Cross-Encoder Reranking | Re-score candidates for actual relevance | cross-encoder/ms-marco-* |
| Grounded Citations | Prevent hallucinated sources | Index-based citation parsing |
| Alpha tuning | Control keyword vs semantic balance | 0.5 default, 0.7 for code, 0.3 for prose |
Basic RAG retrieves and generates. Production RAG retrieves, reranks, generates, and cites. The difference is night and day.
I build things that run on chips and the software that talks to them. ESP32, STM32, FreeRTOS, FastAPI, TinyML — from bare-metal firmware to cloud backends to on-device inference. Based in Bengaluru. Founder of Analog Data.