Advanced RAG: Hybrid Search, Reranking, and Citation for Production

Basic RAG retrieves and generates. Production RAG uses hybrid search, cross-encoder reranking, and grounded citations. Here's how to build the latter.

Rajath Kumar, Edge AI Engineer & Founder, Analog Data
2026-06-27 · 14 min read

Why Basic RAG Fails in Production

You followed the tutorial: embed documents, store in a vector DB, retrieve top-k, generate. It works on the demo. Then you put it in production and:

  • Retrieval misses obvious matches because the user used different terminology
  • Results are irrelevant because cosine similarity ≠ relevance
  • The LLM hallucinates citations that don't exist in the source
  • Latency is terrible because you're reranking 100 documents

Production RAG needs three things basic RAG doesn't have:

  1. Hybrid search — combine keyword + semantic retrieval
  2. Cross-encoder reranking — actually measure relevance, not just similarity
  3. Grounded citations — force the LLM to cite real chunks

1. Hybrid Search

Keyword search (BM25) catches exact term matches. Vector search catches semantic matches such as paraphrases and synonyms. Combining them recovers results that either method alone would miss.

python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np


def min_max(scores):
    # Scale scores to [0, 1] so BM25 and cosine scores are comparable
    return (scores - scores.min()) / (scores.max() - scores.min() + 1e-9)


class HybridRetriever:
    def __init__(self, documents, embed_model="all-MiniLM-L6-v2"):
        self.documents = documents
        # Whitespace tokenization, lowercased so query and document casing match
        self.bm25 = BM25Okapi([doc.lower().split() for doc in documents])
        self.encoder = SentenceTransformer(embed_model)
        # Unit-length embeddings, so a dot product equals cosine similarity
        self.embeddings = self.encoder.encode(documents, normalize_embeddings=True)

    def search(self, query, top_k=20, alpha=0.5):
        # BM25 scores, min-max normalized
        bm25_scores = min_max(self.bm25.get_scores(query.lower().split()))

        # Vector scores (cosine similarity), normalized to the same range
        query_emb = self.encoder.encode([query], normalize_embeddings=True)
        vector_scores = min_max((self.embeddings @ query_emb.T).flatten())

        # Combine: alpha weights BM25, (1 - alpha) weights vector similarity
        hybrid_scores = alpha * bm25_scores + (1 - alpha) * vector_scores
        top_indices = np.argsort(hybrid_scores)[::-1][:top_k]

        # Return (document, score) pairs, best first
        return [(self.documents[i], float(hybrid_scores[i])) for i in top_indices]

The alpha parameter controls the blend:

  • alpha=1.0 → pure BM25 (keyword matching)
  • alpha=0.0 → pure vector search (semantic matching)
  • alpha=0.5 → balanced (default for most use cases)

Tune it per use case. Code documentation? Lean toward BM25 (alpha=0.7). Natural language Q&A? Lean toward vector (alpha=0.3).
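To see how alpha changes the ordering, here is a minimal sketch using made-up scores for three hypothetical documents. The numbers and the `min_max` and `blend` helper names are illustrative assumptions, not output from a real retriever. Both score lists are scaled to [0, 1] before blending so that neither dominates purely because of its scale.

```python
def min_max(scores):
    """Scale a list of scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def blend(bm25_scores, vector_scores, alpha):
    """Return document indices ranked by alpha * bm25 + (1 - alpha) * vector."""
    bm25 = min_max(bm25_scores)
    vec = min_max(vector_scores)
    hybrid = [alpha * b + (1 - alpha) * v for b, v in zip(bm25, vec)]
    return sorted(range(len(hybrid)), key=lambda i: hybrid[i], reverse=True)


# Made-up scores: doc 0 is a strong keyword match, doc 2 a strong semantic match.
bm25_scores = [9.0, 3.0, 0.0]
vector_scores = [0.20, 0.50, 0.80]

keyword_heavy = blend(bm25_scores, vector_scores, alpha=0.7)
semantic_heavy = blend(bm25_scores, vector_scores, alpha=0.3)
print(keyword_heavy)   # doc 0 ranks first
print(semantic_heavy)  # doc 2 ranks first
```

With these toy numbers the top result flips between the two settings, which is the behaviour to look for when tuning alpha against a set of real queries with known good answers.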

2. Cross-Encoder Reranking

Retrieval gives you candidates. Reranking gives you the right candidates.

A bi-encoder (what you used for retrieval) encodes query and document separately — fast but coarse. A cross-encoder encodes them together — slow but precise.

python
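from sentence_transformers import CrossEncoder


class Reranker:
    def __init__(self, model="cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.model = CrossEncoder(model)

    def rerank(self, query, documents, top_k=5):
        # documents is a list of (text, hybrid_score) pairs from the retriever
        texts = [doc for doc, _ in documents]
        # Score each (query, text) pair jointly with the cross-encoder
        scores = self.model.predict([(query, text) for text in texts])
        # Pair each text with its rerank score so callers can unpack (text, score)
        ranked = sorted(zip(texts, scores), key=lambda x: x[1], reverse=True)
        return [(text, float(score)) for text, score in ranked[:top_k]]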

The full pipeline:

python
# Placeholders supplied by your application:
# documents, user_query, llm, and prompt_with_context
retriever = HybridRetriever(documents)
reranker = Reranker()

# 1. Retrieve top 20 candidates (fast, coarse)
candidates = retriever.search(user_query, top_k=20)

# 2. Rerank to get top 5 (slow, precise)
top_results = reranker.rerank(user_query, candidates, top_k=5)

# 3. Generate with only the best chunks
context = "\n\n".join([doc for doc, score in top_results])
response = llm.generate(prompt_with_context(user_query, context))
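The pipeline calls `prompt_with_context` without defining it. One possible shape is sketched below as a hypothetical helper; the instruction wording is an assumption, not the author's prompt, and any template that places the retrieved context ahead of the question would serve the same role.

```python
def prompt_with_context(query: str, context: str) -> str:
    """Hypothetical helper: wrap retrieved context and the user question in one prompt."""
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )


prompt = prompt_with_context(
    "What does alpha control?",
    "The alpha parameter controls the blend between BM25 and vector scores.",
)
print(prompt)
```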

3. Grounded Citations

The problem: LLMs hallucinate. If you ask for citations, they'll make up page numbers, URLs, and quotes.

The solution: prefix each chunk with a citation index before generation, then parse those indices from the output and discard any that don't match a chunk you supplied.

python
import re


def build_prompt_with_citations(query, retrieved_chunks):
    # Prefix each chunk with a 1-based index the model can cite as [N]
    context_parts = []
    for i, (chunk, score) in enumerate(retrieved_chunks):
        context_parts.append(f"[{i+1}] {chunk}")
    context = "\n\n".join(context_parts)

    return f"""Answer the question based on the following sources.
Cite sources using [N] notation. Only cite sources that exist below.

Sources:
{context}

Question: {query}

Answer:"""


def extract_citations(text, max_citations):
    # Find every [N] marker in the model output
    citations = re.findall(r'\[(\d+)\]', text)
    # Keep only indices that map to a chunk we actually supplied
    valid = [int(c) for c in citations if 1 <= int(c) <= max_citations]
    # De-duplicate while preserving order of first appearance
    return list(dict.fromkeys(valid))

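Why this works: the only valid citation targets are the indices [1] through [N] that you provided. If the LLM generates [7] but you only gave it 5 chunks, extract_citations drops it from the returned citation list. The answer text itself is left unchanged, and an in-range citation is not proof that the chunk supports the claim, only that the chunk exists.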
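As a quick check of that filtering behaviour, the sketch below runs a self-contained variant of extract_citations (this one returns the indices sorted) on a made-up model response that cites an out-of-range source. The `strip_invalid_citations` helper is a hypothetical addition, not part of the article's code, showing one way to also remove the invalid marker from the answer text.

```python
import re


def extract_citations(text, max_citations):
    """Return the sorted, de-duplicated citation indices that fall in 1..max_citations."""
    found = {int(c) for c in re.findall(r"\[(\d+)\]", text)}
    return sorted(c for c in found if 1 <= c <= max_citations)


def strip_invalid_citations(text, max_citations):
    """Hypothetical helper: delete [N] markers whose index is out of range."""
    def keep_or_drop(match):
        in_range = 1 <= int(match.group(1)) <= max_citations
        # Keep the marker (with its leading whitespace) or remove both
        return match.group(0) if in_range else ""
    return re.sub(r"\s*\[(\d+)\]", keep_or_drop, text)


# Made-up response: only 5 chunks were supplied, so [7] cannot be a real source.
response = "BM25 handles exact terms [1]. Rerankers improve precision [3][7]."

valid = extract_citations(response, max_citations=5)
cleaned = strip_invalid_citations(response, max_citations=5)
print(valid)
print(cleaned)
```

Whether to strip the marker silently, flag the answer, or regenerate is a design choice that depends on how much a wrong citation costs in your application.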

4. Putting It All Together

python
class ProductionRAG:
    def __init__(self, documents):
        self.retriever = HybridRetriever(documents)
        self.reranker = Reranker()
        # YourLLM is a placeholder for any client exposing generate(prompt) -> str
        self.llm = YourLLM()

    def answer(self, query):
        # 1. Hybrid retrieval
        candidates = self.retriever.search(query, top_k=20)

        # 2. Rerank
        top_chunks = self.reranker.rerank(query, candidates, top_k=5)

        # 3. Generate with citations
        prompt = build_prompt_with_citations(query, top_chunks)
        response = self.llm.generate(prompt)

        # 4. Validate citations (indices are 1-based, matching the prompt)
        valid_citations = extract_citations(response, len(top_chunks))

        return {
            "answer": response,
            "citations": [top_chunks[i-1] for i in valid_citations],
            "sources": [{"index": i+1, "text": chunk, "score": score}
                       for i, (chunk, score) in enumerate(top_chunks)],
        }

Summary

| Component | Purpose | Tool |
| --- | --- | --- |
| Hybrid Search | Retrieve candidates using both keywords and semantics | BM25 + sentence-transformers |
| Cross-Encoder Reranking | Re-score candidates for actual relevance | cross-encoder/ms-marco-* |
| Grounded Citations | Prevent hallucinated sources | Index-based citation parsing |
| Alpha tuning | Control keyword vs semantic balance | 0.5 default, 0.7 for code, 0.3 for prose |

Basic RAG retrieves and generates. Production RAG retrieves, reranks, generates, and cites. The difference is night and day.


Written by Rajath Kumar, Edge AI Engineer & Founder, Analog Data

I build things that run on chips and the software that talks to them. ESP32, STM32, FreeRTOS, FastAPI, TinyML: from bare-metal firmware to cloud backends to on-device inference. Based in Bengaluru. Founder of Analog Data.