What's the difference between a cross-encoder and a bi-encoder?

A bi-encoder encodes the query and documents separately, producing embeddings you can compare with cosine similarity. A cross-encoder takes the query and document together as input and outputs a relevance score — slower but much more accurate. Use bi-encoders for retrieval, cross-encoders for reranking.

How many documents should I rerank?

Rerank the top 20-50 documents from your retrieval step. Cross-encoders are expensive — reranking 1000 documents would be too slow for real-time use. The retrieval step narrows the field; reranking refines the top candidates.
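The cost argument is simple arithmetic: a cross-encoder runs one forward pass per (query, document) pair, so latency grows roughly linearly with the number of candidates. A rough sketch of that estimate follows. The `batch_size` and `ms_per_batch` defaults are hypothetical placeholders, not benchmarks; replace them with numbers measured on your own hardware and model.

```python
import math

def estimate_rerank_latency_ms(n_candidates, batch_size=32, ms_per_batch=40.0):
    """Rough estimate: number of batches times time per batch.

    batch_size and ms_per_batch are hypothetical placeholders, not measurements.
    """
    if n_candidates <= 0:
        return 0.0
    n_batches = math.ceil(n_candidates / batch_size)
    return n_batches * ms_per_batch

# Compare a typical rerank window with a very large one
for n in (20, 50, 1000):
    print(n, estimate_rerank_latency_ms(n))
```

With these placeholder numbers, 1000 candidates cost 32 batches against 1 or 2 for a 20-50 candidate window, which is the gap the retrieval step is there to close.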

Should I use BM25 or vector search?

Both. BM25 excels at keyword matching (exact terms, IDs, code snippets). Vector search excels at semantic matching (paraphrases, concepts). Hybrid search combines the two so each covers the other's blind spots; it typically beats either approach alone, though the size of the gain depends on your corpus and queries.
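To make the keyword side concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring. It assumes whitespace tokenization with lowercasing and uses a non-negative IDF variant, so its numbers will not match a library such as `rank_bm25` exactly; `k1=1.5` and `b=0.75` are common defaults, and the example documents and the `ERR-4012` identifier are made up for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # no exact token match, so no contribution
            # Non-negative IDF variant: rarer terms get a higher weight
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Length normalization: long documents need more hits to score the same
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

# Made-up documents; only the first contains the exact identifier
docs = [
    "Restart the gateway when error ERR-4012 appears in the log",
    "The gateway may fail if the network connection is unstable",
    "Troubleshooting guide for connection problems",
]
scores = bm25_scores("ERR-4012", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

Only the document containing the literal token scores above zero. That is the behavior you want for IDs and code symbols, and it is also why BM25 alone misses paraphrases.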

How do I prevent hallucinated citations?

Never let the LLM generate citation markers freely. Instead, prefix each retrieved chunk with a citation index before generation, then parse those indices from the output. If a cited index doesn't match a retrieved chunk, drop it.


Advanced RAG: Hybrid Search, Reranking, and Citation for Production

Basic RAG retrieves and generates. Production RAG uses hybrid search, cross-encoder reranking, and grounded citations. Here's how to build the latter.


Rajath Kumar, Edge AI Engineer & Founder, Analog Data

2026-06-27·14 min read


Why Basic RAG Fails in Production

You followed the tutorial: embed documents, store in a vector DB, retrieve top-k, generate. It works on the demo. Then you put it in production and:

Retrieval misses obvious matches because the user used different terminology
Results are irrelevant because cosine similarity ≠ relevance
The LLM hallucinates citations that don't exist in the source
Latency is terrible because the reranker you bolted on scores 100 documents per query

Production RAG needs three things basic RAG doesn't have:

Hybrid search — combine keyword + semantic retrieval
Cross-encoder reranking — actually measure relevance, not just similarity
Grounded citations — force the LLM to cite real chunks

1. Hybrid Search: BM25 + Vector Search

Keyword search (BM25) catches exact matches. Vector search catches semantic matches. Together they cover far more queries than either does alone.

python

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np

class HybridRetriever:
    def __init__(self, documents, embed_model="all-MiniLM-L6-v2"):
        self.documents = documents
        # Whitespace tokenization; note that matching is case-sensitive
        self.bm25 = BM25Okapi([doc.split() for doc in documents])
        self.encoder = SentenceTransformer(embed_model)
        # Unit-length embeddings make the dot product equal to cosine similarity
        self.embeddings = self.encoder.encode(documents, normalize_embeddings=True)

    def search(self, query, top_k=20, alpha=0.5):
        # BM25 scores, min-max scaled to [0, 1]
        bm25_scores = self.bm25.get_scores(query.split())
        bm25_scores = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min() + 1e-9)

        # Vector scores: cosine similarity in [-1, 1], not rescaled like BM25 above
        query_emb = self.encoder.encode([query], normalize_embeddings=True)
        vector_scores = (self.embeddings @ query_emb.T).flatten()

        # Combine: alpha weights BM25, (1 - alpha) weights the vector score
        hybrid_scores = alpha * bm25_scores + (1 - alpha) * vector_scores
        top_indices = np.argsort(hybrid_scores)[::-1][:top_k]

        # List of (document text, hybrid score), best first
        return [(self.documents[i], hybrid_scores[i]) for i in top_indices]

The alpha parameter controls the blend:

alpha=1.0 → pure BM25 (keyword matching)
alpha=0.0 → pure vector search (semantic matching)
alpha=0.5 → balanced (default for most use cases)

Tune it per use case. Code documentation? Lean toward BM25 (alpha=0.7). Natural language Q&A? Lean toward vector (alpha=0.3).
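The values above are starting points. If you have even a handful of queries with known relevant documents, you can pick alpha empirically instead. The sketch below is one way to do that: a grid search over candidate alphas, scored by mean recall@k. All scores and labels here are made-up toy data, the function names are hypothetical, and unlike `HybridRetriever` this sketch min-max scales both score lists before blending.

```python
def min_max(scores):
    """Scale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def recall_at_k(bm25, vector, relevant, alpha, k):
    """Fraction of relevant documents found in the top-k blended results."""
    blended = [alpha * b + (1 - alpha) * v
               for b, v in zip(min_max(bm25), min_max(vector))]
    top_k = sorted(range(len(blended)), key=lambda i: blended[i], reverse=True)[:k]
    return len(set(top_k) & relevant) / len(relevant)

def tune_alpha(labeled_queries, alphas, k=2):
    """Return the alpha with the best mean recall@k (first one wins on ties)."""
    def mean_recall(alpha):
        total = sum(recall_at_k(q["bm25"], q["vector"], q["relevant"], alpha, k)
                    for q in labeled_queries)
        return total / len(labeled_queries)
    return max(alphas, key=mean_recall)

# Toy, made-up scores for two queries over four documents each
labeled_queries = [
    {"bm25": [9.0, 0.5, 0.0, 0.2], "vector": [0.2, 0.8, 0.7, 0.1], "relevant": {0}},
    {"bm25": [0.0, 0.1, 4.0, 0.0], "vector": [0.3, 0.9, 0.4, 0.2], "relevant": {2}},
]
best_alpha = tune_alpha(labeled_queries, alphas=[0.0, 0.3, 0.5, 0.7, 1.0], k=1)
print(best_alpha)
```

In practice you would fill `bm25` and `vector` with the raw scores your retriever produces for each labeled query, and use a held-out set so the chosen alpha is not overfit to a few examples.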

2. Cross-Encoder Reranking

Retrieval gives you candidates. Reranking gives you the right candidates.

A bi-encoder (what you used for retrieval) encodes query and document separately — fast but coarse. A cross-encoder encodes them together — slow but precise.

python

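from sentence_transformers import CrossEncoder

class Reranker:
    def __init__(self, model="cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.model = CrossEncoder(model)

    def rerank(self, query, documents, top_k=5):
        # documents is a list of (text, retrieval_score) tuples from HybridRetriever.search
        texts = [doc for doc, _ in documents]
        pairs = [(query, text) for text in texts]
        scores = self.model.predict(pairs)
        # Pair each text with its cross-encoder score so callers can unpack (doc, score)
        ranked = sorted(zip(texts, scores), key=lambda x: x[1], reverse=True)
        return [(text, float(score)) for text, score in ranked[:top_k]]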

The full pipeline:

python

retriever = HybridRetriever(documents)
reranker = Reranker()

# 1. Retrieve top 20 candidates (fast, coarse)
candidates = retriever.search(user_query, top_k=20)

# 2. Rerank to get top 5 (slow, precise)
top_results = reranker.rerank(user_query, candidates, top_k=5)

# 3. Generate with only the best chunks
# (llm and prompt_with_context are placeholders for your own client and prompt template)
context = "\n\n".join([doc for doc, score in top_results])
response = llm.generate(prompt_with_context(user_query, context))

3. Grounded Citations

The problem: LLMs hallucinate. If you ask for citations, they'll make up page numbers, URLs, and quotes.

The solution: prefix each chunk with a citation index before generation, then parse those indices from the output.

python

import re

def build_prompt_with_citations(query, retrieved_chunks):
    # Prefix each chunk with a 1-based index the model can cite as [N]
    context_parts = []
    for i, (chunk, score) in enumerate(retrieved_chunks):
        context_parts.append(f"[{i+1}] {chunk}")
    context = "\n\n".join(context_parts)

    return f"""Answer the question based on the following sources.
Cite sources using [N] notation. Only cite sources that exist below.

Sources:
{context}

Question: {query}

Answer:"""

def extract_citations(text, max_citations):
    # Keep only indices that map to a chunk we actually supplied
    citations = re.findall(r'\[(\d+)\]', text)
    valid = {int(c) for c in citations if 1 <= int(c) <= max_citations}
    return sorted(valid)

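Why this works: the model sees only the indices [1] through [N] that you provided, so every valid citation maps to a real chunk. It can still emit an index you never supplied; if it generates [7] when you gave it 5 chunks, extract_citations leaves it out of the returned list. The marker itself stays in the answer text unless you also remove it there.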
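Note that extract_citations only returns the valid indices; it does not touch the answer text, so an out-of-range marker such as [7] would still be shown to the user. A minimal sketch of removing those markers from the text follows. The helper name `strip_invalid_citations` is hypothetical, and it assumes citations always appear in the plain `[N]` form the prompt asks for.

```python
import re

def strip_invalid_citations(text, max_citations):
    """Remove [N] markers whose N is outside 1..max_citations; keep valid ones."""
    def keep_or_drop(match):
        index = int(match.group(1))
        # Valid marker: return it unchanged, including any leading whitespace
        if 1 <= index <= max_citations:
            return match.group(0)
        # Invalid marker: drop it together with the whitespace before it
        return ""
    return re.sub(r"\s*\[(\d+)\]", keep_or_drop, text)

answer = "BM25 handles exact terms [1]. Vectors handle paraphrases [7]."
print(strip_invalid_citations(answer, max_citations=5))
```

This removes the marker but leaves the sentence it was attached to. Whether an uncited sentence should also be dropped or flagged is a product decision this sketch does not make.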

4. Putting It All Together

python

class ProductionRAG:
    def __init__(self, documents):
        self.retriever = HybridRetriever(documents)
        self.reranker = Reranker()
        # YourLLM is a placeholder: any client exposing generate(prompt) -> str
        self.llm = YourLLM()

    def answer(self, query):
        # 1. Hybrid retrieval
        candidates = self.retriever.search(query, top_k=20)

        # 2. Rerank
        top_chunks = self.reranker.rerank(query, candidates, top_k=5)

        # 3. Generate with citations
        prompt = build_prompt_with_citations(query, top_chunks)
        response = self.llm.generate(prompt)

        # 4. Validate citations (1-based indices into top_chunks)
        valid_citations = extract_citations(response, len(top_chunks))

        return {
            "answer": response,
            "citations": [top_chunks[i-1] for i in valid_citations],
            "sources": [{"index": i+1, "text": chunk, "score": score}
                       for i, (chunk, score) in enumerate(top_chunks)],
        }
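Because the real pipeline needs model downloads and an LLM client, it can help to check the control flow with stand-ins first. The sketch below does that under stated assumptions: `StubRetriever`, `StubReranker`, and `StubLLM` are hypothetical classes that only mimic the method signatures used above, and `answer` is a compressed restatement of the four steps, not a replacement for `ProductionRAG`.

```python
import re

class StubRetriever:
    """Hypothetical stand-in for HybridRetriever: returns fixed (text, score) pairs."""
    def search(self, query, top_k=20):
        return [("Paris is the capital of France.", 0.9),
                ("Berlin is the capital of Germany.", 0.4)][:top_k]

class StubReranker:
    """Hypothetical stand-in for Reranker: keeps the incoming order."""
    def rerank(self, query, documents, top_k=5):
        return documents[:top_k]

class StubLLM:
    """Hypothetical stand-in for the LLM: cites one real and one nonexistent source."""
    def generate(self, prompt):
        return "Paris is the capital of France [1]. It has 50 districts [9]."

def answer(query, retriever, reranker, llm):
    # Retrieve, rerank, build an indexed prompt, generate, then validate citations
    top_chunks = reranker.rerank(query, retriever.search(query), top_k=5)
    sources = "\n\n".join(f"[{i + 1}] {text}" for i, (text, _) in enumerate(top_chunks))
    response = llm.generate(f"Sources:\n{sources}\n\nQuestion: {query}\n\nAnswer:")
    cited = sorted({int(n) for n in re.findall(r"\[(\d+)\]", response)
                    if 1 <= int(n) <= len(top_chunks)})
    return {"answer": response, "citations": [top_chunks[i - 1] for i in cited]}

result = answer("What is the capital of France?",
                StubRetriever(), StubReranker(), StubLLM())
print(result["citations"])
```

The stub LLM deliberately cites [9], which does not exist, so the run shows the validation step keeping only the citation that maps to a supplied chunk.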

Summary

Component	Purpose	Tool or setting
Hybrid Search	Retrieve candidates using both keywords and semantics	BM25 + sentence-transformers
Cross-Encoder Reranking	Re-score candidates for actual relevance	`cross-encoder/ms-marco-*`
Grounded Citations	Prevent hallucinated sources	Index-based citation parsing
Alpha tuning	Control keyword vs semantic balance	0.5 default, 0.7 for code, 0.3 for prose

Basic RAG retrieves and generates. Production RAG retrieves, reranks, generates, and cites. The difference is night and day.

