RAG Beyond Vector Search: BM25, Hybrid, Re-ranking

By SumGuy 11 min read
When Your RAG System Can’t Find Its Own Acronyms

You’ve built it. You’ve hosted it. You’ve dumped 10,000 documents into your vector database, and everything should work perfectly—until someone asks for JSON-RPC or mentions a serial number like SN-48239-X, and your retrieval comes back blank.

Turns out, dense embeddings are terrible at exact matches. They’re great at semantic fuzzy finding. They’re useless at acronyms, IDs, and weird product names. So you either patch it with hybrid search (dense + keyword), or you watch your RAG system fail on the most searchable queries.

Here’s the thing: vector-only search is a footgun. Most production RAG systems need at least two retrieval streams, smart re-ranking, and a thoughtful chunking strategy. Let me walk you through what actually works.


Dense embedding models (like OpenAI’s text-embedding-3-large, Nomic’s nomic-embed-text, or BGE-M3) are trained to map semantically similar text to nearby points in vector space. Sounds great. And it is great, for semantic questions.

But dense search breaks on:

  1. Acronyms and abbreviations: LLM, RAG, CRUD, CIDR. The model sees rare tokens and produces mediocre embeddings. A document full of acronyms won’t cluster near the acronym query.
  2. Exact IDs, serial numbers, product codes: SKU-2024-09-001, invoice #INV-48392, GitHub issue #15392. These are nearly meaningless semantically, but critically searchable. Dense vectors fail here catastrophically.
  3. Domain-specific jargon that’s under-represented in training data: if your documents talk about a proprietary framework or internal tool name, the embedding space may not have a good representation.
  4. Boolean/faceted queries: “Show me all docs with tag=compliance AND year>2025”. Dense search can’t do this; you need metadata filtering.
  5. Rare terms and misspellings: PostgreSQL vs Postgres, Kubernetes vs K8s. Dense vectors generalize, but they don’t match rare lexical variants cleanly.

The result: your users ask for something exact, your retriever pulls back semantically vague junk, and your LLM hallucinates because the context is useless.


BM25: The Boring Winner Nobody Talks About

Here’s the secret: BM25 is still the gold standard for keyword retrieval, and it’s been around since 1994. It’s an evolution of TF-IDF that accounts for document length and term saturation.

BM25 is lexical. It matches tokens. No vectors, no embeddings, no semantic understanding—just “does this word appear in this document, and how often?”

Why BM25 still dominates:

The catch: BM25 has no semantic understanding. The query “distributed caching” won’t match a document that only says “Redis cluster”, because those tokens don’t overlap.
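To make the length normalization, term saturation, and the no-overlap problem concrete, here’s a minimal pure-Python sketch of BM25 scoring. It’s an illustration, not a reference implementation: the function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the two toy documents are all assumptions of this sketch.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with a BM25 variant."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: how many documents contain each term.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue  # no token overlap, no contribution
            # Smoothed IDF (the "1 +" keeps it non-negative): rare terms weigh more.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # b scales the document-length penalty; k1 caps the gain from repeats.
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Toy corpus, already lowercased and whitespace-tokenized.
corpus = [
    "redis cluster shards keys across nodes".split(),
    "json-rpc is a stateless rpc protocol".split(),
]
exact = bm25_scores(["json-rpc"], corpus)
semantic = bm25_scores(["distributed", "caching"], corpus)
print(exact)     # only the second document scores above zero
print(semantic)  # both documents score zero: no shared tokens
```

The exact-token query lands on the right document, while the “distributed caching” query scores nothing against the Redis document. That’s the gap the dense retriever is supposed to cover.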

So the answer isn’t “use BM25 instead”—it’s use both.


Hybrid Retrieval: Dense + BM25

The magic is combining dense and lexical search, then re-ranking the merged results.

Two strategies:

1. Reciprocal Rank Fusion (RRF)

RRF is dirt simple. For each document, you get a rank position from dense search (e.g., position 3) and a rank position from BM25 (e.g., position 7). RRF combines them:

RRF score = 1 / (k + dense_rank) + 1 / (k + bm25_rank)

where k is typically 60 (it damps the advantage of position 1, so no single retriever dominates).

Dense rank 3: 1 / (60 + 3) ≈ 0.0159
BM25 rank 7: 1 / (60 + 7) ≈ 0.0149
Combined: ≈ 0.0308

RRF doesn’t care about the magnitude of the original scores—only relative rank. It’s beautifully robust because a crappy dense score and a great BM25 score both contribute fairly.
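As a rough sketch, RRF over any number of ranked lists fits in a few lines. The helper name `rrf_fuse` and the document IDs below are made up for illustration; the only inputs are ranked lists of IDs, best first.

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc IDs (best first) with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        # Positions are 1-based, so the top hit contributes 1 / (k + 1).
        for position, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + position)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings from the two retrievers.
dense_ranking = ["doc_a", "doc_b", "doc_c"]
bm25_ranking = ["doc_c", "doc_a", "doc_d"]

fused = rrf_fuse([dense_ranking, bm25_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

Documents that show up in both lists (doc_a, doc_c) float above documents that only one retriever found, which is the behavior you want from fusion.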

2. Weighted Score Fusion

Normalize both scores to [0, 1], then blend:

final_score = 0.4 * dense_score + 0.6 * bm25_score

More control, but requires careful tuning. If your dense scores are mushier, the weights need adjusting. RRF sidesteps this entirely.
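Here’s one way the weighted blend might look, assuming min-max normalization per query (other normalizations exist, and the choice affects the result). The function names and the sample scores are invented for the sketch; the 0.4 / 0.6 weights come from the formula above.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def weighted_fusion(dense_scores, bm25_scores, dense_weight=0.4, bm25_weight=0.6):
    """Blend two score lists (same document order) after normalizing each."""
    dense_norm = min_max(dense_scores)
    bm25_norm = min_max(bm25_scores)
    return [dense_weight * d + bm25_weight * b for d, b in zip(dense_norm, bm25_norm)]


# Made-up scores for three documents: cosine similarities and raw BM25 scores.
dense_scores = [0.82, 0.79, 0.40]
bm25_scores = [0.0, 7.3, 2.1]

final_scores = weighted_fusion(dense_scores, bm25_scores)
best = max(range(len(final_scores)), key=final_scores.__getitem__)
print(final_scores, "best document index:", best)
```

Note how the raw BM25 scores live on a completely different scale from cosine similarity. Skip the normalization and the BM25 term swamps everything, which is exactly the tuning headache RRF avoids.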


Re-ranking: The Silent Efficiency Weapon

Okay, now you’ve got a hybrid retrieval pipeline pulling 50 documents. But you only need the top 5 for your LLM context.

Here’s the trick: re-ranking a small candidate set with a cross-encoder usually buys you more precision, for less compute, than retrieving deeper with the dense model alone.

A cross-encoder (like BGE-reranker-v2) directly scores each query-document pair:

cross_encoder([query, document]) → relevance_score [0, 1]

Unlike dense embedders (which score by vector distance), cross-encoders see the full query and document together, so they catch nuance. They’re slower per-pair (maybe 50-200 pairs/sec on CPU), but you only run them on the top 50–100 from retrieval, not millions.

The workflow:

  1. Retrieval (fast): Dense + BM25 hybrid → top 50–100 candidates
  2. Re-ranking (slow but focused): Cross-encoder scores those candidates → top 5 returned to LLM

This beats:

Popular cross-encoders:

For self-hosted, grab BGE-reranker-v2 and run it locally via Hugging Face transformers.


A Real Hybrid + Re-rank Pipeline

Here’s a working Python sketch using open-source libraries:

hybrid_rag.py
from sentence_transformers import SentenceTransformer, CrossEncoder
from rank_bm25 import BM25Okapi
import numpy as np

# Setup. Assumption: BAAI/bge-m3 (one of the embedders named earlier) stands in
# as the dense model; any SentenceTransformer-compatible embedder should work.
dense_model = SentenceTransformer("BAAI/bge-m3")
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

documents = [
    "JSON-RPC is a stateless, light-weight RPC protocol...",
    "REST APIs use HTTP methods for CRUD operations...",
    "gRPC uses Protocol Buffers for efficient serialization...",
]

# Tokenize for BM25 (lowercased, so tokens match the lowercased query below)
corpus = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(corpus)

# Retrieve: Dense + BM25
query = "JSON-RPC vs gRPC"
# Normalized embeddings make the dot product equal to cosine similarity
query_embedding = dense_model.encode(query, normalize_embeddings=True)
document_embeddings = dense_model.encode(documents, normalize_embeddings=True)

# Dense scores (cosine similarity)
dense_scores = np.dot(document_embeddings, query_embedding)
dense_ranks = np.argsort(-dense_scores)

# BM25 scores
query_tokens = query.lower().split()
bm25_scores = bm25.get_scores(query_tokens)
bm25_ranks = np.argsort(-bm25_scores)

# RRF fusion (k = 60, 1-based rank positions)
rrf_scores = {}
for i, doc_idx in enumerate(dense_ranks):
    rrf_scores[int(doc_idx)] = rrf_scores.get(int(doc_idx), 0) + 1.0 / (60 + i + 1)
for i, doc_idx in enumerate(bm25_ranks):
    rrf_scores[int(doc_idx)] = rrf_scores.get(int(doc_idx), 0) + 1.0 / (60 + i + 1)

# Top-50 candidates for re-ranking
candidates = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)[:50]
candidate_docs = [documents[idx] for idx, _ in candidates]

# Re-rank with cross-encoder
rerank_scores = reranker.predict([[query, doc] for doc in candidate_docs])
top_5 = sorted(zip(candidate_docs, rerank_scores), key=lambda x: x[1], reverse=True)[:5]

print(f"Top 5 results for '{query}':")
for doc, score in top_5:
    print(f" [{score:.3f}] {doc[:60]}...")

This pipeline handles acronyms (BM25 catches JSON-RPC), semantic similarity (dense catches gRPC vs RPC), and produces a ranked list of the best candidates.


Chunking Strategies That Survive Hybrid Retrieval

Your chunk size matters. Too small, and you lose context. Too large, and you dilute relevance.

Sentence-level chunking:

Paragraph-level chunking (128–512 tokens):

Parent-child chunking:

Sliding window:

Example: if chunking a 10,000-word article, use sliding windows of 512 tokens with 128-token overlap. Hybrid search will find the right window; re-ranking will promote the most relevant.
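A sliding window like that can be sketched in a few lines. This is only an illustration: `sliding_window_chunks` is a made-up helper, and the “tokens” here are placeholder strings. In a real pipeline you’d count tokens with your embedding model’s tokenizer.

```python
def sliding_window_chunks(tokens, window=512, overlap=128):
    """Split a token list into windows of `window` tokens sharing `overlap` tokens."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        # Stop once a window reaches the end, to avoid a redundant tail chunk.
        if start + window >= len(tokens):
            break
    return chunks


# Placeholder tokens standing in for a tokenized article.
tokens = [f"tok{i}" for i in range(1200)]
chunks = sliding_window_chunks(tokens)
print([len(chunk) for chunk in chunks])
```

Each chunk starts 384 tokens after the previous one, so the last 128 tokens of one chunk are the first 128 of the next. A sentence that straddles a boundary still lands whole in at least one window.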


Metadata Filtering and Query Rewriting

RAG isn’t just retrieval + re-rank. Smart systems add:

Metadata filtering:

Most vector DBs (Qdrant, Weaviate, Chroma, Pinecone) support filtering by metadata at retrieval time:

results = db.similarity_search(
    query,
    k=50,
    filter={"author": "sumguy", "year": {"$gte": 2024}},
)

Use this to pre-filter before dense + BM25 kicks in. Reduces retrieval noise without running a query rewrite.
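If your store doesn’t filter for you (say, an in-memory BM25 index), the same idea is easy to approximate by hand. A rough sketch, assuming documents are dicts with a `meta` field; the `matches` helper, the toy documents, and the tiny subset of filter syntax (equality plus `$gte`) are all invented for this example and don’t correspond to any particular database’s API.

```python
def matches(metadata, conditions):
    """True if metadata satisfies every condition (equality or {"$gte": x})."""
    for field, expected in conditions.items():
        value = metadata.get(field)
        if isinstance(expected, dict):
            # Only the "$gte" operator is handled in this sketch.
            if "$gte" in expected and (value is None or value < expected["$gte"]):
                return False
        elif value != expected:
            return False
    return True


# Toy documents with made-up metadata.
docs = [
    {"text": "Hybrid search notes", "meta": {"author": "sumguy", "year": 2025}},
    {"text": "Older networking post", "meta": {"author": "sumguy", "year": 2021}},
    {"text": "Guest post", "meta": {"author": "someone-else", "year": 2025}},
]
conditions = {"author": "sumguy", "year": {"$gte": 2024}}

# Pre-filter first, then hand only the survivors to dense + BM25 retrieval.
allowed = [doc for doc in docs if matches(doc["meta"], conditions)]
print([doc["text"] for doc in allowed])
```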

Query rewriting (HyDE):

Hypothetical Document Embeddings: instead of embedding the user’s query directly, generate a fake document that would answer it, then embed that:

hypothesis_prompt = f"""
Generate a document that would answer this query: "{query}"
"""
fake_doc = llm.generate(hypothesis_prompt)
embedding = dense_model.encode(fake_doc)

The fake document is often more “embeddable” than the terse query. Works well for vague queries like “how do I set up a homelab?”


Evaluation: Hit@k and NDCG

How do you know if your hybrid pipeline is better than pure dense?

Hit@k (Recall):

NDCG (Normalized Discounted Cumulative Gain):

For self-hosted RAG, build a test set of 50–100 queries with relevant docs marked. Run dense-only, BM25-only, and hybrid pipelines through it. Check NDCG@5 and Hit@20. Hybrid usually wins on both.
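Both metrics are small enough to compute yourself. Here’s a minimal sketch under a couple of assumptions: Hit@k is treated as “at least one relevant doc in the top k”, NDCG uses the common log2 discount with linear gain, and the doc IDs and relevance labels are made up. Definitions vary a little between evaluation libraries, so check which variant yours uses before comparing numbers.

```python
import math


def hit_at_k(retrieved, relevant, k):
    """1.0 if any relevant doc ID appears in the top k results, else 0.0."""
    return 1.0 if any(doc in relevant for doc in retrieved[:k]) else 0.0


def ndcg_at_k(retrieved, relevance, k):
    """NDCG@k, where `relevance` maps doc ID -> graded relevance (missing = 0)."""
    dcg = sum(
        relevance.get(doc, 0) / math.log2(rank + 1)
        for rank, doc in enumerate(retrieved[:k], start=1)
    )
    # Ideal ordering: the most relevant documents ranked first.
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# One hypothetical test query: two relevant docs, retrieved at ranks 2 and 4.
retrieved = ["d3", "d1", "d7", "d2"]
relevance = {"d1": 1, "d2": 1}

print(hit_at_k(retrieved, set(relevance), k=1))
print(hit_at_k(retrieved, set(relevance), k=2))
print(round(ndcg_at_k(retrieved, relevance, k=4), 4))
```

Average each metric over the whole test set, once per pipeline, and compare. NDCG penalizes relevant docs that sit low in the list, which is what makes it useful for judging the re-ranker.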


When to Abandon Vector Databases Entirely

Here’s the uncomfortable truth: not every RAG problem needs a vector database.

If your use case is:

…you’re better off with a full-text search engine like Meilisearch or Typesense. These are purpose-built for keyword + filtering, with sub-100ms latency and no embedding overhead.

Use a vector DB when you need:

Hybrid retrieval (dense + BM25) bridges the gap. You get both worlds—but if you’re not using the dense part meaningfully, bail and use Meilisearch.


What to Add When

Starting a RAG system? Here’s the roadmap:

Phase 1: Paragraph-level chunking + dense search

Phase 2: Add BM25 + RRF fusion

Phase 3: Cross-encoder re-ranking

Phase 4: Query rewriting (HyDE) + metadata filtering

Phase 5: Swap for Meilisearch if keyword-heavy

Most production systems live at Phase 3. Phase 4 is optimization theater. Phase 5 is a pivot, not an addition.

Start simple. Measure with NDCG@5. Add the next piece only if it moves the needle.


The Honest Reality

Dense embeddings are amazing. But they’re not magic, and they’re not sufficient on their own. Your RAG system needs:

  1. Hybrid retrieval (dense + BM25) to catch both semantic and exact matches
  2. Smart chunking (paragraph or parent-child) to give both methods enough signal
  3. Re-ranking (cross-encoder) to promote the genuinely best candidates
  4. Metadata filtering to reduce noise before retrieval even starts

If you skip hybrid and rely on pure dense, you’ll spend weeks debugging why your system can’t find JSON-RPC or SN-48239-X when the exact string is sitting right there in the docs.

Do the hybrid thing. It’s boring, it works, and it’ll save you a 2 AM debugging session.

Now go make your RAG system actually useful.

