RAG Beyond Vector Search: BM25, Hybrid, Re-ranking

When Your RAG System Can’t Find Its Own Acronyms

You’ve built it. You’ve hosted it. You’ve dumped 10,000 documents into your vector database, and everything should work perfectly, until someone asks for JSON-RPC or mentions a serial number like SN-48239-X, and your retrieval comes back blank.

Turns out, dense embeddings are terrible at exact matches. They’re great at semantic fuzzy finding. They’re useless at acronyms, IDs, and weird product names. So you either patch it with hybrid search (dense + keyword), or you watch your RAG system fail on the most searchable queries.

Vector-only search is a footgun. Most production RAG systems need at least two retrieval streams, smart re-ranking, and a thoughtful chunking strategy. Let me walk you through what actually works.

The Failure Modes of Pure Dense Search

Dense embedding models (like OpenAI’s text-embedding-3-large, Nomic’s nomic-embed-text-v1.5, or BGE-M3) are trained to map semantically similar text to nearby points in vector space. Sounds great. It is great, for semantic questions.

But dense search breaks on:

Acronyms and abbreviations: LLM, RAG, CRUD, CIDR. The model sees rare tokens and produces mediocre embeddings. A document full of acronyms won’t cluster near the acronym query.
Exact IDs, serial numbers, product codes: SKU-2024-09-001, invoice #INV-48392, GitHub issue #15392. These are nearly meaningless semantically, but critically searchable. Dense vectors fail here catastrophically.
Domain-specific jargon that’s under-represented in training data: If your documents talk about a proprietary framework or internal tool name, the embedding space may not have a good representation.
Boolean/faceted queries: “Show me all docs with tag=compliance AND year>2025”. Dense search can’t do this; you need metadata filtering.
Rare terms and lexical variants: PostgreSQL vs Postgres, Kubernetes vs K8s. Dense vectors generalize, but they don’t match rare lexical variants cleanly.

The result: your users ask for something exact, your retriever pulls back semantically vague junk, and your LLM hallucinates because the context is useless.

BM25: The Boring Winner Nobody Talks About

Here’s the secret: BM25 is still the gold standard for keyword retrieval, and it’s been around since 1994. It’s an evolution of TF-IDF that accounts for document length and term saturation.

BM25 is lexical. It matches tokens. No vectors, no embeddings, no semantic understanding, just “does this word appear in this document, and how often?”

Why BM25 still dominates:

Exact matches work. Query JSON-RPC? BM25 finds all docs with that token.
Frequency matters, up to a point. A doc that repeats RAG 20 times ranks higher than one mentioning it once, although term saturation means each extra mention adds less than the one before.
Term scarcity is a feature. Rare query terms boost signal. If someone searches for InfiniBand, a doc mentioning it ranks absurdly high.
Requires no training. No embeddings to generate, no GPU, no API calls.
Predictable performance. No hallucinations from semantic drift.

The catch: BM25 has no semantic understanding. Query distributed caching won’t match Redis cluster, because those tokens don’t overlap.
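To make the scoring concrete, here is a minimal pure-Python sketch of Okapi BM25. It is an illustration, not the implementation any particular library ships: the `k1=1.5` and `b=0.75` defaults, the non-negative IDF variant, the whitespace tokenizer, and the three sample documents are all assumptions chosen for the example.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document in corpus_tokens against query_tokens."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue  # purely lexical: no token overlap, no contribution
            # Rare terms get a larger IDF (the "+ 1" keeps it non-negative).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # tf saturates via k1; b penalizes documents longer than average.
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase, split on whitespace.
    return text.lower().split()


docs = [
    "JSON-RPC is a stateless, light-weight RPC protocol",
    "Redis cluster shards keys across nodes",
    "REST APIs use HTTP methods for CRUD operations",
]
corpus = [tokenize(d) for d in docs]

exact = bm25_scores(tokenize("JSON-RPC"), corpus)
semantic = bm25_scores(tokenize("distributed caching"), corpus)
print(exact)     # only the first document gets a non-zero score
print(semantic)  # all zeros: no shared tokens with "Redis cluster"
```

The second query shows the catch directly: the Redis document is the right answer, but it shares no tokens with the query, so BM25 scores it at zero.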

So the answer isn’t “use BM25 instead.” It’s “use both.”

Hybrid Retrieval: Dense + BM25

The magic is combining dense and lexical search, then re-ranking the merged results.

Two strategies:

1. Reciprocal Rank Fusion (RRF)

RRF is dirt simple. For each document, you get a rank position from dense search (e.g., position 3) and a rank position from BM25 (e.g., position 7). RRF combines them:

RRF score = sum over each ranked list of 1 / (k + rank_position)

where k is a smoothing constant, typically 60, that keeps the top-ranked position from dominating the sum.

Dense rank 3: 1 / (60 + 3) ≈ 0.0159
BM25 rank 7: 1 / (60 + 7) ≈ 0.0149
Combined: ≈ 0.0308

RRF doesn’t care about the magnitude of the original scores, only relative rank. It’s beautifully robust because a crappy dense score and a great BM25 score both contribute fairly.
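As a rough sketch, RRF fits in a few lines of plain Python. The function name, the document IDs, and the two sample rankings are made up for illustration; `doc_a` is placed at dense rank 3 and BM25 rank 7 so its score can be checked against the arithmetic above.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs (best first) into one."""
    fused = {}
    for ranking in rankings:
        # Ranks are 1-based: the best document in each list has rank 1.
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # A document missing from one list simply gets no contribution from it.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


dense_ranking = ["doc_b", "doc_c", "doc_a", "doc_d"]
bm25_ranking = ["doc_d", "doc_b", "doc_e", "doc_f", "doc_g", "doc_h", "doc_a"]

fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

Documents that rank well in both lists rise to the top, while a document that appears in only one list still gets partial credit.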

2. Weighted Score Fusion

Normalize both scores to [0, 1], then blend:

final_score = 0.4 * dense_score + 0.6 * bm25_score

More control, but requires careful tuning. If your dense scores are mushier, the weights need adjusting. RRF sidesteps this entirely.
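A minimal sketch of weighted fusion, assuming min-max normalization (one of several reasonable choices) and treating a document missing from one retriever as scoring 0 there. The function names and sample scores are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fusion(dense_scores, bm25_scores, dense_weight=0.4, bm25_weight=0.6):
    """Blend normalized dense and BM25 scores; missing docs count as 0."""
    dense_norm = min_max_normalize(dense_scores)
    bm25_norm = min_max_normalize(bm25_scores)
    doc_ids = set(dense_norm) | set(bm25_norm)
    fused = {
        doc_id: dense_weight * dense_norm.get(doc_id, 0.0)
        + bm25_weight * bm25_norm.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Raw scores live on very different scales, which is why normalization matters.
dense_scores = {"doc_a": 0.82, "doc_b": 0.79, "doc_c": 0.40}
bm25_scores = {"doc_a": 2.1, "doc_b": 11.5, "doc_d": 0.3}

ranked = weighted_fusion(dense_scores, bm25_scores)
print(ranked[:2])
```

Note how sensitive the result is to the normalization: min-max pins the worst candidate in each list to 0 and the best to 1, regardless of how close the raw scores were.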

Re-ranking: The Silent Efficiency Weapon

Okay, now you’ve got a hybrid retrieval pipeline pulling 50 documents. But you only need the top 5 for your LLM context.

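Here’s the trick: re-ranking a short candidate list with a cross-encoder usually improves your final top 5 more than pulling a deeper list from the same dense model, and the extra cost stays bounded because it only ever sees a few dozen pairs.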

A cross-encoder (like BGE-reranker-v2) directly scores each query-document pair:

cross_encoder([query, document]) → relevance_score [0, 1]

Unlike dense embedders (which score by vector distance), cross-encoders see the full query and document together, so they catch nuance. They’re slower per-pair (maybe 50-200 pairs/sec on CPU), but you only run them on the top 50 to 100 from retrieval, not millions.

The workflow:

Retrieval (fast): Dense + BM25 hybrid → top 50 to 100 candidates
Re-ranking (slow but focused): Cross-encoder scores the 50 → top 5 returned to LLM

This beats:

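Pure dense, no re-rank: whatever the embedding model puts first goes to the LLM, wrong or not
Deeper dense retrieval: more candidates from the same model, but the same blind spots in how they’re ordered
Dense-only re-ranking: re-scoring candidates with the embedding model that retrieved them is self-referential and adds no new signal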
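The two-stage workflow can be sketched independently of any model. In the snippet below, `score_pair` is any callable that scores a (query, document) pair; `toy_overlap_scorer` is a hypothetical stand-in used only so the example runs without downloading a model, and in a real pipeline it would be replaced by a wrapper around a cross-encoder.

```python
def retrieve_then_rerank(query, candidates, score_pair, top_n=5):
    """Score each (query, candidate) pair and keep the top_n best."""
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


def toy_overlap_scorer(query, document):
    """Hypothetical stand-in for a cross-encoder: share of query tokens found in the doc."""
    query_tokens = set(query.lower().split())
    doc_tokens = set(document.lower().split())
    return len(query_tokens & doc_tokens) / len(query_tokens)


# Pretend these came back from the hybrid retrieval stage.
candidates = [
    "REST APIs use HTTP methods for CRUD operations",
    "gRPC uses Protocol Buffers for efficient serialization",
    "JSON-RPC is a stateless RPC protocol",
]

top = retrieve_then_rerank("stateless RPC protocol", candidates, toy_overlap_scorer, top_n=2)
for doc, score in top:
    print(f"[{score:.2f}] {doc}")
```

The point of the shape is that the expensive scorer runs once per candidate, so its cost scales with the size of the candidate list rather than the size of the corpus.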

Popular cross-encoders:

BGE-reranker-v2 (BAAI, open source): fast, very good, no API cost
Cohere Reranker (proprietary API): excellent, but costs per request
Jina Reranker (hosted API, with some model weights also published for non-commercial use): solid, reasonable cost, good for multilingual

For self-hosted, grab BGE-reranker-v2 and run it locally via Hugging Face transformers.

A Real Hybrid + Re-rank Pipeline

Here’s a working Python sketch using open-source libraries:

from sentence_transformers import SentenceTransformer, CrossEncoder
from rank_bm25 import BM25Okapi
import numpy as np

# Setup
dense_model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")
documents = [
    "JSON-RPC is a stateless, light-weight RPC protocol...",
    "REST APIs use HTTP methods for CRUD operations...",
    "gRPC uses Protocol Buffers for efficient serialization...",
]

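# Tokenize for BM25 (lowercase here because the query is lowercased below;
# otherwise "JSON-RPC" in a doc never matches "json-rpc" in the query)
corpus = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(corpus)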

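# Retrieve: Dense + BM25
query = "JSON-RPC vs gRPC"
# nomic-embed-text-v1.5 expects task prefixes on queries and documents.
# normalize_embeddings=True makes the dot product below a true cosine similarity.
query_embedding = dense_model.encode("search_query: " + query, normalize_embeddings=True)
document_embeddings = dense_model.encode(
    ["search_document: " + doc for doc in documents], normalize_embeddings=True
)

# Dense scores (cosine similarity, since both sides are unit-normalized)
dense_scores = np.dot(document_embeddings, query_embedding)
dense_ranks = np.argsort(-dense_scores)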

# BM25 scores
query_tokens = query.lower().split()
bm25_scores = bm25.get_scores(query_tokens)
bm25_ranks = np.argsort(-bm25_scores)

# RRF fusion
rrf_scores = {}
for i, doc_idx in enumerate(dense_ranks):
    rrf_scores[doc_idx] = rrf_scores.get(doc_idx, 0) + 1.0 / (60 + i + 1)
for i, doc_idx in enumerate(bm25_ranks):
    rrf_scores[doc_idx] = rrf_scores.get(doc_idx, 0) + 1.0 / (60 + i + 1)

# Top-50 candidates for re-ranking
candidates = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)[:50]
candidate_docs = [documents[idx] for idx, _ in candidates]

# Re-rank with cross-encoder
rerank_scores = reranker.predict([[query, doc] for doc in candidate_docs])
top_5 = sorted(zip(candidate_docs, rerank_scores), key=lambda x: x[1], reverse=True)[:5]

print(f"Top 5 results for '{query}':")
for doc, score in top_5:
    print(f"  [{score:.3f}] {doc[:60]}...")

This pipeline handles acronyms (BM25 catches JSON-RPC), semantic similarity (dense catches gRPC vs RPC), and produces a ranked list of the best candidates.

Chunking Strategies That Survive Hybrid Retrieval

Your chunk size matters. Too small, and you lose context. Too large, and you dilute relevance.

Sentence-level chunking:

Fine-grained. Great for dense search (short semantic units).
Tricky for BM25 (too few tokens per chunk to match long queries).
Use: short, factual docs (FAQs, release notes).

Paragraph-level chunking (128 to 512 tokens):

Sweet spot for hybrid. BM25 has enough words to match, dense still captures semantics.
Standard choice for most RAG systems.

Parent-child chunking:

Small chunks (128 tokens) linked to parent paragraphs or sections.
Retrieve on small chunks, return larger parent context to LLM.
Best for hierarchical docs (manuals, specs, API docs).
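A minimal sketch of the parent-child idea, assuming whitespace-separated words as a rough stand-in for tokens and a plain dict of section texts as the parent store. The function names and the sample sections are hypothetical.

```python
def build_parent_child_index(sections, child_size=128):
    """Split each parent section into child chunks that remember their parent."""
    children = []
    for parent_id, text in sections.items():
        tokens = text.split()  # assumption: words approximate tokens
        for start in range(0, len(tokens), child_size):
            children.append({
                "parent_id": parent_id,
                "text": " ".join(tokens[start:start + child_size]),
            })
    return children


def expand_to_parents(matched_children, sections):
    """Return each matched child's parent text once, preserving match order."""
    seen, parents = set(), []
    for child in matched_children:
        if child["parent_id"] not in seen:
            seen.add(child["parent_id"])
            parents.append(sections[child["parent_id"]])
    return parents


sections = {
    "install": " ".join(f"install{i}" for i in range(300)),
    "config": " ".join(f"config{i}" for i in range(100)),
}
children = build_parent_child_index(sections)

# Pretend retrieval matched two children of "install" and one of "config".
matched = [children[1], children[0], children[3]]
context = expand_to_parents(matched, sections)
print(len(children), len(context))
```

Deduplicating by parent matters: several small chunks from the same section often match together, and sending the same parent to the LLM twice just wastes context.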

Sliding window:

Chunks with overlap (e.g., chunk 1 tokens 0 to 200, chunk 2 tokens 100 to 300).
Avoids boundary artifacts where important context gets split.
Adds storage cost (more chunks), but worth it for dense search quality.

Example: if chunking a 10,000-word article, use sliding windows of 512 tokens with 128-token overlap. Hybrid search will find the right window; re-ranking will promote the most relevant.
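Here is one way the sliding window could look in code, using the 512/128 numbers from the example. It assumes the text is already tokenized into a list; the placeholder tokens and the function name are made up for the sketch.

```python
def sliding_window_chunks(tokens, window=512, overlap=128):
    """Split a token list into overlapping windows."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    stride = window - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


tokens = [f"tok{i}" for i in range(1000)]  # placeholder tokens
chunks = sliding_window_chunks(tokens)

print(len(chunks))               # number of windows
print([len(c) for c in chunks])  # the last window may be shorter
```

With a stride of 384 tokens, storage grows by roughly a third compared with non-overlapping 512-token chunks, which is the cost mentioned above.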

Metadata Filtering and Query Rewriting

RAG isn’t just retrieval + re-rank. Smart systems add:

Metadata filtering:

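Most vector DBs (Qdrant, Weaviate, Chroma, Pinecone) support filtering by metadata at retrieval time. The exact filter syntax differs between databases and client libraries, so treat this as a generic sketch and check your client’s docs: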

results = db.similarity_search(
    query,
    k=50,
    filter={"author": "sumguy", "year": {"$gte": 2024}}
)

Use this to pre-filter before dense + BM25 kicks in. Reduces retrieval noise without running a query rewrite.
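To show what pre-filtering means semantically, here is a small pure-Python sketch that applies the same kind of filter to in-memory documents before any scoring happens. It mimics the operator style of the snippet above but is not any database’s API; the `matches` helper, the supported operators, and the sample documents are all assumptions.

```python
import operator

# Comparison operators this sketch understands (an illustrative subset).
OPERATORS = {
    "$gte": operator.ge,
    "$gt": operator.gt,
    "$lte": operator.le,
    "$lt": operator.lt,
    "$ne": operator.ne,
}


def matches(metadata, conditions):
    """True if metadata satisfies every condition (conditions are ANDed)."""
    for field, expected in conditions.items():
        if field not in metadata:
            return False
        value = metadata[field]
        if isinstance(expected, dict):
            # e.g. {"$gte": 2024}: every operator in the dict must hold
            if not all(OPERATORS[op](value, operand) for op, operand in expected.items()):
                return False
        elif value != expected:
            return False
    return True


docs = [
    {"text": "Hybrid search notes", "metadata": {"author": "sumguy", "year": 2025}},
    {"text": "Old BM25 notes", "metadata": {"author": "sumguy", "year": 2021}},
    {"text": "Guest post", "metadata": {"author": "someone_else", "year": 2025}},
]

conditions = {"author": "sumguy", "year": {"$gte": 2024}}
allowed = [d for d in docs if matches(d["metadata"], conditions)]
print([d["text"] for d in allowed])
```

Only the documents in `allowed` would then go on to dense and BM25 scoring, which is what keeps the filter from competing with relevance ranking.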

Query rewriting (HyDE):

Hypothetical Document Embeddings: instead of embedding the user’s query directly, generate a fake document that would answer it, then embed that:

hypothesis_prompt = f"""
Generate a document that would answer this query: "{query}"
"""
fake_doc = llm.generate(hypothesis_prompt)
embedding = dense_model.encode(fake_doc)

The fake document is often more “embeddable” than the terse query. Works well for vague queries like “how do I set up a homelab?”

Evaluation: Hit@k and NDCG

How do you know if your hybrid pipeline is better than pure dense?

Hit@k (hit rate, equal to Recall@k when each query has exactly one relevant doc):

If the ground-truth document appears in the top k results, Hit = 1, else 0.
Average across your test queries.
Simple, and it tells you coverage. But it doesn’t reward ranking: position 1 counts the same as position k, as long as the doc is in there.
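A short sketch of Hit@k over a small test set. The data layout (a list of retrieved IDs paired with a set of relevant IDs per query) and the sample IDs are assumptions made for the example.

```python
def hit_at_k(retrieved_ids, relevant_ids, k):
    """1.0 if any relevant document appears in the top k results, else 0.0."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in retrieved_ids[:k]) else 0.0


def mean_hit_at_k(runs, k):
    """Average Hit@k over (retrieved_ids, relevant_ids) pairs, one per query."""
    return sum(hit_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)


# Three test queries: ranked results plus the ground-truth relevant IDs.
runs = [
    (["d3", "d1", "d7"], {"d1"}),  # relevant doc at position 2
    (["d2", "d9", "d4"], {"d4"}),  # relevant doc at position 3
    (["d5", "d6", "d8"], {"d0"}),  # relevant doc never retrieved
]

for k in (1, 2, 3):
    print(f"Hit@{k} = {mean_hit_at_k(runs, k):.2f}")
```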

NDCG (Normalized Discounted Cumulative Gain):

Rewards ranking. A relevant doc at position 1 is worth more than at position 10.
Commonly reported as NDCG@10, NDCG@5.
Harder to compute, but more realistic. If you only use top 5 for your LLM, NDCG@5 matters more than Hit@50.
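For reference, a compact NDCG@k sketch. It uses the linear-gain form (relevance divided by log2 of rank + 1); some tools use an exponential gain instead, so numbers may not match other implementations exactly. The relevance dict and document IDs are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance grades."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(retrieved_ids, relevance_by_id, k):
    """DCG of the actual ranking divided by the DCG of the ideal ranking."""
    gains = [relevance_by_id.get(doc_id, 0) for doc_id in retrieved_ids]
    ideal = sorted(relevance_by_id.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


relevance = {"d1": 1}  # one relevant document for this query

print(ndcg_at_k(["d1", "d2", "d3"], relevance, k=5))  # relevant doc ranked first
print(ndcg_at_k(["d2", "d1", "d3"], relevance, k=5))  # same doc, one position lower
```

Both rankings score identically on Hit@5, but NDCG@5 drops when the relevant document slips from position 1 to position 2, which is exactly the ranking sensitivity Hit@k lacks.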

For self-hosted RAG, build a test set of 50 to 100 queries with relevant docs marked. Run dense-only, BM25-only, and hybrid pipelines through it. Check NDCG@5 and Hit@20. Hybrid usually wins on both.

When to Abandon Vector Databases Entirely

Here’s the uncomfortable truth: not every RAG problem needs a vector database.

If your use case is:

Mostly exact-match, keyword-heavy queries (internal docs, logs, configs)
High-cardinality filtering (thousands of unique dates, SKUs, categories)
Sub-100ms latency requirements
Small corpus (< 1M documents)

…you’re better off with a full-text search engine like Meilisearch or Typesense. These are purpose-built for keyword + filtering, with sub-100ms latency and no embedding overhead.

Use a vector DB when you need:

Semantic fuzzy matching (documents about the same topic even if words differ)
Cross-language search (map query and docs to the same embedding space)
Billion-scale corpora where you still need semantic matching (approximate nearest-neighbor indexes are built for that scale)

Hybrid retrieval (dense + BM25) bridges the gap. You get both worlds, but if you’re not using the dense part meaningfully, bail and use Meilisearch.

What to Add When

Starting a RAG system? Here’s the roadmap:

Phase 1: Paragraph-level chunking + dense search

Simple, gets you 70% of the way.
Handles semantic queries well.

Phase 2: Add BM25 + RRF fusion

Fixes acronyms, IDs, exact terms.
Probably a 15 to 20% improvement on NDCG as a rough ballpark; the real number depends on your corpus and queries.
No API calls, no GPU requirement.

Phase 3: Cross-encoder re-ranking

Run BGE-reranker-v2 locally.
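Budget for the latency: around 50ms to re-rank the top 100 candidates is realistic on a GPU, but at the CPU throughput quoted earlier (50-200 pairs/sec) the same 100 pairs take 0.5 to 2 seconds.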
Another 10% quality bump.

Phase 4: Query rewriting (HyDE) + metadata filtering

Only if you have structured metadata or vague user queries.
Adds complexity; test first.

Phase 5: Swap for Meilisearch if keyword-heavy

If your Phase 2 evaluation shows BM25-only retrieval consistently beating dense-only (compare NDCG or Hit@k, not raw scores, which live on different scales), you’ve got a keyword problem. Use a search engine.

Most production systems live at Phase 3. Phase 4 is optimization theater. Phase 5 is a pivot, not an addition.

Start simple. Measure with NDCG@5. Add the next piece only if it moves the needle.

The Honest Reality

Dense embeddings are amazing. But they’re not magic, and they’re not sufficient on their own. Your RAG system needs:

Hybrid retrieval (dense + BM25) to catch both semantic and exact matches
Smart chunking (paragraph or parent-child) to give both methods enough signal
Re-ranking (cross-encoder) to promote the genuinely best candidates
Metadata filtering to reduce noise before retrieval even starts

If you skip hybrid and rely on pure dense, you’ll spend weeks debugging why your system can’t find SN-48239-X when the doc contains that exact string.

Do the hybrid thing. It’s boring, it works, and it’ll save you a 2 AM debugging session.

Now go make your RAG system actually useful.
