Skip to main content

Command Palette

Search for a command to run...

Hybrid Search: BM25 + Vectors Beats Either Alone

Dense retrieval misses keywords. Sparse retrieval misses semantics. Real RAG uses both, fuses the results, and ships a product that finds your product names, error codes, and edge cases.

Updated
11 min read
Hybrid Search: BM25 + Vectors Beats Either Alone

You've chunked your documents well. You've embedded them with a frontier model. You've stored them in pgvector with a clean HNSW index. Your eval set passes. And then a user searches for ERR_BILL_4042 — your company's specific billing error code — and gets back three irrelevant pages about invoices because the vector model has never seen that token, can't embed it meaningfully, and is happily returning the closest generic match.

Welcome to the gap between dense retrieval and sparse retrieval. This is the gap that hybrid search closes.

Dense retrieval — what we've been doing so far, with embeddings — is great at semantic meaning. It understands that "I forgot my login" and "password reset" are the same question. It is bad at exact keywords it wasn't trained on: product names, SKUs, error codes, proper nouns, numbers, acronyms your organisation made up last week.

Sparse retrieval — the old-school keyword approach that Elasticsearch and Lucene built a decade on — is the mirror image. It's excellent at exact matches. It will find ERR_BILL_4042 on the first try. It is bad at semantic matches. Ask it "I forgot my login" and it will return documents about the word "login," not documents about password reset.

Hybrid search is the idea that you run both and combine the results. The combination catches what either alone would miss, and it's one of the highest-ROI upgrades you can make to a RAG system. This post is the shortest practical explanation: what BM25 actually is, how to run it alongside your dense search, and how to fuse the two result lists without inventing a research paper.


BM25 in one paragraph

BM25 is a ranking algorithm that scores documents based on how often the query's words appear in them, with two adjustments: common words count for less than rare words, and long documents get a small penalty so that a 10-page document doesn't beat a 1-page document just by containing more copies of the query word. That's it. It's been around since the 1990s. It has no neural network in it, no training, no model, no tokenizer beyond simple word-splitting.

Given a query like ERR_BILL_4042 invoice duplicate, BM25 computes, for each document, a sum of per-word contributions. A document containing ERR_BILL_4042 gets a huge boost because that word is extremely rare. A document containing invoice gets a small boost. A document containing duplicate gets a medium boost. The final score is the sum. Top-k documents by score are your results.

The point is not to understand the math (go read the Wikipedia entry in five minutes if you want). The point is that BM25 complements embeddings in a specific, reliable way: it catches exact tokens that embeddings smooth over.

Two retrieval paths, one fusion step, one final ranked list. Let's build it.


Running BM25 alongside your vectors

The easiest way, if you're already on Postgres: use the built-in full-text search. Postgres has had tsvector/tsquery for years, and it implements a variant of BM25 under the hood. You add a generated column, an index, and a second query that runs alongside your pgvector query.

-- On your existing docs table (see b11-vector-store-or-not)
ALTER TABLE docs
  ADD COLUMN tsv tsvector
  GENERATED ALWAYS AS (to_tsvector('english', text)) STORED;

CREATE INDEX docs_tsv_idx ON docs USING gin(tsv);

Now you can run:

SELECT id, text, ts_rank_cd(tsv, query) AS rank
FROM docs, plainto_tsquery('english', 'password reset forgot') query
WHERE tsv @@ query
ORDER BY rank DESC
LIMIT 20;

That's your sparse top-20. Run it in parallel with your dense top-20 from pgvector. Both queries hit the same Postgres instance. No new infrastructure. No new data store.

The application code looks like this:

# pip install psycopg openai numpy
import os
import numpy as np
import psycopg
from openai import OpenAI

llm = OpenAI()
conn = psycopg.connect(os.environ["DATABASE_URL"])

def embed(text: str) -> np.ndarray:
    resp = llm.embeddings.create(model="text-embedding-3-small", input=[text])
    v = np.array(resp.data[0].embedding, dtype=np.float32)
    return v / np.linalg.norm(v)

def dense_search(query: str, k: int = 20) -> list[tuple[int, str, float]]:
    q = embed(query)
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, text, 1 - (vec <=> %s::vector) AS score "
            "FROM docs ORDER BY vec <=> %s::vector LIMIT %s",
            (q.tolist(), q.tolist(), k),
        )
        return list(cur.fetchall())

def sparse_search(query: str, k: int = 20) -> list[tuple[int, str, float]]:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, text, ts_rank_cd(tsv, plainto_tsquery('english', %s)) AS score "
            "FROM docs "
            "WHERE tsv @@ plainto_tsquery('english', %s) "
            "ORDER BY score DESC LIMIT %s",
            (query, query, k),
        )
        return list(cur.fetchall())

Two functions, one vector-based, one keyword-based. Each returns an ordered list of (id, text, score). Now we need to combine them.


Reciprocal rank fusion: the simplest thing that works

There are many ways to combine two ranked lists. The most-widely-used and most-robust is reciprocal rank fusion (RRF). The whole algorithm is four lines of code. It does not require knowing the absolute scores from either method (which are on different scales and not directly comparable). It only uses the rank — position 1, position 2, etc. — which is unit-free.

The formula for a document's combined score:

combined_score(doc) = sum over all methods of: 1 / (k + rank_in_method)

Where k is a small constant (60 is the standard). If a doc appears in method A at rank 3, it contributes 1 / (60 + 3) = 0.0159. If the same doc also appears in method B at rank 8, it additionally contributes 1 / (60 + 8) = 0.0147. A doc that's high in both methods gets a sum that beats a doc that's high in only one.

In code:

def rrf_fuse(lists: list[list[tuple[int, str, float]]], k: int = 60, top: int = 5) -> list[tuple[int, str, float]]:
    scores: dict[int, float] = {}
    texts: dict[int, str] = {}
    for ranked_list in lists:
        for rank, (doc_id, text, _score) in enumerate(ranked_list):
            scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (k + rank + 1)
            texts[doc_id] = text
    fused = sorted(scores.items(), key=lambda x: -x[1])[:top]
    return [(doc_id, texts[doc_id], score) for doc_id, score in fused]

# Usage:
def hybrid_search(query: str, k: int = 5) -> list[tuple[int, str, float]]:
    dense = dense_search(query, k=20)
    sparse = sparse_search(query, k=20)
    return rrf_fuse([dense, sparse], top=k)

That's the entire hybrid-search system. Two queries, one fuse function, one final ranked list. Fewer than fifty lines of Python. Drop it into the RAG pipeline we built in B3.1, and your retrieval quality jumps meaningfully on the exact kinds of queries that were failing.


How much better is "meaningfully better"?

Honest numbers, from my own projects and from public benchmarks. Take them with error bars because the gain depends enormously on the query and corpus mix.

  • Pure generic semantic queries (no proper nouns, no codes): hybrid gives you nothing over dense. Maybe 1–2 percentage points of recall@5.
  • Mixed query traffic (some generic, some with specific tokens): hybrid gives you 5–15 percentage points of recall@5. This is where most real apps live.
  • Technical/product/code-heavy corpora (documentation, error codes, SKUs, APIs): hybrid can give you 20+ percentage points. This is where hybrid pays for itself on day one.

If your app is a general Q&A bot over generic content, you can skip hybrid and pure dense is fine. If your app is anything involving product names, internal jargon, technical identifiers, or the specific things users copy-paste from their screens, you want hybrid, and you want it before you ship.


Rerankers: the optional upgrade after hybrid

Once you have hybrid retrieval returning a good top-20 and RRF fusing them into a good top-5, the next optional improvement is a reranker. A reranker is a second-pass model that takes the query and each candidate chunk together and produces a relevance score — more accurate than either dense or sparse alone, but expensive enough that you only run it on a small candidate set (the top 20 or 50, not the whole corpus).

# Pseudocode for reranker flow
def rerank(query: str, candidates: list[tuple[int, str, float]], k: int = 5):
    pairs = [(query, c[1]) for c in candidates]
    scores = reranker_model.score(pairs)  # one call, batch of pairs
    scored = sorted(zip(candidates, scores), key=lambda x: -x[1])
    return [c for c, _ in scored[:k]]

def hybrid_plus_reranker(query: str, k: int = 5):
    dense = dense_search(query, k=20)
    sparse = sparse_search(query, k=20)
    fused = rrf_fuse([dense, sparse], top=20)
    return rerank(query, fused, k=k)

Rerankers (Cohere Rerank, Voyage Rerank, bge-reranker on HuggingFace) are typically a separate API call or a local model. They add latency (50–200 ms) and cost. They improve recall@k by another 5–15 percentage points on top of hybrid — meaningful on hard retrieval problems, overkill on easy ones.

When to add a reranker: after hybrid retrieval is in place, your eval still shows room for improvement, and you have latency budget to spend. Not before. Don't stack layers of complexity on a pipeline you haven't measured.


Tuning hybrid: a short list of real knobs

  • Retrieve more, fuse harder. Take top-20 from each method, fuse, keep top-5. Wider candidate sets give RRF more to work with.
  • Try k=10 and k=60 in the RRF formula. Smaller k makes the top-ranked results count for more, which helps when one method is clearly better than the other. 60 is the default; 10 is worth trying if dense always beats sparse (or vice versa) on your data.
  • Weight the methods if you find one is consistently better. RRF can be generalised to w_method / (k + rank), giving one method more influence. Use sparingly — usually the plain version is fine.
  • Filter candidates on metadata before fusion. If you know the user is looking within a specific document set, apply a WHERE document_id IN (...) to both queries. This is another advantage of co-locating vectors and metadata in Postgres — the filter is free.
  • Use query expansion for short queries. For a 2–3 word query, expand to a fuller question before searching: "password reset" → "How do I reset my password? I forgot my login and can't access my account." Run both searches on the expanded form. This helps short queries more than any tuning inside the retrieval system.

Admit what breaks

  • Hybrid is not strictly better. On purely semantic queries with no specific tokens, pure dense is often slightly better than hybrid (the BM25 results are noise). Measure, don't assume.
  • Language support for sparse search varies. Postgres tsvector supports many languages but not all with equal quality. For non-English, check that your stemmer and stopword list are right for the language.
  • BM25 is sensitive to preprocessing. Stemming, stopword removal, and tokenization choices can change your results by 10 points. Use the defaults until you're measuring with an eval set.
  • RRF is order-dependent if your k is tiny. With k=10 and one method returning a doc at rank 1 while the other returns it at rank 5, the fused score is heavily weighted toward the first method. The default k=60 is forgiving; small k is not.
  • Hybrid doubles the latency floor of retrieval. Two queries in parallel (if your DB supports concurrent connections) add maybe 20% to latency. Two queries serially double it. Use a thread pool or asyncpg to run them concurrently.
  • Metadata filters must be consistent across both queries. If dense retrieval filters by author_id = 42 and sparse retrieval doesn't, you get phantom results in the fused list. Always apply the same filters to both branches.

What just changed in your code

  • Add BM25 alongside dense retrieval. If you're on Postgres, this is two DDL statements and one query. If you're on SQLite, use FTS5. If you're on a vector DB, check whether it has a built-in sparse index (most now do) or stand up Elasticsearch beside it.
  • Fuse results with reciprocal rank fusion. Thirty lines of Python. Do not invent your own fusion formula.
  • Always run both queries concurrently, not sequentially. Async or threads.
  • Measure recall@5 with and without hybrid on your eval set. On queries with proper nouns, exact codes, or technical identifiers, hybrid should win by a clear margin. On generic queries, it should win narrowly or tie.
  • Apply the same filters to both branches. Metadata conditions, tenant boundaries, date ranges — identical on both sides.
  • Hold off on rerankers until hybrid has proven itself and your eval still has room.

Next post, B3.5, is the honest accounting: even with perfect chunking, hybrid search, and a well-tuned model, RAG systems fail in specific, repeatable ways that you will see in production. I'll walk through the seven failure modes I've actually hit on shipped products, with how to detect each one from logs, and how to address them without blaming the model.


Course navigation

⬅️ Previous📍 You are hereNext ➡️
⬅️ Previous
B3.3 · Chunking Is the Whole Game
B3.4 of B6.4Next ➡️
B3.5 · RAG Failure Modes

📚 AI for Builders · Course Home — 28 posts, six modules.


Cover photo via Unsplash. This post is part of the AI for Builders series.

More from this blog

Learn AI - Zero to Hero

111 posts