Title: Hybrid search and reranking: a deeper look at RAG
Post by: tim on Apr 20, 2026, 10:42 AM

Many of us are familiar with the retrieval-augmented generation (RAG) pattern for building agentic AI applications – like digital concierges, frontline support chatbots and agents that can help with basic self-service troubleshooting.

At a high level, the flow for RAG is fairly clear – the user's prompt is augmented with some relevant contextual information from a knowledge base, and the large language model (LLM) provides the user with a response on the basis of the information provided, instead of from the "baked in" information that it was originally trained on.

In this article, we'll roll up our sleeves and dive a little deeper to try to get a better grasp of how typical production-grade RAG systems actually work. To understand what's really going on in the information retrieval process, we need to dig into hybrid search and reranking.

Embeddings and vector search

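Before we get to hybrid search and reranking, let's establish some baseline RAG understanding. Vector databases essentially provide a geometry-based search index that can help to find relevant content – or knowledge – in our knowledge base. In outline, the source data is split into text chunks, each chunk is encoded as a vector embedding and stored in the index, and a search returns the chunks whose vectors lie closest to the vector of the query.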


The results of the vector search will be text chunks from the raw source data, which will be sent to the LLM along with the user's prompt. The encoded vector embeddings help to find the right information in the knowledge base, but LLMs can't interpret those vector embeddings directly.


It's important to note the query-side encoding step. When you run the search, the user's prompt has to be converted to a vector too, at runtime. This is so that the vector search engine can compare the prompt's vector with the vectors in the database and find the nearest matches.
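As a rough illustration of "finding the nearest matches", here is a minimal brute-force sketch in Python. The three-dimensional vectors and chunk texts are made up for the example, and cosine similarity is assumed as the measure; a real vector database would use an approximate index rather than scanning every row.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_chunks(query_vector, index, top_k=2):
    """Return the top_k chunk texts whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]

# Hypothetical embeddings; real models produce hundreds of dimensions.
index = [
    ("How to reset your password", [0.9, 0.1, 0.0]),
    ("Shipping times for Europe", [0.0, 0.2, 0.9]),
    ("Recovering a locked account", [0.8, 0.3, 0.1]),
]
query_vector = [1.0, 0.2, 0.0]  # stands in for the embedded user prompt

print(nearest_chunks(query_vector, index))
```

Note that the function returns the chunk text, not the vectors: the vectors are only used to locate the right chunks.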

In order to get meaningful results, you need to use the same embedding model as you used when you created the vector index in the database. This is because each model creates its own unique "map" of meaning (often called a vector space). Using a different embedding model is a silent killer – the application will run without errors, but the information retrieved will be completely irrelevant.
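One way to stop a model mismatch from staying silent is to record the embedding model's name alongside the index and check it at query time. The sketch below assumes a hypothetical `VectorIndex` metadata record and placeholder model names; it is not part of any particular database's API.

```python
from dataclasses import dataclass

@dataclass
class VectorIndex:
    embedding_model: str  # name of the model used when the index was built
    dimensions: int       # vector length produced by that model

def check_query_model(index, query_model, query_vector):
    """Raise instead of silently searching with vectors from a different model."""
    if query_model != index.embedding_model:
        raise ValueError(
            f"index was built with {index.embedding_model!r}, "
            f"but the query was embedded with {query_model!r}"
        )
    # A matching length alone proves nothing: two different models can share
    # a dimension count, which is why the name check comes first.
    if len(query_vector) != index.dimensions:
        raise ValueError("query vector length does not match the index")

index_info = VectorIndex(embedding_model="model-a", dimensions=3)  # placeholder name
check_query_model(index_info, "model-a", [0.1, 0.2, 0.3])          # passes quietly
```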

So that's vector search. But, to make a RAG system production-ready, you typically need to move beyond "naive" vector search to a multi-stage retrieval process.

Hybrid search

Hybrid search runs both vector search and full text search algorithms in parallel and merges their results.


These two methods use different scoring systems (a bounded similarity for vectors, often reported on a 0.0 to 1.0 scale, vs. unbounded scores for BM25), so their raw scores can't be compared directly. Instead, they're combined using an algorithm like reciprocal rank fusion (RRF). RRF looks at the position of a document in both lists rather than the raw score, giving a higher final rank to documents that appear near the top of both.

Calculating the combined score

Two approaches are commonly used to calculate the combined ranking score.

Reciprocal rank fusion (RRF)

RRF ignores the raw scores entirely and only looks at the rank (1st, 2nd, 3rd, etc.) of the document in each list. Each document scores 1 / (k + rank) in every list it appears in, and those scores are summed. Here k is a smoothing factor, typically set to the whole number `60`. It prevents a single very high rank in one search from completely drowning out a moderate rank in the other.
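A minimal Python sketch of RRF, assuming each search returns a list of document IDs ordered best-first. The document IDs are made up for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse best-first ID lists; each appearance adds 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical vector search results
keyword_hits = ["doc_b", "doc_d", "doc_a"]  # hypothetical keyword search results

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Here `doc_b` (2nd and 1st) edges out `doc_a` (1st and 3rd), and both beat the documents that appear in only one list.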

Relative score fusion (weighted average)

This method keeps the raw scores but normalizes them first.
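As a sketch of what that can look like, the Python below assumes min-max normalization and a single weight `alpha` for the vector side; both are illustrative choices, and the scores are made up. It also assumes that higher is better in both inputs, so a vector distance would need converting to a similarity first.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: raw_score} mapping to the 0.0-1.0 range."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}

def weighted_fusion(vector_scores, keyword_scores, alpha=0.5):
    """Combine normalized scores: alpha * vector + (1 - alpha) * keyword."""
    v = min_max_normalize(vector_scores)
    kw = min_max_normalize(keyword_scores)
    combined = {
        doc_id: alpha * v.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in set(v) | set(kw)
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

vector_scores = {"doc_a": 0.92, "doc_b": 0.85, "doc_c": 0.60}  # similarities
keyword_scores = {"doc_b": 14.2, "doc_d": 9.1, "doc_a": 3.4}   # unbounded scores

ranked = weighted_fusion(vector_scores, keyword_scores, alpha=0.7)
print([doc_id for doc_id, _ in ranked])
```

Raising `alpha` favours the vector search and lowering it favours the keyword search.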


RRF doesn't need much tweaking and is pretty robust, while weighted average can give more surgical control if you know your keyword search is consistently more reliable than your vector search (or vice-versa).

Hybrid search in the database

In modern RAG stacks, the fusion of full text keyword search (BM25) and vector search is increasingly moving into the database layer to reduce latency and "glue code" – middleware logic which might otherwise increase the overall management complexity of the solution.

PostgreSQL hybrid search with RRF

Let's see what hybrid search might look like in PostgreSQL. This example assumes you have a table called documents with a content column (for text search) and an embedding column (for vector search).

WITH
-- 1. Semantic search: rank by vector similarity (smallest cosine distance first).
--    '[0.1, -0.2, ...]' is a placeholder for the query embedding.
semantic_search AS (
    SELECT id,
           ROW_NUMBER() OVER (ORDER BY embedding <=> '[0.1, -0.2, ...]'::vector) AS rank
    FROM documents
    ORDER BY embedding <=> '[0.1, -0.2, ...]'::vector
    LIMIT 50
),
-- 2. Keyword search: rank by text relevance (ts_rank_cd here, PostgreSQL's
--    built-in ranking function, used in place of BM25)
keyword_search AS (
    SELECT id,
           ROW_NUMBER() OVER (ORDER BY ts_rank_cd(to_tsvector('english', content), query) DESC) AS rank
    FROM documents, plainto_tsquery('english', 'your search terms') query
    WHERE to_tsvector('english', content) @@ query
    -- Order before limiting, so the 50 rows kept are the 50 best matches
    ORDER BY ts_rank_cd(to_tsvector('english', content), query) DESC
    LIMIT 50
),
-- 3. Fusion: combine ranks using the RRF formula 1 / (k + rank), with k = 60
combined_results AS (
    SELECT id, 1.0 / (60 + s.rank) AS score FROM semantic_search s
    UNION ALL
    SELECT id, 1.0 / (60 + k.rank) AS score FROM keyword_search k
)
-- 4. Final output: sum scores for items found in both/either list
SELECT d.id, d.content, SUM(c.score) AS final_rrf_score
FROM combined_results c
JOIN documents d ON c.id = d.id
GROUP BY d.id, d.content
ORDER BY final_rrf_score DESC
LIMIT 10;

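The benefits of running this in a stored procedure inside the database are lower latency and less glue code: the ranking happens at the source, so the application doesn't have to fetch two large result lists and merge them itself.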


You might be wondering about some of the unusual operators and datatypes used in the above code listing. The `<=>` operator is pgvector's cosine distance operator, used to calculate the distance between two vectors. The `[0.1, -0.2, ...]` is an example of how a vector is represented. On the other hand, `@@` is the operator used by the full text search engine to check for a match between the search query and a field.

Reranking

A reranker is a second-stage AI model that improves the accuracy of the contextual information that gets sent to the LLM along with the user's prompt. While vector search is fast but fuzzy, a reranker is slower but more accurate.

For reranking, you don't need to encode the data into embeddings. Instead, you send the raw text of the query and the raw text of the result chunks directly to the reranker model.

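The workflow is: the hybrid search returns a shortlist of candidate chunks, the reranker scores each chunk against the query, and only the highest-scoring chunks are sent to the LLM.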

How reranking works

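A reranking model is typically a cross-encoder. Unlike the embedding models (bi-encoders) used for the initial search, a cross-encoder does not look at the query and document separately. Instead, it takes the query and a candidate document together as a single input and outputs one relevance score for that pair.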
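The shape of that second stage can be sketched in Python. The scoring function below is a deliberately crude stand-in (word overlap) so that the example runs without a model; in a real system, `score_pair` would call a cross-encoder. All names here are hypothetical.

```python
def toy_pair_score(query, chunk):
    """Stand-in for a cross-encoder: fraction of query words found in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & chunk_words) / len(query_words)

def rerank(query, chunks, score_pair, top_n=2):
    """Score every (query, chunk) pair on raw text and keep the best top_n."""
    scored = [(score_pair(query, chunk), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]

# Hypothetical shortlist returned by the first-stage search.
candidates = [
    "Shipping times for Europe",
    "How to reset your password",
    "Password rules for new accounts",
]

print(rerank("reset my password", candidates, toy_pair_score))
```

The scorer is called once per candidate, which is why reranking is applied only to a shortlist rather than to the whole knowledge base.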

Reranking in application middleware

Reranking is almost always handled in the application layer or via a dedicated inference endpoint. Database stored procedures are great at maths (RRF), but they aren't ideal for running heavy deep-learning models like the cross-encoder models typically used for reranking.

Wrap-up

So we've examined in more detail how modern, production-grade RAG architectures use hybrid search and reranking to get the best results (and user experience).


Hybrid search helps make sure that you don't miss anything, and reranking helps to make sure the best stuff is at the very top.

The complete production RAG flow typically looks like this:

| Task | Location | Why? |
| --- | --- | --- |
| Calculate embeddings | Middleware / API | Requires a GPU-heavy model to compute embeddings. |
| Store embeddings | Database | Embeddings need to be stored for later search and retrieval. |
| Vector/keyword searches | Database | Needs direct access to the database indexes. |
| Hybrid search | Database stored procedure | Faster to rank at the source than fetching two huge lists to the app. |
| Reranking | Middleware / API | Requires a GPU-heavy cross-encoder model. |
| LLM generation | Middleware / API | The final step that uses the retrieved context. |
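The division of labour in the table can be sketched as a small Python orchestration function. Every stage is passed in as a callable, and the stubs below exist only so the sketch runs on its own; the function names, the shortlist size of 10 and the final cut of 3 are all assumptions for illustration.

```python
def answer(prompt, embed, hybrid_search, rerank, generate, candidates=10, top_n=3):
    """Run the RAG stages in order; each callable is supplied by the caller."""
    query_vector = embed(prompt)                              # middleware / API
    chunks = hybrid_search(prompt, query_vector, candidates)  # database stored procedure
    best_chunks = rerank(prompt, chunks, top_n)               # middleware / API
    return generate(prompt, best_chunks)                      # middleware / API

# Stub stages; real ones would call an embedding model, a database,
# a cross-encoder and an LLM respectively.
def fake_embed(prompt):
    return [float(len(prompt))]

def fake_hybrid_search(prompt, query_vector, limit):
    return ["chunk one", "chunk two", "chunk three", "chunk four"][:limit]

def fake_rerank(prompt, chunks, top_n):
    return sorted(chunks)[:top_n]

def fake_generate(prompt, chunks):
    return f"{prompt} -> answered from {len(chunks)} chunks"

print(answer("How do I reset my password?",
             fake_embed, fake_hybrid_search, fake_rerank, fake_generate))
```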

Learn more about Canonical's OpenSearch (https://canonical.com/data/opensearch) and PostgreSQL (https://canonical.com/data/postgresql) solutions, or get in touch (https://ubuntu.com/ai/contact-us).

Further reading

Explore our "Guide to RAG" (https://ubuntu.com/engage/rag-explained) to build and deploy a RAG workflow on public clouds with open source tools.

Categories: AI, RAG, RAG Explained
Source: https://ubuntu.com//blog/hybrid-search-and-reranking-a-deeper-look-at-rag Apr 20, 2026, 10:12 AM