Advanced RAG with Vector DB and Retrievers


Developer | Adept in software development | Building expertise in machine learning and deep learning

Advanced Retrievers for RAG

A RAG pipeline has two separate phases.

Phase 1: Ingestion (offline, preprocessing): 1) chunking, 2) embedding, 3) indexing and storing.

Phase 2: Retrieval (online, at query time). This is where the retriever runs:

take the user query → embed the query → search the vector database → return the most relevant chunks
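The two phases can be sketched end to end in a few lines. This is a toy illustration, not a real pipeline: `embed` is a hypothetical bag-of-words stand-in for an embedding model, and the "vector database" is a plain Python list.

```python
import math
from collections import Counter

def embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Phase 1 (ingestion): chunk, embed, and store.
chunks = [
    "FAISS is a library for similarity search",
    "Chroma stores embeddings together with metadata",
    "BM25 ranks documents by keyword matches",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Phase 2 (retrieval): embed the query, search the index, return the top-k chunks.
def retrieve(query, k=2):
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

top_chunks = retrieve("which library does similarity search", k=1)
print(top_chunks)
```

In a real system the index is built once, offline, and only `retrieve` runs per query.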

A retriever is an interface designed to return documents based on an unstructured query. Unlike a vector store, which stores and retrieves documents, a retriever's primary function is to find and return relevant documents. While vector stores can serve as the backbone of a retriever, there are various other types of retrievers that can be used as well.

A retriever may depend on a vector database and an LLM. It uses these components but does not create them.

Explore advanced retrievers in LangChain

In the lecture, three types of retrievers are mentioned:

  1. Vector Store Based Retriever: Retrieves documents from a vector database using embeddings.

  2. Multi-Query Retriever: Uses an LLM to create different versions of a query, generating a richer set of retrieved documents. The retrieval step itself works as in 1.

  3. Self-Query Retriever: Converts a query into two components: a semantic string and a metadata filter.

Vector store-based retriever

Self-Query Retriever

A Self‑Query Retriever is just a smart wrapper around a VectorStoreRetriever.

Its job is:

  1. Use an LLM to interpret the user query
    → Extract filters, metadata conditions, semantic meaning

  2. Then delegate the actual retrieval to a vector database
    → via similarity search + metadata filtering

So the architecture is:

SelfQueryRetriever
     ↓ uses
VectorStoreRetriever
     ↓ uses
Vector Database (FAISS, Chroma, Pinecone, etc.)

The self‑query retriever can’t work without a vector DB, because the whole retrieval step depends on stored embeddings.

A Self‑Query Retriever uses an LLM to extract meaning and filters from the user query before doing the actual retrieval.

So the flow looks like this:

User Query
     ↓
LLM (interpret / extract filters)
     ↓
Self‑Query Retriever
     ↓
Vector Database (semantic search + metadata filters)
     ↓
Relevant Chunks
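The flow above can be sketched without any library. Everything here is an assumption made for illustration: `interpret` is a hypothetical regex-based stand-in for the LLM step, the movie records are made up, and "semantic" ranking is reduced to word overlap.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class StructuredQuery:
    query: str                   # text to search semantically
    min_rating: Optional[float]  # metadata filter (None means no filter)

def interpret(user_query):
    """Hypothetical stand-in for the LLM: pull a rating filter out of the text."""
    match = re.search(r"rated higher than (\d+(?:\.\d+)?)", user_query)
    if match is None:
        return StructuredQuery(query=user_query, min_rating=None)
    semantic = (user_query[:match.start()] + user_query[match.end():]).strip()
    return StructuredQuery(query=semantic, min_rating=float(match.group(1)))

movies = [  # made-up records: text plus metadata
    {"title": "Movie A", "rating": 9.0, "text": "a science fiction movie about space"},
    {"title": "Movie B", "rating": 7.2, "text": "a science fiction movie about robots"},
    {"title": "Movie C", "rating": 8.8, "text": "a romantic comedy set in Paris"},
]

def self_query_retrieve(user_query):
    sq = interpret(user_query)
    # 1) apply the metadata filter
    candidates = [m for m in movies if sq.min_rating is None or m["rating"] > sq.min_rating]
    # 2) rank the survivors "semantically" (word overlap in this toy version)
    words = set(sq.query.lower().split())
    candidates.sort(key=lambda m: len(words & set(m["text"].split())), reverse=True)
    return [m["title"] for m in candidates]

structured = interpret("science fiction movie rated higher than 8.5")
titles = self_query_retrieve("science fiction movie rated higher than 8.5")
print(structured, titles)
```

The point of the sketch is the split: one user sentence becomes a search string plus a filter, and only the filter touches metadata.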

Multi-Query Retriever

The MultiQueryRetriever can potentially overcome some limitations of distance-based retrieval, resulting in a richer and more diverse set of results.

Distance-based vector database retrieval represents queries in high-dimensional space and finds similar embedded documents based on "distance". However, retrieval results may vary with subtle changes in query wording or if the embeddings do not accurately capture the data's semantics.

The MultiQueryRetriever addresses this by using an LLM to generate multiple queries from different perspectives for a given user input query. For each query, it retrieves a set of relevant documents and then takes the unique union of these results to form a larger set of potentially relevant documents.
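A minimal sketch of the "unique union" step, under stated assumptions: `generate_queries` is a hypothetical stand-in for the LLM that rewrites the question, and `search` is a toy word-overlap retriever over a made-up corpus.

```python
def generate_queries(question):
    """Hypothetical stand-in for the LLM that rewrites the question."""
    return [question, "how do retrievers find documents", "document search in a vector store"]

corpus = {
    "d1": "retrievers find relevant documents for a query",
    "d2": "a vector store supports document search by embedding",
    "d3": "tf-idf weighs rare terms more heavily",
}

def search(query, k=1):
    """Toy retriever: rank document ids by word overlap with the query."""
    words = set(query.lower().split())
    ranked = sorted(corpus, key=lambda d: len(words & set(corpus[d].split())), reverse=True)
    return ranked[:k]

def multi_query_retrieve(question):
    seen, merged = set(), []
    for q in generate_queries(question):
        for doc_id in search(q):
            if doc_id not in seen:   # unique union, in first-seen order
                seen.add(doc_id)
                merged.append(doc_id)
    return merged

result = multi_query_retrieve("what does a retriever do")
print(result)
```

Here the original wording alone finds only `d1`; the third rewrite is what brings `d2` into the result set.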

Parent document Retriever

The ParentDocumentRetriever balances the two conflicting goals listed below by splitting and storing small chunks of data. During retrieval, it first fetches the small chunks, then looks up the parent IDs for those chunks and returns those larger documents.

When splitting documents for retrieval, there are often conflicting desires:

  1. You may want to have small documents so that their embeddings can most accurately reflect their meaning. If the documents are too long, the embeddings can lose meaning.

  2. You want to have long enough documents so that the context of each chunk is retained.
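The small-chunk search followed by a parent lookup can be sketched as below. This is a toy version: the documents are made up, the child splitter is a naive split on periods, and ranking is word overlap instead of embeddings.

```python
parents = {
    "p1": "FAISS is a similarity search library. It offers flat, IVF, and HNSW indexes.",
    "p2": "Chroma is a vector database. It stores metadata next to each embedding.",
}

# Child splitter: one sentence per child chunk, each remembering its parent id.
children = [
    {"parent_id": pid, "text": sentence.strip()}
    for pid, doc in parents.items()
    for sentence in doc.split(".")
    if sentence.strip()
]

def retrieve_parents(query, k=1):
    words = set(query.lower().split())
    # 1) search over the small chunks
    ranked = sorted(
        children,
        key=lambda c: len(words & set(c["text"].lower().split())),
        reverse=True,
    )
    # 2) map the best chunks to parent ids, 3) return the larger parent documents
    parent_ids = []
    for child in ranked[:k]:
        if child["parent_id"] not in parent_ids:
            parent_ids.append(child["parent_id"])
    return [parents[pid] for pid in parent_ids]

docs = retrieve_parents("hnsw indexes")
print(docs)
```

The match happens on one short sentence, but the caller receives the whole parent document, including the sentence that names FAISS.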

Search Method

Maximal Marginal Relevance (MMR) in vector stores is a technique used to balance the relevance and diversity of retrieved results. It selects documents that are both highly relevant to the query and minimally similar to previously selected documents. This approach helps to avoid redundancy and ensures a more comprehensive coverage of different aspects of the query.
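The selection rule can be written out directly. The sketch below is one common formulation of MMR, not the code of any particular vector store; the weighting parameter is called `lambda_mult` here as an assumption, and the vectors are made up.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return indices of k documents chosen by Maximal Marginal Relevance."""
    selected = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # similarity to the closest already-selected document (0 if none yet)
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query_vec = [1.0, 0.0]
doc_vecs = [[1.0, 0.1], [1.0, 0.12], [0.6, 0.8]]   # documents 0 and 1 are near-duplicates

picked_diverse = mmr(query_vec, doc_vecs, k=2, lambda_mult=0.3)
picked_relevance_only = mmr(query_vec, doc_vecs, k=2, lambda_mult=1.0)
print(picked_diverse, picked_relevance_only)
```

With `lambda_mult=1.0` the rule degenerates to plain similarity ranking and returns both near-duplicates; lowering it makes the second pick skip the duplicate in favor of the less similar document.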

# MMR
retriever = vectordb.as_retriever(search_type="mmr")
docs = retriever.invoke(query)
docs

# Similarity score threshold retrieval
# defines a similarity score threshold, returning only documents with a score above that threshold.
retriever = vectordb.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.4},
)
docs = retriever.invoke(query)
docs

Advanced Retrievers in LlamaIndex

VectorStoreIndex: Uses vector embeddings of document chunks to find relevant content by semantic search. It is the general-purpose choice and is common in RAG and other LLM applications.

DocumentSummaryIndex: Generates a summary for each document and filters by summary first, before retrieving content. Useful for large and diverse document sets. It can be queried in two ways: LLM-based, which is time-consuming and expensive, or embedding-based, which uses semantic similarity.

KeywordTableIndex: 1) extracts keywords from documents, 2) maps keywords to specific chunks of content, 3) enables exact keyword matching, and 4) is useful for hybrid or rule-based search.

Core and advanced retrievers - TF-IDF (Term Frequency-Inverse Document Frequency)

Understanding TF-IDF: The Foundation

Before diving into BM25, let's understand TF-IDF (Term Frequency-Inverse Document Frequency), which BM25 builds upon:

TF is term frequency within one document, and IDF measures how rare that term is across all documents.

Term Frequency (TF): Measures how often a word appears in a document

  • Example: If "neural" appears 3 times in a 100-word document, TF = 3/100 = 0.03

Inverse Document Frequency (IDF): Measures how rare a word is across all documents

  • Example: If "neural" appears in only 2 out of 1000 documents, IDF = ln(1000/2) ≈ 6.21 (using the natural logarithm)

  • Common words like "the" have low IDF; rare technical terms have high IDF

TF-IDF Score: TF × IDF

  • Highlights words that are frequent in one document but rare across the collection

  • The IDF component was introduced by Karen Spärck Jones, who pioneered the concept of term specificity
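The two worked examples above combine into a single score. A short check of the arithmetic, assuming the natural logarithm for IDF (the base that gives 6.21):

```python
import math

def tf(term_count, doc_length):
    """Term frequency: share of the document's words that are this term."""
    return term_count / doc_length

def idf(num_docs, docs_with_term):
    """Inverse document frequency with the natural logarithm."""
    return math.log(num_docs / docs_with_term)

tf_neural = tf(3, 100)                 # "neural" appears 3 times in a 100-word document
idf_neural = idf(1000, 2)              # "neural" appears in 2 of 1000 documents
tfidf_neural = tf_neural * idf_neural

print(round(tf_neural, 2), round(idf_neural, 2), round(tfidf_neural, 3))
```

Other libraries use base-10 or smoothed variants of IDF, so absolute scores differ between implementations even though the ranking idea is the same.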

Core and advanced retrievers - auto merging retriever

Summary and Highlights: Advanced Retrievers for RAG

Congratulations! You have completed this lesson. At this point in the course, you know:

  • A LangChain retriever is an interface that returns documents based on an unstructured query

  • There are several different types of LangChain retrievers

  • The vector store-based retriever retrieves documents from a vector database

  • A vector store-based retriever can be created directly from the vector store object with the retriever method by using similarity search or MMR

  • Similarity search is when the retriever accepts a query and retrieves the most similar data

  • MMR is a technique used to balance the relevance and diversity of retrieved results

  • The multi-query retriever uses an LLM to create different versions of the query, generating a richer set of retrieved documents

  • The self query retriever converts the query into two components, a string to look up semantically, and a metadata filter to accompany it

  • The parent document retriever has two text splitters: a parent splitter that splits the text into large chunks to be retrieved, and a child splitter that splits the document into small chunks to generate meaningful embeddings

  • The core LlamaIndex index types are the VectorStoreIndex, the DocumentSummaryIndex, and the KeywordTableIndex

  • The VectorStoreIndex stores vector embeddings for each document chunk, is best suited for semantic retrieval, and is commonly used in pipelines that involve large language models

  • The DocumentSummaryIndex generates and stores summaries of documents, which are used to filter documents before retrieving the full content, and is useful when working with large and diverse document sets

  • The KeywordTableIndex extracts keywords from documents and maps them to specific content chunks, and is useful in hybrid or rule-based search scenarios

  • The Vector Index Retriever uses vector embeddings to find semantically related content, and is ideal for general-purpose search and RAG pipelines

  • The BM25 Retriever is a keyword-based method for ranking documents, and it retrieves content based on exact keyword matches rather than semantic similarity

  • The Document Summary Index Retriever uses document summaries instead of the actual documents to find relevant content

  • There are two versions of the Document Summary Index Retriever, one uses LLM, and the other uses semantic similarity

  • The Auto Merging Retriever preserves context in long documents using a hierarchical structure, and uses hierarchical chunking to break documents into parent and child nodes

  • The Recursive Retriever follows relationships between nodes and uses references such as citations in academic papers or metadata links

  • The Query Fusion Retriever combines results from different retrievers using fusion strategies

  • The fusion strategies supported by the Query Fusion Retriever are Reciprocal Rank Fusion, Relative Score Fusion, and Distribution-Based Fusion
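Of the fusion strategies named in the last bullet, Reciprocal Rank Fusion is the simplest to sketch: each document scores the sum of 1 / (rank + k) over the ranked lists it appears in. The version below is a minimal illustration, not LlamaIndex's implementation; ranks start at 1 and k = 60 is a commonly used constant, both treated here as assumptions. The document ids are made up.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document ids into one ranking."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    return sorted(scores, key=scores.get, reverse=True)

vector_results = ["d1", "d2", "d3"]    # e.g. from a vector retriever
keyword_results = ["d3", "d1", "d4"]   # e.g. from a keyword retriever

fused = reciprocal_rank_fusion([vector_results, keyword_results])
print(fused)
```

Documents that appear in both lists (`d1`, `d3`) rise above documents that appear in only one, without needing the two retrievers' raw scores to be comparable.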

Build a Comprehensive RAG Application

Introduction to FAISS for RAG

FAISS is a library that runs on a single node and provides similarity search.

FAISS doesn't support metadata search. Chroma supports both similarity search and metadata filtering.

FAISS offers many different index options, while Chroma uses only HNSW (Hierarchical Navigable Small World).

Flat index (IndexFlat): compares the query embedding with the embedding of every vector in the vector store, using Euclidean (L2) distance or dot product, and returns the k nearest vectors. It is exact but slow on large collections, because FAISS must search and compare against every vector in the data store.

Inverted file index (IVF): uses k-means to cluster the vectors into cells, with the number of centroids defined in advance. A query vector is first compared with the centroids, and only the vectors in the most relevant cluster are searched. This is faster than a flat index, at the cost of possibly missing neighbors that fall in cells that were not searched.
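The difference between the two indexes can be shown in plain Python. This sketch does not use FAISS itself: the vectors are made up, the centroids are given by hand instead of being learned with k-means, and `nprobe` (the number of cells to search) is borrowed as a parameter name.

```python
import math

vectors = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (9.0, 0.0)]

def flat_search(query, k=2):
    """Flat-style search: compare the query with every stored vector."""
    return sorted(range(len(vectors)), key=lambda i: math.dist(query, vectors[i]))[:k]

# IVF-style index. Assumption: centroids are fixed by hand here; a real index
# would learn them with k-means during training.
centroids = [(0.0, 0.0), (5.0, 5.0), (9.0, 0.0)]
cells = {c: [] for c in range(len(centroids))}
for i, v in enumerate(vectors):
    nearest = min(range(len(centroids)), key=lambda c: math.dist(v, centroids[c]))
    cells[nearest].append(i)

def ivf_search(query, k=2, nprobe=1):
    """Search only the nprobe cells whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in cells[c]]
    return sorted(candidates, key=lambda i: math.dist(query, vectors[i]))[:k]

query = (4.8, 5.2)
flat_result = flat_search(query)   # 5 distance computations against vectors
ivf_result = ivf_search(query)     # 3 against centroids, then 2 against vectors
print(flat_result, ivf_result)
```

On five vectors the saving is negligible, but the candidate set in `ivf_search` stays roughly the size of one cell however large the collection grows.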

Locality-Sensitive Hashing (LSH): uses hash functions to group similar vectors into buckets. It is fast and memory efficient, and best suited to high-dimensional sparse data, such as sparse text vectors.

In short: FAISS gives control over indexing, but has no metadata support and is deployed on a single node.

Chroma DB is easy to deploy and supports metadata, but indexes only with HNSW.

Appendix A: What are Advanced Retrievers?

Advanced retrievers in LlamaIndex are sophisticated components that go beyond simple vector similarity search to provide more nuanced, context-aware, and intelligent information retrieval. They combine multiple techniques such as:

  • Semantic Understanding: Using embeddings to understand meaning and context

  • Keyword Matching: Precise term-based search for exact specifications

  • Hierarchical Context: Maintaining relationships between different levels of information

  • Multi-Query Processing: Generating and combining results from multiple query variations

  • Fusion Techniques: Intelligently combining results from different retrieval methods

Why are Advanced Retrievers Important?

  1. Improved Accuracy: Advanced retrievers can find more relevant information by using multiple search strategies

  2. Better Context Preservation: They maintain important relationships between pieces of information

  3. Reduced Hallucination: More precise retrieval leads to more accurate AI responses

  4. Scalability: Efficient retrieval strategies work better with large document collections

  5. Flexibility: Different retrieval methods can be combined for optimal results

Index Types Overview

Before exploring advanced retrievers, it's helpful to first understand the three main index types supported by LlamaIndex. Each is designed to support different retrieval scenarios:

VectorStoreIndex:

  • Stores vector embeddings for each document chunk

  • Best suited for semantic retrieval based on meaning

  • Commonly used in LLM pipelines and RAG applications

DocumentSummaryIndex:

  • Generates and stores summaries of documents at indexing time

  • Uses summaries to filter documents before retrieving full content

  • Especially useful for large and diverse document sets that cannot fit in the context window of an LLM

KeywordTableIndex:

  • Extracts keywords from documents and maps them to specific content chunks

  • Enables exact keyword matching for rule-based or hybrid search scenarios

  • Ideal for applications requiring precise term matching

Appendix B: Cheat Sheet: Advanced Retrievers for RAG

What are Advanced Retrievers?

Advanced retrievers go beyond simple vector similarity search to provide more nuanced, context-aware information retrieval through:

  • Semantic Understanding: Using embeddings for meaning and context

  • Keyword Matching: Precise term-based search for exact specifications

  • Hierarchical Context: Maintaining relationships between information levels

  • Multi-Query Processing: Generating and combining results from multiple query variations

  • Fusion Techniques: Intelligently combining results from different retrieval methods

Maximal Marginal Relevance (MMR)

  • Purpose: Balance relevance and diversity of retrieved results

  • Method: Selects documents that are highly relevant to the query AND minimally similar to previously selected documents

  • Benefit: Avoids redundancy and ensures comprehensive coverage of different query aspects

LlamaIndex Retrievers

Core Index Types in LlamaIndex

VectorStoreIndex

  • Function: Stores vector embeddings for each document chunk

  • Best suited for: Semantic retrieval based on meaning

  • Usage: Commonly used in LLM pipelines and RAG applications

DocumentSummaryIndex

  • Function: Generates and stores summaries of documents at indexing time

  • Process: Uses summaries to find and retrieve relevant documents

  • Best for: Large documents whose meanings would be lost by chunking; large documents that cannot fit in LLM or embedding model context windows

  • Key Points: Returns original documents, not their summaries; uses summaries instead of text chunks to enable retrieval based on the semantic meaning of the entire text

KeywordTableIndex

  • Function: Extracts keywords from documents and maps to content chunks

  • Best for: Exact keyword matching for rule-based or hybrid search scenarios

  • Use Case: Applications requiring precise term matching

LlamaIndex Retriever Types

  1. Vector Index Retriever

Most common retriever - uses vector embeddings to find semantically related content

  • Process: Embeds query, compares with document embeddings using cosine similarity

  • Ideal for: General-purpose search, RAG pipelines where semantic understanding is crucial

  • Limitation: May miss exact keyword matches when specific terms are crucial

  2. BM25 Retriever

Advanced keyword-based retrieval that improves on TF-IDF

TF-IDF Foundation:

  • Term Frequency (TF): How often a word appears in a document

  • Inverse Document Frequency (IDF): How rare a word is across all documents

  • TF-IDF Score: TF × IDF

BM25 Improvements:

  • Term Frequency Saturation: Reduces impact of repeated terms using saturation function

  • Document Length Normalization: Adjusts for document length, preventing long document bias

  • Tunable Parameters: k1≈1.2 (saturation control), b≈0.75 (length normalization)

Best for: Technical documentation, legal documents, exact terminology requirements
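The saturation and length-normalization ideas are easier to see in code. The sketch below uses one widely used form of the BM25 score with a smoothed IDF; the exact IDF variant differs between libraries, so treat the formula as an assumption. The corpus is made up and tokenization is a plain whitespace split.

```python
import math
from collections import Counter

docs = [
    "neural networks learn neural representations",
    "the cat sat on the mat",
    "graph search with neural embeddings and many other unrelated filler words here",
]
tokenized = [d.split() for d in docs]
avg_len = sum(len(d) for d in tokenized) / len(tokenized)

def bm25(query, doc_tokens, k1=1.2, b=0.75):
    counts = Counter(doc_tokens)
    score = 0.0
    for term in query.split():
        n = sum(1 for d in tokenized if term in d)           # documents containing the term
        idf = math.log((len(tokenized) - n + 0.5) / (n + 0.5) + 1)
        f = counts[term]                                      # term frequency in this document
        norm = 1 - b + b * len(doc_tokens) / avg_len          # document length normalization
        score += idf * f * (k1 + 1) / (f + k1 * norm)         # saturating term frequency
    return score

scores = [bm25("neural", d) for d in tokenized]
print([round(s, 3) for s in scores])
```

The short document that mentions "neural" twice scores highest, the long document that mentions it once scores lower, and the document without the term scores zero. Because of saturation, the per-term factor `f * (k1 + 1) / (f + k1 * norm)` approaches `k1 + 1` as `f` grows, so repeating a term many times gives diminishing returns.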

  3. Document Summary Index Retrievers

Two Variants:

  • DocumentSummaryIndexLLMRetriever: Uses LLM to analyze query against summaries (intelligent but expensive)

  • DocumentSummaryIndexEmbeddingRetriever: Uses semantic similarity between query and summary embeddings (faster, cost-effective)

Process: Two-stage approach using summaries to filter documents, then returns full document content

  4. Auto Merging Retriever

Purpose: Preserves context in long documents using hierarchical structure

Method:

  • Uses hierarchical chunking (parent and child nodes)

  • If enough child nodes from same parent are retrieved, returns parent node instead

  • Dual Storage: Child chunks for precise matching, parent chunks for context

Best for: Long documents, legal papers, technical specifications needing context preservation
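The merge rule ("if enough child nodes from the same parent are retrieved, return the parent") can be sketched as follows. The node ids are made up, and the `threshold` parameter with its 0.5 value is an assumption chosen for illustration, not a documented default.

```python
def auto_merge(retrieved_children, child_to_parent, children_per_parent, threshold=0.5):
    """Replace retrieved children with their parent when enough siblings were hit."""
    hits = {}
    for child in retrieved_children:
        hits.setdefault(child_to_parent[child], []).append(child)
    result = []
    for parent, kids in hits.items():
        if len(kids) / children_per_parent[parent] > threshold:
            result.append(parent)    # enough siblings retrieved: return the parent node
        else:
            result.extend(kids)      # otherwise keep the individual child nodes
    return result

child_to_parent = {"c1": "P1", "c2": "P1", "c3": "P1", "c4": "P2", "c5": "P2", "c6": "P2"}
children_per_parent = {"P1": 3, "P2": 3}

merged = auto_merge(["c1", "c2", "c4"], child_to_parent, children_per_parent)
print(merged)
```

Two of the three children of `P1` were retrieved, so they are replaced by `P1`; only one child of `P2` was retrieved, so it is returned as is.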

  5. Recursive Retriever

  • Purpose: Follows relationships between nodes using references

  • Capability: Can follow references from one node to another (citations, metadata links)

  • Types: Supports chunk references and metadata references

  • Best for: Academic papers with citations, interconnected knowledge bases

  6. Query Fusion Retriever

Purpose: Combines results from different retrievers and optionally generates multiple query variations

Core Capabilities:

  • Multiple retriever support (combines vector-based and keyword-based methods)

  • Query variation generation using LLM

  • Sophisticated fusion strategies to improve recall

Three Fusion Modes:

Reciprocal Rank Fusion (RRF)

  • Most robust fusion method - combines ranked lists using reciprocal of ranks

  • Formula: RRF_score(d) = Σ (1 / (rank_i(d) + k)) where k≈60

  • Best for: Default choice for most fusion scenarios, production systems

Relative Score Fusion

  • Preserves score magnitudes while normalizing across query variations

  • Formula: normalized_score = original_score / max_score

  • Best for: When embedding model confidence scores are meaningful

Distribution-Based Score Fusion

  • Most sophisticated - uses statistical properties of score distributions

  • Methods: Z-score normalization, percentile ranking

  • Best for: Complex queries with varying score distributions

LangChain Retrievers

LangChain Retriever Interface

Definition: "An interface that returns documents based on an unstructured query"

  • More general than a vector store

  • Accepts string query as input, returns list of documents as output

  • Doesn't necessarily store documents - purpose is to retrieve them

LangChain Retriever Types

  1. Vector Store-Backed Retriever

Foundation retriever - lightweight wrapper around vector store class

Search Types:

  • Simple Similarity Search: Returns documents ranked by similarity (default 4 results)

  • MMR Search: Balances relevance and diversity to avoid redundancy

  • Similarity Score Threshold: Returns only documents above specified threshold

  2. Multi-Query Retriever

Problem Addressed: "Distance-based vector database retrieval may vary with subtle changes in query wording"

Solution Process:

  • Uses LLM to generate multiple queries from different perspectives

  • For each query, retrieves set of relevant documents

  • Takes unique union of results for larger set of potentially relevant documents

Benefit: "By generating multiple perspectives on the same question, the MultiQueryRetriever can potentially overcome some limitations of distance-based retrieval"

  3. Self-Querying Retriever

Core Capability: "Has the ability to query itself"

Process: Converts natural language query into structured query with two components:

  • String to look up semantically

  • Metadata filter to accompany it

Requirements: Documents must have rich, structured metadata with field descriptions

Best for: Applications combining semantic search with attribute filtering

Example Queries:

  • "I want to watch a movie rated higher than 8.5" (filter only)

  • "Has Greta Gerwig directed any movies about women" (query + filter)

  4. Parent Document Retriever

Problem Solved: "Conflicting desires" when splitting documents:

  • Small documents for accurate embeddings

  • Large documents for context retention

Solution: "Strikes that balance by splitting and storing small chunks of data"

Process:

  • During retrieval, first fetches small chunks

  • Looks up parent IDs for those chunks

  • Returns larger documents containing the small chunks

Architecture:

  • Two splitters: Parent (large chunks for retrieval) and child (small chunks for embeddings)

  • Dual storage: Vector store for embeddings, document store for parent documents

Decision Framework

Need | LlamaIndex Choice | LangChain Choice
--- | --- | ---
Exact keyword matching | BM25 Retriever | Vector Store-Backed + custom keyword logic
Multi-query with fusion | Query Fusion Retriever (RRF/Relative/Distribution) | Multi-Query Retriever (union approach)
Citation following | Recursive Retriever | Not directly supported
Hierarchical context | Auto Merging Retriever | Parent Document Retriever
Simple semantic search | Vector Index Retriever | Vector Store-Backed Retriever

Appendix C: (H)ierarchical (N)avigable (S)mall (W)orld

HNSW finds the nearest vectors by “walking” through a graph of vectors, instead of comparing the query to every vector.

Think of vectors as cities on a map 🗺️

Imagine:

  • Every vector = a city

  • Distance between vectors = physical distance

  • Query vector = “Where am I now?”

Your goal:

Find the k closest cities to where you are.

Brute force (IndexFlat)

This is like:

“Measure the distance from my location to every city on Earth, then sort.”

✅ Correct
❌ Very slow

HNSW idea: build roads between nearby cities 🚗

Instead of checking every city:

  • Each city has roads to nearby cities

  • You don’t need a map of everything

  • You can walk from city to city, always getting closer

This network of roads is the graph in HNSW.

How HNSW is built (simple view)

HNSW builds a multi‑layer graph:

Bottom layer

  • All vectors

  • Each vector connects to nearby vectors

Upper layers

  • Fewer vectors

  • Longer “highway” connections

Think:

  • Top layers = highways

  • Bottom layer = local streets

How search works (step by step)

1️⃣ Start at the top layer

  • Pick an entry point

  • Only a few nodes exist here

  • Move to the neighbor that’s closer to the query

✅ Fast, coarse movement


2️⃣ Go down one layer

  • Now more nodes exist

  • Continue moving to closer neighbors

✅ More precise


3️⃣ Reach the bottom layer

  • This is where fine‑grained neighbors live

  • Search locally around the best candidates

✅ Good accuracy without full scan
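The three steps can be traced on a tiny hand-built graph. This is a simplified sketch, not the real algorithm: the layers and links are written by hand (a real HNSW index builds them while inserting vectors), and each layer uses a purely greedy walk, whereas real implementations keep a list of several candidates at the bottom layer.

```python
import math

points = {"A": (0, 0), "B": (4, 0), "C": (8, 0), "D": (8, 3), "E": (9, 4), "F": (1, 1)}

# Hand-built layers (assumption for illustration). layers[0] is the sparse top
# "highway" layer; layers[-1] is the bottom layer that contains every point.
layers = [
    {"A": ["C"], "C": ["A"]},
    {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]},
    {"A": ["B", "F"], "F": ["A"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"]},
]

def greedy_walk(graph, start, query):
    """Move to the closest neighbor until no neighbor is closer than the current node."""
    current = start
    while True:
        best = min(graph[current], key=lambda n: math.dist(points[n], query))
        if math.dist(points[best], query) >= math.dist(points[current], query):
            return current
        current = best

def hnsw_search(query, entry="A"):
    node = entry
    reached = []
    for graph in layers:              # top layer first, bottom layer last
        node = greedy_walk(graph, node, query)
        reached.append(node)          # where the walk ended on this layer
    return node, reached

nearest, path = hnsw_search((9, 5))
print(nearest, path)
```

The walk ends at C on the top layer, D on the middle layer, and E on the bottom layer, and it never computes a distance to F.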


Key idea to remember

HNSW never searches everything.
It only explores vectors that look promising.

That’s why it’s fast.


Is HNSW exact K‑NN?

No — and this is important.

  • HNSW returns approximate nearest neighbors

  • Usually very close to the true nearest ones

  • Much faster than brute force

This trade‑off is intentional.

Compare with what you already know

Index | How it searches
--- | ---
IndexFlat | Compare with every vector
IVF | Search only selected clusters (regions)
HNSW | Walk a graph of nearby vectors

HNSW:

  • ❌ No regions

  • ✅ Local neighborhoods

  • ✅ Graph traversal


HNSW is popular because it gives:

  • ⚡ Very fast search

  • 🎯 High recall (often >95%)

  • 📈 Good scalability

That’s why:

  • FAISS

  • Qdrant

  • Weaviate

  • Milvus

all use HNSW or HNSW‑like ideas.


Beginner takeaway (lock this in)

HNSW avoids full comparison by organizing vectors into a layered graph and navigating toward closer neighbors step by step.

If you remember just that, you understand HNSW.

