Retrieval-Augmented Generation (RAG) has become the go-to pattern for letting LLMs search external data. In this post, I’ll break down how RAG works and talk about embeddings, chunking, and search strategies.

Information Retrieval Recap

First, let’s review some information retrieval fundamentals, especially in the context of LLMs and how search and indexing work. I’ve depicted search strategies as different spectra in the following infographic:

Information Retrieval Continuum

On the flexibility spectrum, we have rigid and flexible at opposite ends. A classic SQL query with a WHERE name='Alf' predicate would be symbolic, exact, and very deterministic. Even a WHERE name ILIKE 'alf%' predicate or one with a regular expression is essentially exact-like search.

Full-Text Search (FTS) uses database-provided functionality that has an understanding of language, such as grammar and stop words, to preprocess documents into a form optimized for searching. This includes spell-correcting words before indexing, mapping synonyms to a single word, and canonicalizing words so that documents are easier (i.e., more efficient) to search.
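
To make that preprocessing step concrete, here’s a minimal Python sketch of the kind of normalization an FTS engine performs before indexing. The stop-word list, synonym map, and suffix stripping are simplified stand-ins for what a real engine does:

import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are"}
SYNONYMS = {"automobile": "car", "colour": "color"}  # map variants to one canonical word

def normalize(text: str) -> list[str]:
    """Lowercase, tokenize, drop stop words, map synonyms, and crudely stem."""
    terms = []
    for tok in re.findall(r"[a-z']+", text.lower()):
        if tok in STOP_WORDS:
            continue
        tok = SYNONYMS.get(tok, tok)
        tok = re.sub(r"(ing|ed|es|s)$", "", tok)  # naive stemming
        terms.append(tok)
    return terms

def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Inverted index: canonical term -> set of document ids containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in normalize(text):
            index[term].add(doc_id)
    return index

docs = {1: "The cars are parked", 2: "An automobile is a machine"}
index = build_index(docs)
print(index["car"])  # both documents match the canonical term "car"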

Exact retrieval uses rule-based symbolic logic to find data. It’s like asking for a specific row in a spreadsheet by knowing some of the column values.

DuckDB SQL: SELECT gpu_id, AVG(temp) OVER (PARTITION BY gpu_id ORDER BY time ROWS BETWEEN 5 PRECEDING AND CURRENT ROW) AS rolling_avg FROM gpu_temp_readings WHERE gpu_type='nvidia_h100'

Lexical retrieval uses natural language keyword-based terms to find data. A preprocessing step has already mapped keywords in the data to symbolic values, and query keywords are converted into the same dictionary of symbols before searching.

PostgreSQL FTS: SELECT title, ts_rank_cd(textsearch, query) AS rank FROM documents, websearch_to_tsquery('english', '"black hole" -supernova') AS query WHERE textsearch @@ query ORDER BY rank DESC LIMIT 10

Conceptual retrieval converts data into vectors, akin to meaning-based symbols. A suitable embedding model is needed so that items with similar meaning are encoded as vectors that sit close together.

SQLite with USearch extension: SELECT id, distance_cosine_f32(vt.vector, '[7.0, 8.0, 9.0]') AS distance FROM vectors_table AS vt ORDER BY distance LIMIT 5

Note that even though these examples showcase SQL syntax, the underlying searching and ranking work differently.

Query               Retrieval    Accuracy      Cost
Exact predicates    Exact        High          Low
Keyword-based       Lexical      Medium-High   Medium
Relationship-based  Conceptual   Variable      High

Conceptual retrieval can be more expensive because an embedding must be created for each piece of data and for each query. Depending on the embedding model, the resulting vectors can have high dimensionality, which increases resource usage.
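
As a rough back-of-the-envelope calculation (the corpus size and dimensionality below are made-up examples), storing one float32 vector per chunk adds up quickly:

# Hypothetical numbers: 1 million chunks embedded with a 1536-dimensional model.
num_chunks = 1_000_000
dimensions = 1536
bytes_per_float32 = 4

raw_bytes = num_chunks * dimensions * bytes_per_float32
print(f"{raw_bytes / 1024**3:.1f} GiB of raw vectors")  # ~5.7 GiB before any index overhead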

Embeddings

An embedding is a vector generated by a trained ML model, known as an embedding model, which takes input in the form of text, images, videos, etc. and turns it into an array of floating point numbers: a vector. In a text embedding, words that are semantically related are mapped to vectors that cluster close together. The definition of similarity depends on the embedding model and the data it has been trained on.

If you were to handcraft a 2-dimensional embedding, it might look like this:

  • cat = [1.0, 0.2]
  • dog = [0.7, 0.4]
  • car = [-0.8, 1.0]

Maybe the first dimension captures the “domesticated-ness” and the second the “size”. But realistically, this 2D embedding doesn’t have enough dimensionality to indicate similarity in the real world.
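
Even with this toy 2D embedding, you can already measure similarity. Here’s a quick sketch using cosine similarity, no libraries needed:

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, -1.0 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

cat, dog, car = [1.0, 0.2], [0.7, 0.4], [-0.8, 1.0]
print(cosine_similarity(cat, dog))  # ~0.95: cat and dog point in a similar direction
print(cosine_similarity(cat, car))  # ~-0.46: cat and car are far apart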

In high-dimensional embeddings, each item is represented by a denser vector with many more dimensions:

  • cat = [0.12, -0.58, ..., 0.09]

In N-dimensional vectors, individual dimensions don’t have discernible meanings because they have been learned through model training. In modern embeddings (e.g., BERT, text-embedding-3-large), the semantic meaning of individual dimensions is often a black box. Collectively, the dimensions capture words that are used in similar contexts or latent concepts like “pet”, “machine”, “language”, etc.
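
To see this in practice, a library like sentence-transformers can generate embeddings locally. The model name below is just one commonly used example; pick whatever fits your data and hardware:

# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# "all-MiniLM-L6-v2" is a small, CPU-friendly example model producing 384-dimensional vectors.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["a cat sleeping on the couch", "a dog napping on the sofa", "a car parked outside"]
embeddings = model.encode(sentences)

print(embeddings.shape)                            # (3, 384): one dense vector per sentence
print(util.cos_sim(embeddings[0], embeddings[1]))  # cat/dog sentences: relatively high similarity
print(util.cos_sim(embeddings[0], embeddings[2]))  # cat/car sentences: lower similarity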

While some embedding models are optimized to run on CPUs, many require a GPU. You’ll need to generate embeddings in real-time for user queries at search time, but since indexing is an asynchronous background process, embedding latency is less of a concern there.

Chunking

Most text embedding models perform best when given small phrases, such as sentences, instead of entire paragraphs. Chunking is the process of splitting a document into smaller chunks. It’s non-trivial to design a chunking strategy without losing context. For example, if you:

  • Chunk every N bytes: risk breaking words
  • Chunk per word: lose context within sentence
  • Chunk per sentence: lose context within paragraph
  • Chunk per paragraph: lose per-sentence subtlety

As you’ll see later, vector search is done on the chunks, not the original documents!

In many cases, chunking per sentence is a good enough approach when dealing with long-form text, but more elaborate strategies include:

  • Contextual chunks: add metadata to each chunk to preserve meaning
  • Overlapping chunks: include leading and trailing text so surrounding context is retained
  • Hierarchical chunks: incorporate document structure into chunks (e.g., a webpage’s heading tags into each chunk)

Vector search accuracy depends on choosing a good chunking strategy, and sometimes a combination of several!
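
As a minimal sketch (assuming plain English prose and a naive regex-based sentence splitter), here’s what sentence-level chunking with a one-sentence overlap might look like:

import re

def sentence_chunks(text: str, sentences_per_chunk: int = 3, overlap: int = 1) -> list[str]:
    """Split text into sentences, then group them into overlapping chunks."""
    # Naive splitter: a real implementation would handle abbreviations, quotes, etc.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    step = sentences_per_chunk - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + sentences_per_chunk]))
        if start + sentences_per_chunk >= len(sentences):
            break
    return chunks

text = ("Elephants are herbivores. They eat grasses, bark, and fruit. "
        "An adult can eat over 100 kg a day. They spend most of the day feeding.")
for chunk in sentence_chunks(text):
    print("-", chunk)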

In the case of structured data, like a spreadsheet, how would you chunk it? If you chunk it by individual cell, each chunk will likely have lost much of its meaning. With chunk enrichment, you can reintroduce metadata into the chunk to increase its relevancy in search. However, by adding back metadata, you’re effectively denormalizing the data back toward its original form. It raises the question of whether conceptual search is even the right tool, or whether you should stick to exact querying instead.
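
Here’s a hedged sketch of what chunk enrichment might look like for a spreadsheet: instead of embedding a bare cell value, each chunk is rebuilt as a small sentence carrying the sheet, row, and column context (the column names are invented for illustration):

def enrich_row(sheet: str, row_number: int, row: dict[str, str]) -> str:
    """Turn one spreadsheet row into a self-describing text chunk for embedding."""
    fields = "; ".join(f"{column} is {value}" for column, value in row.items())
    return f"Sheet '{sheet}', row {row_number}: {fields}."

# A bare cell like "1187.50" is meaningless on its own; the enriched chunk keeps its context.
row = {"customer": "Alf", "region": "EMEA", "monthly_spend_usd": "1187.50"}
print(enrich_row("2024-invoices", 42, row))
# Sheet '2024-invoices', row 42: customer is Alf; region is EMEA; monthly_spend_usd is 1187.50.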

At the same time, it’s worth acknowledging that a generic solution is much harder to build than one tailored to your specific dataset. Keep your chunking strategy adaptable and know your data!

Let’s return to the problem of getting an LLM to query our dataset. If you have a relational dataset, you could prompt the LLM with the database schema and the natural language query, have it generate exact (classic SQL) or lexical (keyword-style) queries, and execute them against the database to return the results. However, this is a potentially catastrophic design from a security perspective, and it doesn’t naturally leverage the strengths of an LLM: its ability to generate semantic queries and understand a user’s intent.
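
To make the approach (and its risks) concrete, here’s a rough sketch of schema-in-the-prompt SQL generation. The ask_llm function is a placeholder for whichever model API you use, and in practice you’d never execute the generated SQL without validation and a read-only, least-privilege connection:

SCHEMA = """
CREATE TABLE gpu_temp_readings (
  gpu_id TEXT,
  gpu_type TEXT,
  temp REAL,
  time TIMESTAMP
);
"""

def build_sql_prompt(question: str) -> str:
    return (
        "You translate questions into a single read-only SQL SELECT statement.\n"
        f"Schema:\n{SCHEMA}\n"
        f"Question: {question}\n"
        "Return only the SQL."
    )

prompt = build_sql_prompt("Which H100 GPU had the highest average temperature last week?")
# sql = ask_llm(prompt)                 # placeholder: call your LLM of choice here
# rows = read_only_conn.execute(sql)    # danger zone: validate and sandbox before running anything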

Retrieval-Augmented Generation (RAG) is a popular LLM design pattern for integrating conceptual retrieval. An LLM is trained on a large corpus of text and has an understanding of concepts and semantics, so it’s logical to let it refine the user’s intent (query) into a set of keywords. This query rewriting phase takes the user’s raw query, which may be as mundane as “are elephants vegetarian?”, and rewrites it into semantically-related phrases like:

  • elephant diet
  • what do elephants eat
  • herbivorous animals: elephants

These subqueries are independently embedded and the vector representations are used to search a vector database for similar vectors. The resulting vectors are mapped back to the original data.

The embedding model dictates the vector space: inputs that are related are designed to end up closer to each other, and that proximity is what makes vector retrieval work. If your choice of embedding model is poor, your results will be, too. This is why, as we’ll see later, it makes sense to understand your documents more. An embedding model trained on Internet-wide text may not be good enough for domain-specific text, resulting in poor search results.
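
Putting the two steps together, here’s a minimal sketch of searching pre-embedded chunks with the rewritten subqueries. The model is the same kind of sentence-transformers example as above, and the chunks stand in for whatever your chunking step produced:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model, swap in your own

chunks = [
    "Elephants are herbivores and eat grasses, bark, and fruit.",
    "The 1969 Apollo 11 mission landed the first humans on the Moon.",
    "An adult elephant can eat over 100 kg of vegetation per day.",
]
chunk_vectors = model.encode(chunks, normalize_embeddings=True)  # done offline at indexing time

subqueries = ["elephant diet", "what do elephants eat", "herbivorous animals: elephants"]
query_vectors = model.encode(subqueries, normalize_embeddings=True)

# With normalized vectors, cosine similarity is just a dot product.
scores = query_vectors @ chunk_vectors.T   # shape: (num_subqueries, num_chunks)
best = scores.max(axis=0)                  # best score per chunk across all subqueries
for idx in np.argsort(-best)[:2]:          # top 2 chunks mapped back to their text
    print(f"{best[idx]:.2f}  {chunks[idx]}")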

Visually, it can be summarized as follows:

Retrieval-Augmented Generation (RAG)

In the initial step, the LLM does a query rewrite, clarifying the query and increasing the chance of matching relevant results. But keep in mind that if the LLM’s training data isn’t aligned with the embedding model, it may not rewrite queries into the most semantically accurate subqueries. This is why it’s ideal to know your dataset beforehand and choose an appropriate embedding model, but this isn’t always possible.
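
Here’s a hedged sketch of that rewrite step using the OpenAI Python client; any chat-capable model works, and the model name and prompt wording are just examples:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rewrite_query(user_query: str) -> list[str]:
    """Ask the LLM to expand a raw user query into a few search-friendly subqueries."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name
        messages=[
            {"role": "system", "content": "Rewrite the user's question into 3 short search phrases, one per line."},
            {"role": "user", "content": user_query},
        ],
    )
    return [line.strip() for line in response.choices[0].message.content.splitlines() if line.strip()]

print(rewrite_query("are elephants vegetarian?"))
# e.g. ['elephant diet', 'what do elephants eat', 'herbivorous animals: elephants']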

Vector search uses Approximate Nearest Neighbor (ANN) algorithms like HNSW and IVF to search large sets of vectors quickly. Distance metrics like cosine similarity and Euclidean distance measure the closeness between two vectors. I won’t discuss these here because I don’t understand them deeply enough, and you can get reasonably far by starting with the common defaults of HNSW and cosine similarity.
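
If you do want to build the index yourself rather than rely on a database, hnswlib is a small library that exposes HNSW directly. A minimal sketch, with random vectors standing in for real embeddings:

# pip install hnswlib
import hnswlib
import numpy as np

dim = 384                                                   # must match your embedding model's output size
vectors = np.random.rand(10_000, dim).astype(np.float32)    # stand-ins for real chunk embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=len(vectors), ef_construction=200, M=16)
index.add_items(vectors, np.arange(len(vectors)))           # ids map back to your chunks
index.set_ef(50)                                            # higher ef = better recall, slower queries

query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=5)
print(labels[0], distances[0])                              # ids of the 5 approximate nearest chunks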

The Future

RAG is useful today, but will it stand the test of time?

I suspect it’ll become less of a buzzword that people over-eagerly reach for, and instead be a well-used pattern. For it to become useful for large and heterogeneous datasets, it’ll need to incorporate smaller and faster language models instead of involving an LLM in every part of the search step. Perhaps we’ll see dedicated smaller models for query rewriting and results reranking, eliminating a major latency bottleneck when dealing with search on larger datasets.

I’ll leave it up to you to decide if it’s worth investing in RAG or if you should focus on understanding your data and the fundamentals of search like preprocessing documents, indexing, and reranking.