What is RAG (retrieval augmented generation)

Jonathan Bree


Language models know a lot. They do not know your documents, your users, or anything that happened after their training cutoff. RAG is the pattern that addresses the first and last of those problems.

Retrieval augmented generation is an architecture pattern for giving language models access to information they were not trained on. Rather than relying solely on knowledge encoded in model weights during training, a RAG system retrieves relevant information from an external source at query time and includes it in the prompt alongside the question. The model generates its answer from that combined context.

The result is a system that can answer questions about documents the model was never trained on, knowledge bases that update independently of model training, and domain-specific information that would never appear in general training data. RAG is one of the most widely adopted patterns in production AI systems and underpins many of the enterprise AI applications built in recent years.


Why RAG exists

Language models are trained on large corpora of text and encode knowledge in their weights. This produces models that are broadly knowledgeable but have two hard limitations.

The first is the training cutoff. A model trained in early 2024 has no knowledge of events after that date. No amount of prompting changes this. The knowledge simply is not there.

The second is specificity. General training data does not include your company's internal documentation, your product's knowledge base, your legal archive, or your customer support history. A model cannot answer questions about these things because it was never trained on them.

Fine-tuning can partially address the second problem, but it is expensive, requires a significant amount of training data, and produces a static model: it encodes a snapshot of the training data at a point in time and cannot be updated without retraining. For knowledge bases that change frequently, fine-tuning is impractical.

RAG solves both problems by separating knowledge storage from the model. Knowledge lives in an external store, updated independently. At query time, relevant knowledge is retrieved and injected into the context. The model generates from current, specific, relevant information without requiring retraining.


How RAG works end to end

RAG has five distinct phases: chunking, embedding, indexing, retrieval, and generation.

1 - Chunking

Before documents can be indexed, they must be split into chunks: smaller units that can be individually embedded and retrieved. A chunk is typically a passage, a paragraph, or a fixed number of tokens, depending on the chunking strategy.

Chunking strategy matters more than most implementations account for. Naive chunking by token count splits documents at arbitrary points, often in the middle of a sentence or concept, producing chunks that lack the context needed to answer a question. A chunk that says "as mentioned above, the deadline is end of Q3" is not retrievable on its own because "as mentioned above" refers to context that is now in a different chunk.

Better chunking strategies include semantic chunking, which splits at natural concept boundaries rather than token counts, and overlapping chunking, which includes a fixed overlap between adjacent chunks to preserve context across boundaries. The right strategy depends on the document type and the expected query patterns.
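To make overlapping chunking concrete, here is a minimal sketch in Python. It assumes the document has already been tokenised (splitting on whitespace stands in for a real tokeniser here), and the function name and the chunk_size and overlap parameters are illustrative rather than taken from any particular library.

```python
def chunk_with_overlap(tokens, chunk_size=200, overlap=50):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Whitespace splitting is a stand-in for a real tokeniser.
words = "the deadline for the migration project is the end of Q3".split()
chunks = chunk_with_overlap(words, chunk_size=5, overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk repeats the last two tokens of the one before it, so a sentence that straddles a boundary appears whole in at least one chunk whenever it is shorter than the overlap. Larger overlaps preserve more context at the cost of storing and embedding more redundant text.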

2 - Embedding

Each chunk is converted into an embedding: a dense vector representation of its meaning in a high-dimensional space. Similar chunks end up close together in that space. The embedding model determines the quality of this representation.

General-purpose embedding models like OpenAI's text-embedding-3 family or Cohere's embed models work well across diverse content. Domain-specific content, highly technical material, or non-English text may benefit from models trained or fine-tuned for that domain. The embedding model used at indexing time must match the model used at query time, because embeddings are only comparable within the same model's vector space.
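The idea of comparing texts as vectors can be sketched without calling a real model. The toy_embed function below is a deliberately crude stand-in (a hashed bag of words, an assumption made for illustration) that captures word overlap and nothing about meaning. What carries over to real embedding models is the shape of the workflow: the same function embeds both texts, and cosine similarity compares the results.

```python
import hashlib
import math

DIM = 64  # toy dimensionality; real embedding models use far more dimensions


def toy_embed(text, dim=DIM):
    """Hashed bag-of-words vector. A stand-in for a real embedding model."""
    vec = [0.0] * dim
    for word in text.lower().split():
        # md5 gives a bucket that is stable across runs, unlike built-in hash()
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors from the same embedding space."""
    if len(a) != len(b):
        raise ValueError("vectors must come from the same embedding space")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


chunk_vec = toy_embed("refund policy for damaged orders")
query_vec = toy_embed("refund policy")
print(cosine_similarity(chunk_vec, query_vec))
```

The length check in cosine_similarity is a cheap guard against mixing vectors from different models, although two models with the same dimensionality would pass it and still produce meaningless scores.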

See what is semantic search for a more detailed treatment of how embeddings and vector similarity work.

3 - Indexing

Embeddings are stored in a vector index that supports efficient similarity search. Pinecone, Weaviate, Qdrant, and pgvector are common choices. The index allows fast retrieval of the nearest neighbours to a query embedding across potentially millions of stored chunks.

Most vector indexes also support metadata filtering: restricting retrieval to chunks that match certain attributes, such as document type, date range, or source. This is important for large corpora where unfiltered retrieval would surface too much noise.
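As a rough illustration of what an index does, the sketch below stores vectors in memory, applies an exact-match metadata filter, and ranks the survivors by cosine similarity. It is a brute-force scan with hypothetical class and parameter names, not the API of Pinecone, Weaviate, Qdrant, or pgvector, which generally rely on approximate nearest-neighbour structures to stay fast at scale.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class Entry:
    chunk_id: str
    vector: list
    metadata: dict = field(default_factory=dict)


class InMemoryVectorIndex:
    """Exact (brute-force) similarity search with simple metadata filtering."""

    def __init__(self):
        self._entries = []

    def add(self, chunk_id, vector, metadata=None):
        self._entries.append(Entry(chunk_id, list(vector), dict(metadata or {})))

    def search(self, query_vector, k=3, where=None):
        where = where or {}
        # Filter first, so only matching chunks compete for the top-k slots.
        candidates = [
            e for e in self._entries
            if all(e.metadata.get(key) == value for key, value in where.items())
        ]
        scored = [(cosine(query_vector, e.vector), e.chunk_id) for e in candidates]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


index = InMemoryVectorIndex()
index.add("policy-1", [1.0, 0.0], {"doc_type": "policy"})
index.add("faq-1", [0.9, 0.1], {"doc_type": "faq"})
index.add("faq-2", [0.0, 1.0], {"doc_type": "faq"})

unfiltered = index.search([1.0, 0.0], k=2)
faq_only = index.search([1.0, 0.0], k=2, where={"doc_type": "faq"})
print(unfiltered)
print(faq_only)
```

With the filter applied, the policy chunk never enters the ranking, even though it is the closest vector to the query.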

4 - Retrieval

At query time, the user's question is embedded using the same model used to index the corpus. The vector index finds the chunks whose embeddings are most similar to the query embedding. The top k chunks, typically between three and twenty depending on context window size and retrieval quality, are returned as the retrieval result.

Some RAG systems add a re-ranking step: a second model evaluates the retrieved chunks for relevance and reorders them before they are passed to the generator. Re-ranking is more expensive than initial retrieval but can significantly improve precision, particularly for queries where the top similarity results are not the most relevant.
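A re-ranking step has a simple shape: take the candidates from first-stage retrieval, score each one against the query with a more careful scorer, and keep the best. In the sketch below, overlap_score is a hypothetical placeholder for that second model (in practice often a cross-encoder); it only counts shared words, which is enough to show the control flow.

```python
def overlap_score(query, passage):
    """Placeholder relevance scorer: fraction of query words found in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / len(query_words) if query_words else 0.0


def rerank(query, candidates, score_fn, top_n=3):
    """Reorder first-stage candidates by a second, more expensive scorer."""
    scored = [(score_fn(query, text), text) for text in candidates]
    # The sort is stable, so ties keep their first-stage order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


first_stage = [
    "shipping times vary by region",
    "refund requests must be filed within 30 days",
    "our refund policy covers damaged items",
]
best = rerank("how do I request a refund within 30 days", first_stage, overlap_score, top_n=2)
print(best)
```

Because the scorer runs once per candidate, the cost of re-ranking grows with the number of chunks passed in, which is why it is applied to a short candidate list rather than the whole corpus.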

Hybrid retrieval combines vector similarity with keyword search, typically BM25, to improve precision on queries containing specific terms that semantic search might miss. This is particularly valuable for technical content with domain-specific vocabulary.
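One common way to merge a vector ranking with a keyword ranking is reciprocal rank fusion. That choice is an assumption for this sketch, since the approach above does not depend on any particular fusion method. Each chunk earns 1 / (k + rank) from every list it appears in, and the constant k (60 is a conventional default) dampens the influence of top positions. The chunk IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_hits = ["c2", "c7", "c4"]   # ranked by embedding similarity
keyword_hits = ["c7", "c9", "c2"]  # ranked by a keyword scorer such as BM25
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['c7', 'c2', 'c9', 'c4']
```

Chunks that appear in both lists rise to the top, while a chunk found by only one retriever still survives into the fused result. Fusing on ranks rather than raw scores avoids having to normalise similarity values and BM25 scores onto a common scale.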

5 - Generation

The retrieved chunks are assembled into a prompt alongside the original question and any system instructions. The language model generates a response grounded in that context. The quality of the generated answer depends on the quality of the retrieved context. If retrieval surfaces the right chunks, a smaller model can answer accurately; if it surfaces noise or misses relevant content, a larger model cannot fully compensate.

This is why retrieval quality matters more than model scale for RAG systems, a principle that applies equally to agent memory retrieval.
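Prompt assembly for the generation phase can be as simple as string formatting. The template below is one plausible layout, not a standard; the wording of the instructions and the numbered-source convention are assumptions. The resulting string would be sent to whichever language model the system uses.

```python
DEFAULT_INSTRUCTIONS = (
    "Answer using only the context below. "
    "If the context does not contain the answer, say so."
)


def build_prompt(question, chunks, system_instructions=DEFAULT_INSTRUCTIONS):
    """Combine instructions, retrieved chunks, and the question into one prompt."""
    # Number the chunks so the model can refer back to its sources.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        f"{system_instructions}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        f"Answer:"
    )


prompt = build_prompt(
    "When is the migration deadline?",
    ["The migration deadline is the end of Q3.", "Migration is owned by the platform team."],
)
print(prompt)
```

Telling the model to admit when the context lacks the answer is a common guard against it falling back on its training data when retrieval comes up empty.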


Where RAG works well

Question answering over documents. A legal team querying a contract archive. A support agent answering questions from a product knowledge base. A researcher searching an internal document library. RAG is well-suited to any use case where the knowledge base is a corpus of documents and the primary task is answering questions about what those documents contain.

Knowledge bases that update frequently. Because knowledge lives in the external store rather than in model weights, it can be updated independently. New documents are chunked, embedded, and indexed. The model immediately has access to the new information without retraining.

Domain-specific applications. Medical, legal, financial, and technical applications where the relevant knowledge is specialised and not well-represented in general training data benefit significantly from RAG.

Reducing hallucination. Grounding model responses in retrieved context reduces hallucination on factual questions. The model is answering from retrieved evidence rather than from pattern-matching over training data. This is not a complete solution to hallucination but it is a meaningful improvement for fact-dependent use cases.


Where RAG falls short

Static corpus assumption. RAG assumes a corpus of documents that are indexed and retrieved. It has no native mechanism for tracking how facts change over time, detecting contradictions between documents, or determining which of two conflicting sources is more current. For knowledge bases where facts evolve, this is a significant limitation.

No user-specific state. RAG retrieves from a shared corpus. It has no concept of a specific user's history, preferences, or context. Two users asking the same question get the same retrieved context. Personalisation is not a RAG capability.

No cross-session synthesis. RAG answers questions about what is in the corpus. It cannot synthesise information across a user's past interactions, track how their situation has changed, or assemble an answer from fragments distributed across multiple sessions. This requires a different kind of infrastructure.

Chunking information loss. The chunking process discards document structure, cross-references, and context that spans chunk boundaries. Complex questions that require reasoning across multiple sections of a document, or across multiple documents, may not be answerable from the chunks that retrieval surfaces.

Retrieval ceiling. The quality of vector similarity retrieval tends to degrade as a corpus grows very large. See semantic collapse for a detailed treatment of what happens to retrieval quality as corpora grow.


RAG vs persistent memory

RAG and persistent agent memory are frequently conflated because both involve retrieval. They solve different problems and the distinction matters for building agents that work well over time.

RAG answers: what is in this corpus that is relevant to this query? It retrieves documents. It operates over a shared, relatively static knowledge base. It has no concept of a specific user's history or evolving state.

Persistent memory answers: what do I know about this user, this context, or this situation, and how has it changed? It retrieves facts, preferences, decisions, and state. It operates over an evolving, user-specific knowledge base. It tracks how things change over time and resolves contradictions as they arise.
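To show how different the data model is, here is a deliberately small sketch of a user-scoped memory store. It is hypothetical throughout and does not represent how any particular memory product works: facts are keyed by user and attribute, every observation is kept, and a contradiction is resolved by the crude rule that the most recent observation wins.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    value: str
    observed_at: int  # any monotonically increasing timestamp


class UserMemory:
    """Keeps a history of facts per user and attribute; the newest one is current."""

    def __init__(self):
        self._facts = {}  # (user_id, key) -> list of Fact

    def remember(self, user_id, key, value, observed_at):
        self._facts.setdefault((user_id, key), []).append(Fact(value, observed_at))

    def current(self, user_id, key):
        history = self._facts.get((user_id, key))
        if not history:
            return None
        # Simplistic contradiction handling: the latest observation wins.
        return max(history, key=lambda fact: fact.observed_at).value

    def history(self, user_id, key):
        facts = self._facts.get((user_id, key), [])
        return [fact.value for fact in sorted(facts, key=lambda fact: fact.observed_at)]


memory = UserMemory()
memory.remember("user-1", "plan", "free", observed_at=1)
memory.remember("user-1", "plan", "pro", observed_at=2)
print(memory.current("user-1", "plan"))
print(memory.history("user-1", "plan"))
```

Nothing here involves chunks or embeddings. Lookups are scoped to one user, and the store keeps track of how a fact changed, which is exactly what a shared document index does not do.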

A complete agent architecture typically needs both. RAG handles the knowledge base: domain content, product documentation, internal data. Persistent memory handles the user layer: who this person is, what they have told the system, what has changed. The two retrieval problems are distinct and require different infrastructure to solve.

See RAG vs agent memory: what's the difference for a full treatment of the distinction, including where the two approaches overlap and how they work together in a complete agent architecture.

Exabase supports both. The Memory API handles persistent agent memory. The Resources API handles document storage and retrieval for RAG workloads. See the docs to get started.
