RAG pipelines that actually remember

Add persistent memory on top of retrieval, so your pipeline remembers what worked, what the user asked before, and what context helped, instead of returning the same chunks every time.


Retrieval-augmented generation does one job well: pull relevant documents at query time and give the model something grounded to answer from. What it doesn't do is remember. Ask a RAG pipeline the same kind of question next week and it starts from scratch, retrieving as though it had never seen you, never learned which chunks actually helped, never noticed what you asked before. It retrieves; it doesn't accumulate.

For a lot of products that's a real ceiling. A pipeline that returns the same chunks for the same query forever isn't getting better; it's just consistent. This page is about adding persistent memory on top of retrieval, so the pipeline remembers the conversation, learns what context was useful, and improves with use, while keeping everything good RAG already gives you.


The problem

Standard RAG is stateless by design. Each query is independent: embed it, retrieve the nearest chunks, hand them to the model, forget the whole thing. That independence is fine for one-shot question answering and a genuine limitation everywhere else. The pipeline has no idea this user asked a related question yesterday, no record of which retrieved chunks led to a good answer and which were noise, no way to carry context from one turn to the next beyond stuffing history into the context window until it overflows.

There's a deeper version of the problem too. RAG and memory are different things that get confused because both involve "retrieving stuff." RAG answers questions about a corpus; memory maintains an evolving picture of a user or a situation. A pipeline that only does retrieval can't track that a user's circumstances changed, can't resolve that a newer fact supersedes an older one, can't tell that the answer it gave last week is now wrong. That's the difference between RAG and agent memory, and a pipeline that has retrieval but no memory is missing half of what most real applications need.

And the retrieval half has its own failure mode at scale. Naive vector search degrades as the corpus grows, returning fuzzier and fuzzier matches. This is the semantic collapse problem, and it is why a pipeline that worked in a demo can disappoint in production. A vector database alone isn't a memory system, and it isn't always a great retrieval system either once the data gets big.


What Exabase unlocks

Put memory on top of retrieval and the pipeline stops being a lookup and starts being something that learns.

It remembers the conversation. A follow-up question is understood in the context of what came before, because the pipeline holds the thread rather than treating each query as the first. The user stops re-explaining what they meant, and the pipeline stops answering follow-ups as though they were openers.

It learns what context helps. With use, the pipeline accumulates a picture of what the user cares about and what kind of context has been useful to them, so retrieval is shaped by more than the literal words of the current query. The same question from two different users, with two different histories, can pull the context each one actually needs.

It keeps its picture current. When something a user told the pipeline changes, memory resolves the contradiction rather than carrying the stale version alongside the new one, so the pipeline doesn't keep grounding answers in facts that have been superseded. And the retrieval underneath holds its quality as the corpus grows, so the pipeline that worked on a thousand documents still works on a million.


How it works

Four primitives combine into a RAG pipeline that remembers: Extract to ingest, Resources and Deep Search for retrieval, and Memory for the part standard RAG is missing.

Extract

Extract is the ingestion front door. It turns documents of every kind (PDFs, images, audio, video, web pages) into clean, structured text chunks through one API, with page numbers and timestamps so a retrieved chunk knows where it came from. Instead of maintaining a different parser for every source format, you submit anything and get back text ready to index. The general case is covered in document extraction at scale.
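To make "a retrieved chunk knows where it came from" concrete, here is a minimal sketch of a chunk record that carries its provenance. The `Chunk` class, its field names, and the `citation` helper are assumptions made for illustration; they are not the shape Extract actually returns.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Chunk:
    """One extracted text chunk plus where it came from (hypothetical shape)."""
    source: str                            # file name or URL of the original document
    text: str                              # clean text ready to index
    page: Optional[int] = None             # set for paginated sources such as PDFs
    start_seconds: Optional[float] = None  # set for audio and video

    def citation(self) -> str:
        """Render a human-readable pointer back to the source."""
        if self.page is not None:
            return f"{self.source}, p. {self.page}"
        if self.start_seconds is not None:
            minutes, seconds = divmod(int(self.start_seconds), 60)
            return f"{self.source} @ {minutes:02d}:{seconds:02d}"
        return self.source


# Illustrative data only; neither file exists.
pdf_chunk = Chunk(source="handbook.pdf", text="Refunds take 5 days.", page=12)
audio_chunk = Chunk(source="call.mp3", text="We agreed on Friday.", start_seconds=754.2)
```

Keeping the page or timestamp on the chunk itself means an answer can cite its source without a second lookup.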

Resources and Deep Search

Resources store the extracted content, and Deep Search is the retrieval layer over it. Search happens at the paragraph level and is hybrid by default: it combines semantic matching with typo-tolerant keyword search, so both conceptual queries and exact terms land. Crucially, it's built to hold retrieval quality as the corpus scales, which is exactly where do-it-yourself vector search tends to fall into semantic collapse. This is the retrieval half of RAG, done properly.
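For intuition about what hybrid retrieval means, the sketch below merges a typo-tolerant keyword ranking with a semantic ranking using reciprocal rank fusion, a common general-purpose fusion technique. It is not a description of how Deep Search works internally. The function names, the 0.8 similarity threshold, the sample paragraphs, and the hard-coded `semantic_ranking` (standing in for an embedding model's output) are all assumptions.

```python
import re
from difflib import SequenceMatcher
from typing import Dict, List, Sequence


def fuzzy_keyword_score(query: str, text: str, threshold: float = 0.8) -> float:
    """Fraction of query words that approximately match some word in the text."""
    query_words = re.findall(r"\w+", query.lower())
    text_words = re.findall(r"\w+", text.lower())
    if not query_words:
        return 0.0
    hits = sum(
        1
        for q in query_words
        if any(SequenceMatcher(None, q, t).ratio() >= threshold for t in text_words)
    )
    return hits / len(query_words)


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge ranked lists: each list adds 1 / (k + rank) to a paragraph's score."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, paragraph_id in enumerate(ranking, start=1):
            scores[paragraph_id] = scores.get(paragraph_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


paragraphs = {
    "p1": "Use kubectl rollout undo to rollback a Kubernetes deployment",
    "p2": "Reverting a release restores the previous version of the service",
    "p3": "Our office is closed on public holidays",
}
query = "kubernets rollback"  # note the typo in the first word

keyword_scores = {pid: fuzzy_keyword_score(query, text) for pid, text in paragraphs.items()}
keyword_ranking = sorted(
    (pid for pid, score in keyword_scores.items() if score > 0),
    key=lambda pid: (-keyword_scores[pid], pid),
)

# Assumption: this ordering stands in for the output of an embedding model.
semantic_ranking = ["p2", "p1", "p3"]

fused = reciprocal_rank_fusion([keyword_ranking, semantic_ranking])
```

The paragraph that both signals agree on rises to the top, while a paragraph that only matches conceptually still makes the list.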

Memory

Exabase Memory is the half standard RAG leaves out. It holds the evolving picture of the user and the conversation: what they asked before, what they care about, what context proved useful, facts about their situation. You send interactions in and Exabase extracts and maintains what matters, resolving contradictions so the picture stays current. This is what turns a stateless retriever into a pipeline that improves with use, and understanding why it's a distinct capability from retrieval is worth the detour through what a memory layer actually is.
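As a rough illustration of keeping the picture current, the sketch below stores one current value per subject and attribute and retires the older value when a newer one contradicts it. It assumes facts arrive already structured and uses a simple newest-wins rule. A real memory layer extracts facts from free text and resolves them with more care, so the `Fact` and `MemoryStore` names and this rule are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Fact:
    subject: str
    attribute: str
    value: str
    observed_at: int  # turn counter or timestamp; larger means newer


@dataclass
class MemoryStore:
    """Hypothetical store: one current fact per (subject, attribute)."""
    current: Dict[Tuple[str, str], Fact] = field(default_factory=dict)
    superseded: List[Fact] = field(default_factory=list)

    def remember(self, fact: Fact) -> None:
        key = (fact.subject, fact.attribute)
        existing = self.current.get(key)
        if existing is None:
            self.current[key] = fact
        elif fact.observed_at >= existing.observed_at:
            # The newer observation wins; keep the old one only as history.
            if existing.value != fact.value:
                self.superseded.append(existing)
            self.current[key] = fact
        elif existing.value != fact.value:
            # A late-arriving older observation never overrides a newer one.
            self.superseded.append(fact)

    def recall(self, subject: str, attribute: str) -> Optional[str]:
        fact = self.current.get((subject, attribute))
        return fact.value if fact else None


memory = MemoryStore()
memory.remember(Fact("user", "plan", "free", observed_at=1))
memory.remember(Fact("user", "plan", "enterprise", observed_at=5))
memory.remember(Fact("user", "plan", "trial", observed_at=3))  # arrives out of order
```

Only the current value is ever offered to the prompt, so an answer is not grounded in the stale plan even though the history is kept.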


Example architecture

The pipeline has two flows, an ingestion path and a query path, and memory threads through the second.

Ingestion. Run source documents through Extract to get clean chunks, and store them as Resources so they're indexed for Deep Search. This is the corpus your pipeline retrieves from.

Query. On each query, do two things rather than one. Run a Deep Search over the corpus for relevant chunks, as any RAG pipeline would, and retrieve memory for what you know about this user and conversation. Both go into the prompt: the retrieved documents ground the answer, the memory personalises and contextualises it.

After the answer, send the interaction to Memory so the pipeline accumulates context for next time: what was asked, what helped, what changed.
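The query path can be sketched as two reads and one write. Everything here is a stand-in: `search_corpus` ranks by shared words instead of calling a real search service, memory is a plain list of notes, and `model` is any callable that takes a prompt. None of these names come from the Exabase API.

```python
from typing import Callable, List


def search_corpus(query: str, corpus: List[str], top_k: int = 2) -> List[str]:
    """Stand-in retriever: rank chunks by how many words they share with the query."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(chunk.lower().split())), chunk) for chunk in corpus]
    ranked = sorted((item for item in scored if item[0] > 0), key=lambda item: -item[0])
    return [chunk for _, chunk in ranked[:top_k]]


def build_prompt(query: str, chunks: List[str], memories: List[str]) -> str:
    """Put both sources in the prompt: memory for context, chunks for grounding."""
    lines = ["Known about this user:"]
    lines += [f"- {note}" for note in memories] or ["- (nothing yet)"]
    lines.append("Retrieved documents:")
    lines += [f"- {chunk}" for chunk in chunks] or ["- (no matches)"]
    lines.append(f"Question: {query}")
    return "\n".join(lines)


def answer(query: str, corpus: List[str], memory: List[str],
           model: Callable[[str], str]) -> str:
    chunks = search_corpus(query, corpus)                # read 1: the corpus
    prompt = build_prompt(query, chunks, list(memory))   # read 2: the user's memory
    reply = model(prompt)
    memory.append(f"asked: {query}")                     # write back for next time
    return reply


corpus = [
    "Refunds are processed within five days",
    "Shipping is free over fifty dollars",
]
memory: List[str] = []
echo_model = lambda prompt: prompt  # returns the prompt so it can be inspected

first = answer("how long do refunds take", corpus, memory, echo_model)
second = answer("and shipping", corpus, memory, echo_model)
```

The second prompt already contains a note about the first question, which is the behaviour a stateless pipeline cannot give you.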

Scope per user or tenant with a Base if your pipeline serves many users who each need their own memory and corpus, following the multi-tenant SaaS pattern.
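The isolation this gives you can be pictured as one container per tenant, each holding its own corpus and its own memory. The `TenantScope` class and the tenant names below are hypothetical and only illustrate the idea; they do not show how a Base is created or addressed.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import DefaultDict, List


@dataclass
class TenantScope:
    """Hypothetical per-tenant container: its own corpus and its own memory."""
    corpus: List[str] = field(default_factory=list)
    memory: List[str] = field(default_factory=list)


# A scope is created the first time a tenant id is used.
scopes: DefaultDict[str, TenantScope] = defaultdict(TenantScope)

scopes["acme"].corpus.append("Acme refund policy: 30 days")
scopes["acme"].memory.append("prefers short answers")
scopes["globex"].corpus.append("Globex refund policy: 14 days")
```

Because every read and write goes through a tenant's own scope, nothing one tenant says or uploads can surface in another tenant's answers.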

Documents flow in through extraction and become a searchable corpus; each query reads from both the corpus and the user's memory; each interaction writes back to memory. The retrieval stays constant in quality while the memory compounds.


What compounds over time

This is the heart of the pitch, because the entire complaint about standard RAG is that it doesn't improve, and memory is what fixes that.

A stateless RAG pipeline performs the same on its thousandth query as on its first. A pipeline with memory performs better over time: it accumulates a picture of each user, learns what context serves them, and stops re-establishing things it already knows. Because the memory self-organises, that accumulation sharpens retrieval rather than cluttering it. Reliable memory has been shown to reduce hallucinations by around 28%, an effect that grows as the memory fills in. Precise retrieval paired with good memory is also why a smaller model on a well-built pipeline can outperform a larger model on a noisy one, which is the finding behind Exabase's state-of-the-art LongMemEval result.

The do-it-yourself comparison is stark here. Rolling your own means maintaining an extraction pipeline, tuning a vector index that degrades as it grows, and then building the entire memory layer, contradiction resolution and all, from scratch on top. Infrastructure where extraction, scalable retrieval, and self-organising memory come together as one system means the pipeline gets better with use while the work of running it stays flat.


Who's building this

Anyone building a RAG application that serves the same users repeatedly: knowledge assistants, research tools, support bots, internal search, anywhere a stateless retriever leaves value on the table because it never learns. If your pipeline answers one-off questions for anonymous users, plain retrieval may be enough; the moment there's a returning user or an evolving situation, memory is the missing piece.

For the retrieval-heavy end, document extraction at scale and self-maintaining knowledge bases are close neighbours, and long-term memory for any agent covers the memory layer on its own.


Get started

Start with the getting started guide, then searching resources for the retrieval side and creating memories for the memory side. If you're weighing this against assembling your own stack, the comparison of agent memory platforms is a useful read, and there's a free tier to build against.


FAQs

Isn't this just RAG with extra steps?

It's RAG with the missing half added. Standard RAG retrieves from a corpus and forgets everything else; this keeps that retrieval and adds persistent memory of the user and conversation on top. The retrieval works as before, but the pipeline now improves with use rather than starting cold every query. The RAG vs agent memory piece explains why the two are distinct.


Do I still get normal document retrieval?

Yes. Deep Search is full semantic-plus-keyword retrieval over your corpus, which is the retrieval half of RAG. Memory is added alongside it, not instead of it.


Will retrieval quality hold as my corpus grows?

Deep Search is built to hold quality at scale, which is where naive vector search tends to suffer semantic collapse and return progressively fuzzier matches. The corpus growing shouldn't degrade your results.


What kinds of documents can I ingest?

Extract handles PDFs, images, audio, video, and web pages through one API, returning chunks with page numbers and timestamps. You submit any supported source and get back text ready to index, without maintaining a parser per format.


How does memory keep from grounding answers in stale facts?

When a fact changes, memory resolves the contradiction through entity resolution, so the superseded version stops being treated as current. The pipeline grounds answers in the up-to-date picture rather than carrying old facts forward.


Can I run this per user so each gets their own memory and corpus?

Yes. Scope each user or tenant to their own Base for fully isolated memory and search, following the multi-tenant SaaS pattern. A shared corpus with per-user memory is also a common arrangement.


Ship your first app in minutes.
