What is agent memory?

LLMs are stateless by design. Agent memory is the infrastructure layer that changes this. Here is what it is, how it works, and why architecture matters more than model scale.

Jonathan Bree

LLMs are stateless by design. Every conversation starts from zero. Agent memory is the infrastructure layer that changes this – and building it well turns out to be one of the harder problems in production AI.

An AI agent without memory is a very capable stranger. It can reason, write, analyse, and respond with sophistication. It has no idea who you are, what you told it last week, or what changed since your last conversation. Every session is a first meeting.

Agent memory is the infrastructure that gives AI agents persistent, evolving knowledge about the people and contexts they interact with. It is not a feature. It is an architectural layer that determines whether an agent is genuinely useful over time or merely impressive in a single session.

Why LLMs are stateless by design

Large language models process a context window: a bounded block of tokens containing everything the model can see during the current inference call. The model attends to what is in that window and generates a response. When the call ends, nothing persists. The next call starts with whatever is placed in the new context window, which by default is nothing.

This is not an oversight. Statelessness makes models simpler to serve, easier to scale, and cheaper to run. For many use cases it is perfectly adequate. A model that answers a one-off question does not need to remember anything.

The problem arises when agents are expected to operate over time: across sessions, across users, across evolving facts and preferences. Statelessness, which is invisible in a demo, becomes a fundamental limitation in production. The agent that helped a user yesterday has no recollection of it today. The preference stated three weeks ago is gone. The decision made last month does not exist.

Agent memory solves this by maintaining state externally, outside the model, and injecting the right information into the context window at query time.
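The pattern described above can be sketched in a few lines. This is a minimal illustration, not a production design: MemoryStore, build_prompt, the word-overlap scoring, and the prompt layout are all hypothetical names and choices made for the example.

```python
import re
from dataclasses import dataclass, field


def _words(text: str) -> set[str]:
    """Lowercased word set, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))


@dataclass
class MemoryStore:
    """Hypothetical in-process store; a real system would use a database."""
    _by_user: dict[str, list[str]] = field(default_factory=dict)

    def add(self, user_id: str, text: str) -> None:
        self._by_user.setdefault(user_id, []).append(text)

    def search(self, user_id: str, query: str, limit: int = 3) -> list[str]:
        # Naive relevance: number of words shared by the query and the memory.
        query_words = _words(query)
        scored = [
            (len(query_words & _words(memory)), memory)
            for memory in self._by_user.get(user_id, [])
        ]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [memory for score, memory in ranked[:limit] if score > 0]


def build_prompt(store: MemoryStore, user_id: str, question: str) -> str:
    """Inject retrieved memories into the prompt for a stateless model call."""
    memories = store.search(user_id, question)
    context = "\n".join(f"- {m}" for m in memories) or "- (nothing stored)"
    return f"Known about this user:\n{context}\n\nUser: {question}"


store = MemoryStore()
store.add("u1", "Prefers Python for scripting tasks")
store.add("u1", "Project deadline is end of Q3")
prompt = build_prompt(store, "u1", "Which language should I use for scripting?")
print(prompt)
```

The model stays stateless throughout. All state lives in the store, and the only thing the model ever sees is the prompt assembled at query time.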

What agent memory is not

Three things are commonly confused with agent memory. The distinctions matter.

Context window management is not memory. Extending the context window, summarising conversation history, and truncating old messages to make room for new ones are all ways of managing what fits in a single inference call. None of them persists information across sessions. When the session ends, the information is gone. At best this is short-term memory, not the persistent memory this article is about.

RAG is not agent memory. Retrieval-augmented generation retrieves documents from a static corpus at query time to augment a prompt. It answers the question: what is in this knowledge base that is relevant to this query? Agent memory answers a different question: what do I know about this user, this context, or this situation, and how has it changed? The two are complementary but distinct. See RAG vs agent memory for a full treatment.

A vector database is not a memory system. Pinecone, Weaviate, Qdrant, and pgvector are retrieval infrastructure. They store embeddings and retrieve by similarity. A production memory system requires query decomposition, temporal salience, contradiction resolution, entity resolution, importance scoring, and cross-memory coherence on top of vector search. A vector database gives you one signal. Memory requires several. See why a vector database is not a memory system.

The cognitive science foundation

The most useful framework for understanding agent memory comes not from computer science but from cognitive psychology. In 1972, Endel Tulving proposed a distinction between two memory systems that has shaped memory research ever since.

Episodic memory is memory for events: what happened, when, and in what context. It is autobiographical and temporally situated. Remembering a specific conversation is episodic memory.

Semantic memory is memory for facts: what is true, independent of when or how it was learned. Knowing that a user prefers Python, or that a project deadline is end of Q3, is semantic memory.

Most AI memory systems handle one of these and ignore the other. Systems that store raw conversation logs have episodic memory but struggle to retrieve structured facts efficiently. Systems that extract and store facts have semantic memory but lose the context that gives facts meaning. A production memory system needs both. See episodic vs semantic memory for AI agents.
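One way to keep both kinds of memory is to store each distilled fact with a link back to the episode it came from. The sketch below assumes that approach; the Episode and Fact classes and their field names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Episode:
    """Episodic record: what was said, when, and in which session."""
    session_id: str
    occurred_at: datetime
    text: str


@dataclass(frozen=True)
class Fact:
    """Semantic record: a distilled claim, linked back to its source episode."""
    subject: str
    attribute: str
    value: str
    source: Episode


episode = Episode("s-17", datetime(2025, 3, 4, 10, 30), "I mostly write Python these days.")
fact = Fact("user", "preferred_language", "Python", source=episode)

# The fact answers "what is true"; the linked episode answers "when and how it was learned".
print(fact.value, "| learned on", fact.source.occurred_at.date())
```

Structured lookups can run against the facts, while the source link preserves the context that gives each fact its meaning.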

Equally important is Bartlett's concept of reconstructive memory, developed in 1932. Memory is not retrieval of a stored record. It is assembly: piecing together fragments, inference, and context into a coherent whole. This is why a memory system that simply returns the most similar stored items is not sufficient. The assembly of those items into coherent, consistent context is itself a non-trivial problem, and one that architecture must solve rather than delegating to the model.

Howard and Kahana's temporal context model, cited in the architecture of M-1 (the Exabase memory system discussed below), adds another layer: memory is not just associative but temporally sensitive. Recent memories are more accessible. Older memories of high relevance persist. The interaction between recency and relevance must be modelled explicitly, not assumed away.
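A simple way to make recency and relevance interact explicitly is to discount a relevance score by an exponential decay. The sketch below is an illustrative heuristic only: it is neither the temporal context model nor M-1's scoring, and the salience function, the 30-day half-life, and the 0.5 floor are assumptions made for the example.

```python
def salience(relevance: float, age_days: float, half_life_days: float = 30.0) -> float:
    """Relevance discounted by an exponential recency decay (illustrative only)."""
    recency = 0.5 ** (age_days / half_life_days)
    # The 0.5 floor means age can halve a score but never erase it,
    # so old memories of high relevance can still surface.
    return relevance * (0.5 + 0.5 * recency)


# (text, relevance to the current query, age in days) - made-up values.
candidates = [
    ("Project deadline is end of Q3", 0.95, 400.0),
    ("Asked about lunch options", 0.40, 1.0),
    ("Mentioned a planning meeting", 0.60, 90.0),
]

ranked = sorted(candidates, key=lambda c: salience(c[1], c[2]), reverse=True)
for text, relevance, age in ranked:
    print(f"{salience(relevance, age):.3f}  {text}")
```

With these numbers the year-old but highly relevant memory still ranks first, while at equal relevance a newer memory always beats an older one.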

Types of agent memory

Short-term memory is conversation state within a session. It keeps a multi-turn exchange coherent. Most agent frameworks provide this out of the box. It does not persist between sessions. See short-term vs long-term memory in AI agents.

Long-term memory persists across sessions, across time, and across a potentially large and noisy corpus of past interactions. It is the hard problem. Retrieving relevant information from months of conversation history, synthesising across multiple sessions, tracking how facts have changed: these require dedicated infrastructure, not a context window management utility. See when to outgrow your framework's built-in memory.

Episodic memory stores events and interactions in their temporal context. It is the raw material from which answers are reconstructed.

Semantic memory stores distilled facts, preferences, and relationships, independent of the specific interactions in which they were learned.

Working memory is the active context assembled for a specific query: the result of retrieval, not the store itself. It is what gets injected into the prompt. Its quality depends entirely on the quality of the retrieval system that assembled it.

The failure modes of naive approaches

Understanding what agent memory is requires understanding where naive implementations break down.

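Semantic collapse occurs when a vector store grows large enough that embedding similarity becomes an unreliable guide to relevance. As more items crowd the same neighbourhood of the embedding space, the closest matches are no longer reliably the most relevant ones. Retrieval continues to return results with high similarity scores, and the answers are quietly wrong.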

Memory drift occurs when stored facts become stale as the world changes. A user changes jobs, updates a preference, or corrects something they said earlier. The old version persists in the store and continues to surface alongside the new one, with no signal about which is current.
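One common guard against drift is to version facts, so that a new value closes the old one instead of sitting beside it. The sketch below assumes observations arrive in chronological order; FactStore, FactVersion, and the example employers are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class FactVersion:
    value: str
    valid_from: date
    valid_to: Optional[date] = None  # None means "still current"


class FactStore:
    """Keeps every version of a fact but exposes only one as current."""

    def __init__(self) -> None:
        self._versions: dict[tuple[str, str], list[FactVersion]] = {}

    def record(self, subject: str, attribute: str, value: str, observed: date) -> None:
        history = self._versions.setdefault((subject, attribute), [])
        if history and history[-1].value == value:
            return  # Same value restated; nothing changes.
        if history:
            history[-1].valid_to = observed  # Close the superseded version.
        history.append(FactVersion(value, observed))

    def current(self, subject: str, attribute: str) -> Optional[str]:
        history = self._versions.get((subject, attribute), [])
        return history[-1].value if history else None

    def history(self, subject: str, attribute: str) -> list[FactVersion]:
        return list(self._versions.get((subject, attribute), []))


facts = FactStore()
facts.record("user", "employer", "Acme", date(2024, 1, 10))
facts.record("user", "employer", "Globex", date(2025, 2, 1))

print(facts.current("user", "employer"))
print(facts.history("user", "employer"))
```

The old value is kept for questions about the past, but it carries an end date, so retrieval has a clear signal about which version is current.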

Entity resolution failure occurs when the same entity is referenced differently across conversations and the memory system treats each reference as distinct. "John," "my manager," and "John Smith" accumulate as separate, unconnected fragments rather than a coherent picture of one person.
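At its simplest, entity resolution is an alias table that maps every surface mention to one canonical ID. The sketch below assumes the aliases have already been linked, for example by an upstream extraction step; deciding that "my manager" means John Smith is the hard part and is not shown. EntityRegistry and the ID format are hypothetical.

```python
class EntityRegistry:
    """Maps surface mentions to one canonical entity ID (hypothetical sketch)."""

    def __init__(self) -> None:
        self._alias_to_id: dict[str, str] = {}
        self._fragments: dict[str, list[str]] = {}

    @staticmethod
    def _normalise(mention: str) -> str:
        # Case-insensitive, whitespace-insensitive matching.
        return " ".join(mention.lower().split())

    def link(self, mention: str, entity_id: str) -> None:
        """Record that a mention refers to a known entity."""
        self._alias_to_id[self._normalise(mention)] = entity_id

    def resolve(self, mention: str) -> str:
        """Return the canonical ID, minting a new one for unseen mentions."""
        key = self._normalise(mention)
        return self._alias_to_id.setdefault(key, f"entity:{key}")

    def remember(self, mention: str, text: str) -> None:
        self._fragments.setdefault(self.resolve(mention), []).append(text)

    def about(self, mention: str) -> list[str]:
        return list(self._fragments.get(self.resolve(mention), []))


registry = EntityRegistry()
for alias in ("John Smith", "John", "my manager"):
    registry.link(alias, "person:john-smith")

registry.remember("John", "Approved the Q3 budget")
registry.remember("my manager", "Is on leave next week")
print(registry.about("John Smith"))
```

Because all three mentions resolve to the same ID, the fragments accumulate into one picture of one person instead of three disconnected ones.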

Context window overflow occurs when the instinct to solve memory problems by stuffing more into the context window backfires. Long-context evaluations have repeatedly found that model accuracy tends to degrade as the context grows, particularly when the relevant detail is buried among irrelevant material. More context is not better context. Precise retrieval of the right memories outperforms cramming everything in.
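The alternative to cramming is to select memories under an explicit budget. The sketch below is a minimal greedy version, assuming each memory already has a relevance score; select_within_budget, the score threshold, and the whitespace word count used as a stand-in for a real tokeniser are all assumptions made for the example.

```python
def select_within_budget(
    scored_memories: list[tuple[float, str]],
    max_tokens: int,
    min_score: float = 0.5,
) -> list[str]:
    """Greedily keep the highest-scoring memories that fit the token budget."""
    chosen: list[str] = []
    used = 0
    for score, text in sorted(scored_memories, key=lambda pair: pair[0], reverse=True):
        if score < min_score:
            break  # Everything after this point is weaker still.
        cost = len(text.split())  # Crude proxy for a token count.
        if used + cost <= max_tokens:
            chosen.append(text)
            used += cost
    return chosen


scored = [
    (0.30, "User mentioned the weather was nice on Tuesday"),
    (0.92, "Project deadline moved to end of Q3"),
    (0.10, "Said hello"),
    (0.85, "User prefers Python over JavaScript"),
]

print(select_within_budget(scored, max_tokens=12))
```

Only the two memories that clear the threshold and fit the budget reach the prompt; the low-scoring items are left out even when there would be room for them.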

Why architecture matters more than model scale

The instinct when memory systems underperform is to reach for a larger model. A more powerful model can extract correct answers from noisier context. That is true, and it is still the wrong fix: it pays more per call to compensate for poor retrieval instead of repairing the retrieval.

Retrieval quality determines what reaches the model. A large model working from imprecise, noisy, or stale context tends to produce imprecise answers. A smaller model working from precisely retrieved, temporally accurate, internally consistent context can outperform it. The benchmark results make this concrete.

On LongMemEval, the most rigorous public benchmark for conversational memory retrieval, M-1 scores 96.4% using Gemini 3 Flash. Competing systems using Gemini 3 Pro, a significantly larger and more expensive model, score between 85% and 94.8%. The gap is not explained by model capability. It is explained by retrieval architecture.

M-1 combines semantic similarity and lexical precision as separate signals, applies temporal salience so recency interacts correctly with relevance, resolves contradictions before context reaches the model, builds entity relationships at storage time, and applies cross-memory coherence in the re-ranking phase. The result is a retrieval system that surfaces what is accurate rather than what is merely similar. See the full research paper for methodology and results.
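To make the idea of combining semantic and lexical signals concrete, the sketch below merges two ranked lists with reciprocal rank fusion, a widely used general-purpose technique. The source does not say how M-1 fuses its signals, so this is a stand-in for illustration only; the memory IDs and both rankings are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per item."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, memory_id in enumerate(ranking, start=1):
            scores[memory_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda memory_id: (-scores[memory_id], memory_id))


semantic_ranking = ["m3", "m1", "m2"]  # e.g. from embedding similarity
lexical_ranking = ["m1", "m4", "m3"]   # e.g. from keyword matching

fused = reciprocal_rank_fusion([semantic_ranking, lexical_ranking])
print(fused)
```

A memory that ranks well on both signals (m1 here) rises above one that tops a single list, which is the point of treating the signals separately rather than trusting similarity alone.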

Exabase and agent memory

Exabase was built for long-term agent memory first, because Fabric needed it. Fabric is a knowledge and workspace product whose users store and retrieve information across months and years. The requirements that came from that experience shaped M-1's architecture from the ground up.

The Memory API handles the full memory pipeline: chunking, enrichment, entity resolution, relationship tracking, contradiction resolution, temporal weighting, and hybrid retrieval. You add a memory, you search memories, you get back context that is accurate and ready for your prompt. The infrastructure complexity is managed. You focus on building.

Agent memory is not a feature. It is the difference between an AI agent that is impressive once and one that is genuinely useful over time. See the docs to get started.