Short-term vs long-term memory in AI agents

Jonathan Bree

Most agent frameworks handle short-term memory well. Long-term memory is a different infrastructure problem entirely. Here is why.
When developers talk about agent memory, they are often talking about two different things without realising it. Short-term memory keeps a conversation coherent. Long-term memory makes an agent useful across time. These are not the same problem, and the infrastructure that solves one does not address the other.
What short-term memory does
Short-term memory in an AI agent is conversation state within a session. It is the mechanism that lets an agent remember what was said three messages ago, maintain context across a multi-turn exchange, and avoid asking the user to repeat themselves within a single conversation.
Most agent frameworks handle this adequately. LangChain's conversation buffer, LlamaIndex's chat history, CrewAI's session memory: these tools store recent exchanges and inject them into the prompt. They work. For a single conversation with a single user in a single session, they do the job.
The ceiling is the session boundary. When the conversation ends, short-term memory ends with it. The next session starts from zero. The agent that helped a user work through a complex problem yesterday has no recollection of it today.
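A minimal sketch of that pattern makes the ceiling concrete. The `SessionBuffer` class below is hypothetical and is not the API of LangChain, LlamaIndex, or CrewAI. It only illustrates the general idea of keeping recent turns and injecting them into the prompt.

```python
from dataclasses import dataclass, field


@dataclass
class SessionBuffer:
    """Hypothetical short-term memory: keeps the last `max_turns` messages of one session."""
    max_turns: int = 6
    turns: list[tuple[str, str]] = field(default_factory=list)

    def add(self, role: str, text: str) -> None:
        self.turns.append((role, text))
        # Drop the oldest messages once the buffer exceeds its window.
        self.turns = self.turns[-self.max_turns:]

    def build_prompt(self, user_message: str) -> str:
        # Inject the buffered history ahead of the new user message.
        history = "\n".join(f"{role}: {text}" for role, text in self.turns)
        return f"{history}\nuser: {user_message}".strip()


# Session one: the buffer carries context forward within the conversation.
monday = SessionBuffer()
monday.add("user", "We decided to migrate the billing service to Postgres.")
monday.add("assistant", "Noted. I will plan around Postgres.")
monday_prompt = monday.build_prompt("Which database did we pick?")

# Session two: a fresh buffer has no trace of yesterday's decision.
tuesday = SessionBuffer()
tuesday_prompt = tuesday.build_prompt("Which database did we pick?")
```

The Monday prompt contains the Postgres decision. The Tuesday prompt contains only the new question, because nothing outlives the buffer object.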
For simple use cases this is acceptable. For agents that are supposed to know their users, track ongoing work, or reason about how things have changed over time, it is a fundamental limitation.
Why long-term memory is a harder problem
Long-term memory requires the system to do something qualitatively different from buffering recent conversation history. It must store information persistently across sessions, retrieve the right fragments from a potentially large and noisy corpus, synthesise information that is distributed across multiple past conversations, and track how facts and preferences have changed over time.
Each of these is a non-trivial engineering challenge. Together they constitute a different class of problem from context window management.
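To show where the difficulty starts, here is a rough sketch of only the first requirement, persistence across sessions, using Python's built-in `sqlite3` module. The table layout, the `remember` and `recall` helpers, and the database path are all assumptions made for illustration. Retrieval, synthesis, and change tracking are deliberately left out, and those are the harder parts.

```python
import os
import sqlite3
import tempfile

# Hypothetical storage location; a real deployment would use a durable path or service.
DB_PATH = os.path.join(tempfile.mkdtemp(), "memory.db")


def remember(user_id: str, text: str, recorded_at: str) -> None:
    """Write one memory to disk so it outlives the current session."""
    conn = sqlite3.connect(DB_PATH)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS memories "
                "(user_id TEXT, text TEXT, recorded_at TEXT)"
            )
            conn.execute(
                "INSERT INTO memories VALUES (?, ?, ?)",
                (user_id, text, recorded_at),
            )
    finally:
        conn.close()


def recall(user_id: str) -> list[tuple[str, str]]:
    """Return every stored memory for a user, oldest first."""
    conn = sqlite3.connect(DB_PATH)
    try:
        return conn.execute(
            "SELECT recorded_at, text FROM memories "
            "WHERE user_id = ? ORDER BY recorded_at",
            (user_id,),
        ).fetchall()
    finally:
        conn.close()


# An earlier session writes; a later session opens a new connection and reads.
remember("u1", "Prefers weekly summaries.", "2024-02-10")
later_session = recall("u1")
```

Note that `recall` returns everything for the user. Storing memories is the easy part. Deciding which handful of rows matter for a given question is where the remaining three requirements come in.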
The retrieval challenge alone is significant. In a short-term memory system, relevant information is always recent and always nearby. The signal-to-noise ratio is high because the corpus is small. In a long-term memory system, relevant information may be sparse, old, expressed in different language from the current query, and distributed across dozens of past sessions. Finding it requires more than buffering. It requires a retrieval architecture designed for noise, distance, and temporal variation.
Multi-session synthesis is harder still. A user makes a decision in February. They reference it obliquely in April. The memory system must identify that the April question is connected to the February decision, retrieve both, and assemble them into a coherent context. A single query is unlikely to surface both fragments, because the wording of the April question may share little with the wording of the February decision. The system must decompose the question, run parallel retrieval passes, and piece together an answer from parts that were never stored together.
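The decompose, retrieve, and merge steps can be sketched in a few lines. Everything here is an assumption for illustration: the `Memory` record, the toy corpus, the word-overlap retriever, and especially `decompose`, which returns hand-written sub-queries where a real system would presumably use a model. This is not a description of how any particular product does it.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Memory:
    session: str
    text: str


# Toy corpus standing in for months of stored conversation fragments.
CORPUS = [
    Memory("february", "Decided to move the reporting pipeline to weekly batch runs."),
    Memory("march", "Discussed hiring plans for the design team."),
    Memory("april", "Asked whether the dashboard refresh is still on the schedule we agreed."),
]


def tokens(text: str) -> set[str]:
    return {word.strip(".,?").lower() for word in text.split()}


def retrieve(query: str, k: int = 1) -> list[Memory]:
    """Hypothetical retriever: ranks memories by word overlap with the query."""
    q = tokens(query)
    ranked = sorted(CORPUS, key=lambda m: len(q & tokens(m.text)), reverse=True)
    return [m for m in ranked[:k] if q & tokens(m.text)]


def decompose(question: str) -> list[str]:
    """Placeholder: hand-written sub-queries standing in for a real decomposition step."""
    return ["dashboard refresh schedule", "decided reporting pipeline batch runs"]


def gather_context(question: str) -> list[Memory]:
    sub_queries = decompose(question)
    # One retrieval pass per sub-query, run in parallel.
    with ThreadPoolExecutor() as pool:
        passes = list(pool.map(retrieve, sub_queries))
    # Merge the passes, dropping duplicates while keeping order.
    merged: list[Memory] = []
    for hits in passes:
        for memory in hits:
            if memory not in merged:
                merged.append(memory)
    return merged


question = "Is the dashboard refresh still on the schedule we agreed?"
single_pass = retrieve(question)        # one query, top result only
multi_pass = gather_context(question)   # decomposed, merged result
```

In this toy setup the single pass returns only the April fragment, while the decomposed version also pulls in the February decision. The merge step is where fragments that were never stored together end up in the same context.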
This is not a problem of scale. It is a problem of architecture. A system that was designed for single-session coherence cannot be tuned into a multi-session synthesis engine. The assumptions are different at a foundational level.
What the benchmark data shows
LongMemEval is structured around six memory categories that test increasingly difficult recall tasks. The single-session categories (recalling facts, preferences, and other information from within a single conversation) represent the easier end of the problem. Multi-session reasoning and temporal reasoning represent the hard end.
The pattern across all systems that have published LongMemEval results is consistent: single-session categories are near-ceiling. Recalling a fact stated explicitly within a conversation is a largely solved problem for any system with competent retrieval. The difficulty begins when the relevant information is distributed across multiple sessions, or when it has changed over time and the system must determine which version is current.
This is not a failure of any particular system. It reflects the genuine difficulty of the underlying problem. Piecing together sparse relevant information from a large noisy corpus of conversational history, across sessions that may span weeks or months, is fundamentally harder than recalling something said ten minutes ago in the same conversation.
Exabase's M-1 achieves 96.4% overall on LongMemEval, with strong performance across all six categories, including 94% on multi-session reasoning and 95.5% on temporal reasoning at top-50. These results reflect an architecture designed specifically for the long-term memory problem: query decomposition for multi-session questions, temporal salience so recency interacts correctly with relevance, and contradiction resolution so evolved facts supersede earlier ones rather than coexisting with them.
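For readers who want a feel for the last two ideas, here is one possible formulation in miniature. It is not M-1's implementation, which the post does not describe. The `Fact` record, the exponential half-life decay, the precomputed `relevance` score, and the "latest value per key wins" rule are all assumptions chosen to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    key: str          # what the fact is about, e.g. "preferred_editor"
    value: str
    day: int          # day the fact was recorded, on a hypothetical timeline
    relevance: float  # similarity to the current query, assumed precomputed in [0, 1]


def salience(fact: Fact, today: int, half_life_days: float = 30.0) -> float:
    """Relevance weighted by exponential recency decay (one possible formulation)."""
    age = today - fact.day
    return fact.relevance * 0.5 ** (age / half_life_days)


def resolve(facts: list[Fact]) -> list[Fact]:
    """Keep only the most recent fact per key, so newer values supersede older ones."""
    latest: dict[str, Fact] = {}
    for fact in facts:
        if fact.key not in latest or fact.day > latest[fact.key].day:
            latest[fact.key] = fact
    return list(latest.values())


facts = [
    Fact("preferred_editor", "Vim", day=0, relevance=0.9),
    Fact("preferred_editor", "VS Code", day=60, relevance=0.8),
    Fact("team_size", "12", day=30, relevance=0.4),
]

# Resolve contradictions first, then rank what remains by salience.
current = resolve(facts)
ranked = sorted(current, key=lambda f: salience(f, today=90), reverse=True)
```

The older "Vim" preference scores higher on raw relevance, but it is superseded before ranking, so it never competes with the newer value. Among the surviving facts, recency scales relevance instead of replacing it: a fact one half-life old keeps half its relevance score.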
Why Exabase was built for long-term memory first
Exabase was not built speculatively. It was built because Fabric, a knowledge and workspace product with real users storing and retrieving information across months and years, needed a memory layer that could handle exactly this problem at production scale.
The requirements that came out of that experience shaped M-1's architecture from the ground up: cross-session persistence, temporal tracking, contradiction resolution, multi-session synthesis. These were not features added later. They were the original problem.
That origin matters because long-term memory systems reveal their weaknesses slowly. The gaps that do not show up in a demo or a short evaluation appear after weeks of use, when the corpus is large, the sessions are numerous, and the questions start requiring synthesis across time. Building for that from the start produces a different system than retrofitting long-term capabilities onto a short-term memory architecture.
The Memory API is the result. Persistent, cross-session, temporally aware memory for production AI agents. See the docs to get started.







