Exabase M-1: Beyond Model Scale for Long-Term Memory
Achieving state-of-the-art on LongMemEval using a smaller, cheaper model
Exabase Research, May 2026
Abstract
Long-term memory remains one of the least solved and least rigorously evaluated capabilities in production AI systems.
We present Mneme-1 (M-1), Exabase's first-generation long-term memory engine, and evaluate it against LongMemEval, the most comprehensive benchmark for conversational memory retrieval. M-1 achieves state-of-the-art results across all recall depths, reaching 96.4% at top-50 using Gemini 3 Flash as the answering model.
Notably, these results were obtained without any question-specific prompt engineering and without large frontier models, while absorbing errors in the benchmark itself, which represent a ceiling on possible score improvement. We report our methodology, results, and analysis, including a discussion of evaluation ceiling effects encountered at high accuracy levels.
1. Introduction
The promise of persistent memory in AI systems is straightforward: a system that remembers what you've told it, across conversations or interactions, and over time, should be fundamentally more useful than one that doesn't. In practice, this is a hard problem. Human memory is not a database lookup. It is reconstructive, associative, and temporally sensitive. A person asked "what restaurant did I mention last month?" does not perform a keyword search. They reconstruct context from fragments: the conversation that it came up in, who they were with at the time, what else was happening.
Building systems that approximate this kind of recall, selectively retrieving and assembling relevant information from a large noisy body of conversational history, is the core challenge of long-term memory for AI.
Despite growing interest in this field, rigorous evaluation tools have lagged behind. Many systems are demonstrated anecdotally or tested on internal benchmarks that are not publicly reproducible. Where publicly shared benchmarks exist, the reported results often reflect cherry-picked conditions that diverge significantly from practical production use: expensive, oversized models, question-specific prompt tuning, or evaluation setups that would be impractical at scale.
We set out to evaluate M-1 under production-realistic conditions, using a public benchmark, a practical model, and a methodology we could be fully transparent about.
2. About LongMemEval
LongMemEval (Wu et al., 2024) is a benchmark designed to evaluate long-term memory in conversational AI systems. It presents a challenging retrieval problem: approximately 115,000 tokens of conversational history spread across multiple sessions, with relevant information scattered sparsely among a much larger volume of irrelevant dialogue. The benchmark contains 500 questions spanning six categories, each testing a distinct facet of memory capability:
Single-session (user information). Can the system recall facts the user explicitly stated about themselves? This is the most basic form of memory: "I told you I'm allergic to shellfish" or "My daughter's name is Elara." Simple in isolation, but the information must be retrieved from within a much larger conversational context.
Single-session (assistant-provided). Can the system recall information that the assistant itself provided in a prior conversation? This tests whether the system treats its own outputs as part of the memory corpus, not just user utterances.
Single-session (preferences). Can the system recall user preferences expressed during conversation? Preferences are often stated indirectly or in passing, making them harder to extract and index than explicit facts.
Multi-session reasoning. Can the system synthesise information across multiple separate conversations to answer a question? This is substantially harder than single-session recall. The answer doesn't live in any one place; it must be assembled from fragments scattered across different sessions.
Temporal reasoning. Can the system reason about when things happened, what changed over time, or what the most recent state of a fact is? This requires not just retrieving information but understanding its temporal position and resolving contradictions between older and newer statements.
Knowledge update. Can the system recognise when previously shared information has changed? Users update facts about themselves across conversations: changing jobs, moving cities, switching tools. The system must identify that conflicting statements exist and resolve them in favour of the most recent.
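To make the knowledge-update case concrete, here is a minimal sketch of recency-based conflict resolution. The `Memory` record and its field names are our own illustration, not part of the benchmark or of M-1:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Memory:
    """A single extracted fact with the time it was stated (illustrative)."""
    subject: str        # e.g. "user.employer"
    value: str          # e.g. "Acme Corp"
    stated_at: datetime

def resolve_current_value(memories: list[Memory], subject: str) -> Optional[str]:
    """Return the most recently stated value for a subject.

    Conflicting statements ("I work at Acme" ... "I just joined Initech")
    are resolved in favour of the newest one, which is what the
    knowledge-update category expects.
    """
    candidates = [m for m in memories if m.subject == subject]
    if not candidates:
        return None
    return max(candidates, key=lambda m: m.stated_at).value

history = [
    Memory("user.employer", "Acme Corp", datetime(2025, 1, 10)),
    Memory("user.employer", "Initech", datetime(2025, 6, 2)),
]
assert resolve_current_value(history, "user.employer") == "Initech"
```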
The benchmark is scored by presenting retrieved memory context to a language model alongside each question, then using an LLM judge to evaluate whether the answer is correct. Results are reported at multiple retrieval depths (top-k), measuring how effectively the system can surface relevant memories within a limited context window. Consistent performance at smaller cutoffs validates the retrieval architecture itself, rather than relying on a large context window to compensate for imprecise recall.
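Schematically, scoring a single question at depth k looks like the sketch below. The `retrieve` and `llm` callables are placeholders for the memory system's retrieval call and a model completion call; this is an approximation of the procedure, not the benchmark's actual harness:

```python
from typing import Callable

def evaluate_question(
    question: str,
    expected: str,
    retrieve: Callable[[str, int], list[str]],  # memory system's top-k retrieval
    llm: Callable[[str], str],                  # one LLM completion call
    k: int = 50,
) -> bool:
    """Score one LongMemEval-style question at retrieval depth k."""
    # 1. Surface the top-k memories for this question.
    memories = retrieve(question, k)

    # 2. Answer using only the retrieved context.
    answer = llm(
        "Answer the question from the memories below.\n\n"
        + "\n".join(f"- {m}" for m in memories)
        + f"\n\nQuestion: {question}"
    )

    # 3. An LLM judge compares the answer against the expected one.
    verdict = llm(
        f"Question: {question}\nExpected answer: {expected}\n"
        f"Candidate answer: {answer}\n"
        "Reply 'yes' if the candidate answer is correct, otherwise 'no'."
    )
    return verdict.strip().lower().startswith("yes")
```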
LongMemEval is not without limitations, which we discuss in Section 5. Its strength is in its needle-in-a-haystack design: relevant evidence is sparse and scattered across multiple sessions within a large body of irrelevant conversation, and the benchmark tests multiple distinct memory capabilities. The harder categories, particularly multi-session synthesis and temporal reasoning, remain genuinely difficult.
3. Methodology
3.1 Evaluation Setup
We built our evaluation harness by forking the open-source benchmarking script released by Mem0 (Mem0, 2025). We made the following modifications:
Storage and retrieval. The benchmark's storage and retrieval layers were replaced entirely with M-1's retrieval engine.
Prompt simplification. The source benchmark script (from Mem0) contained question-category-specific prompt templates and targeted heuristics designed to improve performance on particular question types. We removed all of these and used a single, uniform prompt across all 500 questions. Our interest was in measuring M-1's retrieval quality, not in optimising the answering prompt per question category. A single, general prompt is also a realistic representation of how the memory capability would be used in production: it must be robust to all potential inputs rather than pre-optimised for a known set. You can view the prompt here; a rough approximation of its shape is sketched after this list.
Answering and judging model. We used a smaller model, Gemini 3 Flash, for both answering questions and judging correctness. This choice is discussed further in Section 3.2.
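For illustration, a uniform prompt of the kind described above has roughly the following shape. This is a hedged approximation written for this paper, not the exact prompt linked above; the template text and helper are our own:

```python
UNIFORM_PROMPT = """\
You are answering a question using the user's conversation history.
Below are memories retrieved from past sessions, each with a timestamp.
Use only these memories. If they conflict, prefer the most recent.
If the memories do not contain the answer, say so.

Memories:
{memories}

Question: {question}
"""

def build_prompt(memories: list[tuple[str, str]], question: str) -> str:
    """Fill the single template used for every question and category."""
    lines = [f"[{timestamp}] {text}" for timestamp, text in memories]
    return UNIFORM_PROMPT.format(memories="\n".join(lines), question=question)
```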
3.2 Model Choice
Several other systems reporting LongMemEval results have used Gemini 3 Pro or comparably large frontier models as their answering model. Larger models can compensate for weaker retrieval: given a noisy or partially relevant context window, a more powerful model is better at extracting the right answer despite imperfect inputs.
We deliberately chose Gemini 3 Flash, a substantially smaller and cheaper model. Our reasoning was straightforward: M-1 is designed for production use, and in production, cost and latency matter. A memory system that requires a frontier-scale model to deliver good results is not a production memory system; it is a research demo. All results reported in this paper reflect the system as it would actually be deployed: the same model, the same retrieval pipeline, the same configuration.
3.3 System Overview
M-1's architecture is informed by principles from cognitive science and memory research. Rather than treating memory as a log to be searched, M-1 treats it as a reconstructive, multi-stage process, drawing on ideas from episodic memory theory (Tulving, 1972), reconstructive recall (Bartlett, 1932), and temporal context models (Howard & Kahana, 2002).
The retrieval pipeline operates in three phases:
First, candidate memories are scored against the query using a combination of semantic similarity, lexical precision, and temporal salience. This formulation reflects the interaction between associative and temporal factors observed in human recall (Ebbinghaus, 1885; Polyn et al., 2009). More recent memories are more accessible, but older memories of high relevance persist.
Second, queries are rewritten and decomposed into multiple parallel retrieval passes, each targeting a distinct information need. This is critical for multi-session questions, where no single query would surface all relevant fragments. The results are assembled into a unified context, mirroring the reconstructive nature of episodic recall, where complex memories are pieced together from multiple traces rather than retrieved whole (Bartlett, 1932; Schacter et al., 2012).
Third, retrieved candidates are re-ranked using importance scoring, temporal chain resolution, and cross-memory coherence, ensuring that the final context presented to the answering model is not just relevant but internally consistent and correctly ordered.
At a high level, the final retrieval score for a memory mᵢ given query q can be characterised as:

score(mᵢ, q) = Φ( Σⱼ wⱼ · [ Sₛₑₘ(mᵢ, qⱼ) + Sₗₑₓ(mᵢ, qⱼ) ], T(mᵢ, A), I(mᵢ), C(mᵢ, M) )

where qⱼ are decomposed sub-queries weighted by wⱼ; Sₛₑₘ and Sₗₑₓ represent semantic and lexical similarity; T is a temporal salience function over timestamps and anchor events A; I is an importance score; C measures coherence of mᵢ against the broader retrieved set M; and Φ is the re-ranking function that combines these signals into a final score.
We intentionally limit the architectural detail disclosed here. The formulation above captures the general shape of M-1's retrieval scoring; the specific implementations, learned parameters, and additional mechanisms are proprietary.
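For concreteness, here is a toy instance of that generic shape, with a linear Φ and exponential decay standing in for M-1's proprietary components. Every weight, name, and component function below is illustrative, not M-1's implementation:

```python
import math
from typing import Callable

def retrieval_score(
    memory_text: str,
    age_days: float,
    importance: float,                          # stand-in for the I term
    coherence: float,                           # stand-in for C(mi, M)
    sub_queries: list[tuple[str, float]],       # (sub-query qj, weight wj) pairs
    sem: Callable[[str, str], float],           # semantic similarity, S_sem
    lex: Callable[[str, str], float],           # lexical similarity, S_lex
    half_life_days: float = 30.0,
) -> float:
    """Toy scoring function: weighted per-sub-query similarity, exponential
    temporal decay, importance, and coherence, combined with a linear Phi
    (the simplest possible choice of re-ranking function)."""
    similarity = sum(
        w * (sem(memory_text, q) + lex(memory_text, q))
        for q, w in sub_queries
    )
    # More recent memories are more accessible...
    temporal = math.exp(-age_days * math.log(2) / half_life_days)
    # ...but an old memory with high similarity can still score well overall.
    return similarity + temporal + importance + coherence
```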
4. Results
4.1 Overall Performance
M-1 achieves state-of-the-art results on LongMemEval across all evaluated retrieval depths.
| Retrieval Depth | Overall Accuracy |
|---|---|
| top-10 | 90.8% |
| top-20 | 93.4% |
| top-50 | 96.4% |
| top-200 | 95.4% |
Performance peaks at top-50 (96.4%) and slightly declines at top-200. This is consistent with the expected behaviour of a well-tuned retrieval system: at greater depths, additional context introduces noise that can degrade answering accuracy, even as recall coverage increases.
The fact that performance peaks at a moderate depth rather than continuing to improve suggests that M-1's ranking is effective, surfacing the most relevant memories early rather than relying on brute-force context volume.
4.2 Performance by Category
| Category | top-10 | top-20 | top-50 | top-200 |
|---|---|---|---|---|
| Knowledge update | 96.2% | 97.4% | 97.4% | 97.4% |
| Multi-session | 85.0% | 88.7% | 94.0% | 91.7% |
| Single-session (assistant) | 100.0% | 100.0% | 100.0% | 100.0% |
| Single-session (preference) | 93.3% | 93.3% | 96.7% | 96.7% |
| Single-session (user) | 100.0% | 95.7% | 98.6% | 98.6% |
| Temporal reasoning | 84.2% | 91.7% | 95.5% | 94.0% |
Single-session categories are near-ceiling across all depths, which is expected: the relevant information exists in a single location within the conversational history.
This consistent performance across top-k cutoffs indicates robustness in Exabase's retrieval architecture: if the score were optimal only at top-200, for example, that would suggest the model's context size was carrying the result rather than the retrieval system. That is not the case here.
Multi-session reasoning is the most challenging category, peaking at 94.0% at top-50. These questions require assembling information from multiple conversations, which tests both retrieval coverage and the system's ability to surface complementary fragments rather than redundant ones.
Temporal reasoning shows strong improvement from top-10 (84.2%) to top-50 (95.5%), suggesting that temporal questions benefit from additional context that helps establish chronological ordering.
4.3 Comparative Results
The table below compares M-1's results against other systems that have reported LongMemEval scores. Where possible, we compare on the metric each system reported.
You can download the results JSON here.
| System | Model | Score |
|---|---|---|
| M-1 (Exabase) | Gemini 3 Flash | 96.4% |
| Mem0 | Gemini 3 Pro | 94.8% |
| HydraDB | Gemini 3 Pro | 90.79% |
| Supermemory | Gemini 3 Pro | 85.2% |
| Honcho | Gemini 3 Pro | 92.6% |
Note that all competing systems used Gemini 3 Pro, a significantly larger and more expensive model.
5. Analysis
5.1 Retrieval Architecture vs. Model Scale
The most notable aspect of these results is the model used to achieve them. M-1 reaches state-of-the-art accuracy using Gemini 3 Flash, while competing systems report results using Gemini 3 Pro, a model that is significantly more expensive and slower at inference.
Using a larger answering model can paper over weaknesses in retrieval. If the retrieved context is noisy or only partially relevant, a more powerful model is better equipped to extract the correct answer despite the noise (or take a best-guess). This means that results achieved with larger models are, in part, a measure of the model's reasoning capacity rather than the memory system's retrieval quality.
M-1's results with a smaller model suggest that the retrieval pipeline itself is doing more of the work: surfacing cleaner, more precisely targeted context that a smaller model can act on directly. This is the design philosophy we optimised for. In production, the cost and latency difference between Flash and Pro-tier models is substantial, often an order of magnitude, and compounds with every query.
5.2 Multi-Session Synthesis
Multi-session reasoning is the hardest category in LongMemEval, and it is where architectural differences between memory systems are most visible. Recalling a single fact from a single conversation is a relatively straightforward retrieval problem. Answering a question that requires synthesising information from multiple separate conversations is fundamentally harder: the system must identify that multiple fragments are relevant, retrieve all of them, and present them in a context that allows the answering model to connect them.
M-1's multi-query decomposition approach is designed specifically for this problem. By decomposing a complex query into multiple parallel retrieval passes, the system can target each information need independently, then assemble the results. This directly mirrors how the benchmark's multi-session questions are constructed: the answer depends on facts from different sessions, and no single retrieval query would surface all of them.
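A minimal sketch of this decompose-retrieve-assemble pattern follows. The `decompose` callable is a placeholder for a query rewriter; M-1's actual rewriting and assembly logic is not shown:

```python
from typing import Callable

def multi_session_retrieve(
    question: str,
    decompose: Callable[[str], list[str]],      # query rewriter (placeholder)
    retrieve: Callable[[str, int], list[str]],  # single-pass top-k retrieval
    k_per_query: int = 10,
) -> list[str]:
    """Decompose a complex question into sub-queries, run one retrieval
    pass per sub-query, and assemble a deduplicated unified context."""
    sub_queries = decompose(question)

    seen: set[str] = set()
    context: list[str] = []
    for sq in sub_queries:
        # Each pass targets one information need; no single query would
        # surface fragments scattered across different sessions.
        for memory in retrieve(sq, k_per_query):
            if memory not in seen:  # prefer complementary over redundant fragments
                seen.add(memory)
                context.append(memory)
    return context
```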
5.3 Evaluation Ceiling
At accuracy levels above ~96%, we began encountering a consistent ceiling effect driven not by retrieval failures but by artefacts in the benchmark itself. These fall into several categories:
Ambiguous questions. Some questions admit multiple reasonable interpretations, and the expected answer reflects only one. A system that retrieves correct and relevant information but interprets the question differently will be scored as incorrect.
Overly narrow expected answers. In several cases, the LLM answering model produced responses that were arguably more insightful or complete than the benchmark's expected answer, drawing deeper inferences from the retrieved context. These were scored as failures because they didn't match the expected response closely enough.
Inconsistencies in the dataset. We identified at least four questions in the benchmark with errors and reported two upstream (the other two issues were already reported by others). We identified additional questions with issues we have not yet formally reported. These include contradictions within the conversational history and questions whose expected answers don't follow from the provided context – something no memory system can solve for.
It is worth noting that we are not the first to encounter this ceiling. The Mem0 benchmarking script included question-specific prompt modifications and targeted heuristics, some of which appeared designed to work around known issues with particular questions. This approach can improve headline scores but conflates benchmark optimisation with genuine retrieval improvement. We chose instead to use a generic prompt and accept the score penalty, treating the benchmark's noise floor as a known limitation rather than something to be engineered around.
6. Limitations
Single benchmark. We evaluate on LongMemEval only. While it is the most widely evaluated-against benchmark and well-suited to testing selective recall under extreme noise, other benchmarks, particularly BEAM (Tavakoli et al., 2025), test memory at larger token volumes and across different task types. We intend to test against these in the future.
Benchmark noise. As discussed in Section 5.3, the benchmark itself contains errors and ambiguities that create a noise floor. Our score absorbs these errors in the spirit of equal benchmark comparison, including the cases we identified; others likely remain.
Controlled conditions. Benchmark evaluation is inherently controlled. Production memory systems must handle adversarial inputs, contradictory information, privacy constraints, and graceful degradation, none of which are tested here.
Single model configuration. Our results reflect one model (Gemini 3 Flash) and our latest retrieval configuration. We have not yet reported results across multiple models or ablation studies isolating individual components of the retrieval pipeline.
7. Conclusion
M-1 achieves state-of-the-art results on LongMemEval while operating under constraints that reflect real-world production conditions: a smaller, faster, cheaper model; no question-specific prompt tuning; and limited token usage. This represents a Pareto-frontier improvement over existing approaches, and suggests that retrieval architecture, not model scale, is the primary determinant of memory system quality.
Long-term memory for AI systems remains an open and under-explored problem. We believe that rigorous, transparent evaluation is a prerequisite for meaningful progress, and we hope that by sharing our methodology and results in detail, we contribute to raising the standard for how memory systems are evaluated and compared.
M-1 is our first-generation memory engine. We are continuing to develop the system and expect to report further results in the coming months.
References
Bartlett, F.C. (1932). Remembering: A Study in Experimental and Social Psychology. Cambridge University Press.
Ebbinghaus, H. (1885). Über das Gedächtnis: Untersuchungen zur experimentellen Psychologie. Duncker & Humblot.
Howard, M.W. & Kahana, M.J. (2002). A distributed representation of temporal context. Journal of Mathematical Psychology, 46(3), 269-299.
Mem0 (2025). LongMemEval benchmarking framework. GitHub repository. [URL]
Polyn, S.M., Norman, K.A., & Kahana, M.J. (2009). A context maintenance and retrieval model of organizational processes in free recall. Psychological Review, 116(1), 129-156.
Schacter, D.L., Addis, D.R., Hassabis, D., Martin, V.C., Spreng, R.N., & Szpunar, K.K. (2012). The future of memory: Remembering, imagining, and the brain. Neuron, 76(4), 677-694.
Tulving, E. (1972). Episodic and semantic memory. In E. Tulving & W. Donaldson (Eds.), Organization of Memory (pp. 381-403). Academic Press.
Wu, D., Wang, H., Yu, W., Zhang, Y., Chang, K.-W., & Yu, D. (2024). LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. arXiv:2410.10813.