What LongMemEval actually measures and where it breaks down

Jonathan Bree


LongMemEval is the most widely cited benchmark for agent memory retrieval. Here is what it tests, where it is genuinely rigorous, and where its limitations matter for interpreting results.


Benchmarks are only useful if you understand what they measure and what they do not. LongMemEval has become the de facto standard for evaluating long-term memory in conversational AI systems. Most memory platforms that publish benchmark results cite it. Most comparisons between systems reference it. Understanding what it actually tests, and where it stops being a reliable signal, is essential for making sense of those comparisons.


What LongMemEval is

LongMemEval was introduced by Wu et al. in 2024 as a benchmark for evaluating long-term memory in conversational AI systems. Its core design is a needle-in-a-haystack problem at scale: approximately 115,000 tokens of conversational history spread across multiple sessions, with relevant information scattered sparsely among a much larger volume of irrelevant dialogue.

The benchmark contains 500 questions. Each question requires the system to retrieve the right information from within that large, noisy corpus and answer correctly. Results are reported at multiple retrieval depths (top-10, top-20, top-50, and top-200), measuring how effectively a system surfaces relevant memories within a limited context window.
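To make retrieval depth concrete, here is a minimal sketch of recall at depth k over a ranked list of memories. It is not the benchmark's scoring code. The `recall_at_k` and `recall_by_depth` helpers, the memory ids, and the example ranking are all hypothetical, and the sketch assumes the evidence ids for a question are known.

```python
from typing import Dict, List, Set, Tuple

DEPTHS: Tuple[int, ...] = (10, 20, 50, 200)  # retrieval depths named in the text


def recall_at_k(ranked_ids: List[str], evidence_ids: Set[str], k: int) -> float:
    """Fraction of the evidence items that appear in the top-k ranked results."""
    if not evidence_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & evidence_ids) / len(evidence_ids)


def recall_by_depth(
    ranked_ids: List[str],
    evidence_ids: Set[str],
    depths: Tuple[int, ...] = DEPTHS,
) -> Dict[int, float]:
    """Recall at each requested depth for a single question."""
    return {k: recall_at_k(ranked_ids, evidence_ids, k) for k in depths}


# Hypothetical ranking of 300 memories; the evidence sits at ranks 3, 15 and 120.
ranked = [f"m{i}" for i in range(300)]
evidence = {"m2", "m14", "m119"}

scores = recall_by_depth(ranked, evidence)
print(scores)
```

In this made-up case only one of the three evidence items is visible at top-10, and all three only appear by top-200, which is the kind of gap the depth breakdown is meant to expose.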

The design is deliberately challenging. Relevant evidence is not clustered. It is sparse, often expressed in passing, and surrounded by sessions of unrelated conversation. A system that performs well on LongMemEval has demonstrated genuine retrieval quality under realistic noise conditions.


What it actually tests

LongMemEval covers six distinct memory categories, each testing a different capability.

Single-session recall (user information) tests whether the system can retrieve facts a user explicitly stated about themselves in a past conversation. This is the most basic form of memory: a user mentioned their dietary restriction, their daughter's name, their job title. Can the system find it?

Single-session recall (assistant-provided) tests whether the system treats its own prior outputs as part of the memory corpus. Not just what the user said, but what the system said in response. This is a subtle and often overlooked capability.

Single-session recall (preferences) tests whether the system can retrieve user preferences, which are often stated indirectly or in passing rather than as explicit declarations. These are harder to extract and index than direct factual statements.

Multi-session reasoning is where the benchmark gets genuinely hard. These questions cannot be answered from any single conversation. The answer is distributed across multiple sessions: fragments from different times that must be identified, retrieved, and assembled into a coherent response. This tests both retrieval coverage and the system's ability to surface complementary fragments rather than redundant ones.

Temporal reasoning tests whether the system can reason about when things happened, what changed over time, and what the most recent state of a fact is. This requires not just retrieving information but understanding its temporal position and resolving contradictions between older and newer statements. See what is temporal memory for a deeper treatment.

Knowledge update tests whether the system recognises when previously shared information has changed. Users change jobs, move cities, update preferences. The system must identify that conflicting statements exist and resolve them in favour of the most recent. This is memory drift prevention tested directly.
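The knowledge update case can be sketched in a few lines. This is a simplified illustration, not how any particular memory system works: it assumes facts have already been extracted into (attribute, value, date) records, which is the hard part in practice. The `Fact` type, the `latest_facts` helper, and the example history are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass(frozen=True)
class Fact:
    attribute: str   # what the statement is about, e.g. "city"
    value: str       # what the user said
    stated_on: date  # when the user said it


def latest_facts(facts: List[Fact]) -> Dict[str, Fact]:
    """Keep, for each attribute, only the most recently stated fact."""
    current: Dict[str, Fact] = {}
    for fact in facts:
        previous = current.get(fact.attribute)
        if previous is None or fact.stated_on > previous.stated_on:
            current[fact.attribute] = fact
    return current


# Hypothetical history: the user moves city between sessions.
history = [
    Fact("city", "Lisbon", date(2023, 3, 1)),
    Fact("job", "designer", date(2023, 4, 10)),
    Fact("city", "Berlin", date(2024, 1, 15)),
]

current = latest_facts(history)
print(current["city"].value)  # Berlin
print(current["job"].value)   # designer
```

The older statement is not wrong, only superseded. A system that retrieves "Lisbon" has found real evidence and still answers incorrectly, which is why this category is scored separately from plain recall.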


Where LongMemEval is genuinely strong

The benchmark's strength is in the combination of scale, noise, and category diversity.

Single-session recall on its own is not a hard problem. Any competent retrieval system will perform well on it. LongMemEval makes it harder by embedding the relevant information in 115,000 tokens of irrelevant context. The noise is the point. A system that scores well on single-session categories under these conditions has demonstrated that its retrieval can find a needle in a haystack, not just retrieve from a clean, small corpus.

Multi-session reasoning and temporal reasoning are where most systems fall apart, and where the benchmark is most valuable as a signal. These categories test capabilities that framework memory modules do not attempt and that vector search alone cannot address. A system that scores well on these categories has demonstrated genuine architectural capability, not just retrieval quality on easy problems.

Reporting at multiple retrieval depths is also useful. A system that peaks at top-200 but performs poorly at top-10 is likely relying on context volume rather than retrieval precision. A system that performs consistently across depths has demonstrated that its ranking is effective: it surfaces relevant memories early rather than relying on a large context window to compensate.
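One rough way to read a depth breakdown is to look at the gap between the deepest and shallowest settings. The sketch below assumes accuracy is available per depth. The `depth_gap` helper is hypothetical, and the numbers are made up for illustration; they are not results from any real system.

```python
from typing import Dict


def depth_gap(accuracy_by_depth: Dict[int, float]) -> float:
    """Accuracy at the deepest setting minus accuracy at the shallowest."""
    shallowest = min(accuracy_by_depth)
    deepest = max(accuracy_by_depth)
    return accuracy_by_depth[deepest] - accuracy_by_depth[shallowest]


# Made-up numbers for two hypothetical systems, for illustration only.
volume_reliant = {10: 0.70, 20: 0.80, 50: 0.88, 200: 0.95}
precise_ranker = {10: 0.93, 20: 0.94, 50: 0.95, 200: 0.95}

print(round(depth_gap(volume_reliant), 2))  # 0.25
print(round(depth_gap(precise_ranker), 2))  # 0.02
```

Both hypothetical systems reach the same score at top-200. The gap is what separates the one that needs a large context to get there from the one that ranks well early.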


Where it breaks down

At accuracy levels above roughly 96%, the benchmark's own limitations become the primary constraint on further improvement. This is not a retrieval failure. It is a property of the benchmark itself.

Ambiguous questions. Some questions admit multiple reasonable interpretations. The expected answer reflects only one. A system that retrieves correct and relevant information but interprets the question differently will be scored as incorrect, even if its answer is defensible.

Overly narrow expected answers. In several cases, an answering model that draws deeper inferences from the retrieved context and produces a more complete or nuanced answer than expected is scored as incorrect, because its answer does not match the expected response closely enough. The judge penalises insight.

Dataset inconsistencies. The benchmark contains questions with errors: contradictions within the conversational history, questions whose expected answers do not follow from the provided context, and cases where no memory system could answer correctly regardless of retrieval quality. Together they form an error floor that costs every system a fixed share of its score.

In Exabase's evaluation of M-1, at least four questions with errors were identified. Two were reported upstream. Additional issues were identified but have not yet been formally reported. These benchmark errors set the ceiling above which no system can improve through better retrieval alone.
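The arithmetic behind that ceiling is simple enough to write down. The sketch below uses the two figures from the text, 500 questions and four with known errors, and assumes that every flawed question is scored as wrong. Since the text says "at least four", the true ceiling may be lower. The `accuracy_ceiling` helper is hypothetical.

```python
TOTAL_QUESTIONS = 500  # benchmark size, from the text
FLAWED_QUESTIONS = 4   # "at least four" in the text, so treat this as a lower bound


def accuracy_ceiling(total: int, flawed: int) -> float:
    """Best possible accuracy if every flawed question is scored as wrong."""
    return (total - flawed) / total


ceiling = accuracy_ceiling(TOTAL_QUESTIONS, FLAWED_QUESTIONS)

# At 96% accuracy a system gets 480 of 500 right, so it misses 20 questions.
misses_at_96 = TOTAL_QUESTIONS - 480
flawed_share_of_misses = FLAWED_QUESTIONS / misses_at_96

print(f"ceiling: {ceiling:.1%}")                         # ceiling: 99.2%
print(f"share of misses: {flawed_share_of_misses:.0%}")  # share of misses: 20%
```

Under these assumptions, a fifth of the remaining misses at 96% would be attributable to the benchmark rather than the system, and each further question is worth only 0.2 percentage points.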


What it does not test

Understanding LongMemEval's limitations is as important as understanding its strengths.

It is a single benchmark. LongMemEval tests selective recall under noise across six categories. Other benchmarks test different things. BEAM tests memory at larger token volumes. LoCoMo tests across different task types and uses word-overlap scoring that makes cross-system comparison unreliable. No single benchmark covers the full problem space of production agent memory.

It uses controlled conditions. Production memory systems must handle adversarial inputs, contradictory information fed deliberately, privacy constraints, graceful degradation under unusual queries, and real users who communicate in ways that benchmarks do not anticipate. None of this is tested.

It does not test production performance. Latency, cost at scale, behaviour under concurrent load, and reliability over months of real-world use are not measured. A system can score well on LongMemEval and still be impractical in production for cost or latency reasons.


How to read benchmark results

The most important caveat when comparing LongMemEval results across systems is model choice. A larger answering model can compensate for weaker retrieval by extracting correct answers from noisy or partially relevant context. Results achieved with Gemini 3 Pro are, in part, a measure of the model's reasoning capacity. Results achieved with Gemini 3 Flash reflect more directly on the retrieval system itself.

Comparing a Flash result to a Pro result is not an apples-to-apples comparison. The model used should always be reported alongside the score. When it is not, that is worth noting.

Prompt methodology matters too. A benchmarking script that includes question-category-specific prompt templates and targeted heuristics for known question types measures benchmark optimisation, not production retrieval quality. In Exabase's evaluation, M-1 was tested with a single uniform prompt across all 500 questions, the same configuration that would be used in production. The prompt is published and reproducible.

The right way to use LongMemEval results is as a directional signal, not an absolute ranking. A system that scores 96% with a smaller model and a uniform prompt is doing something meaningfully different from a system that scores 94% with a larger model and category-specific prompts. The numbers look similar. The implications are not.
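One way to keep these caveats attached to the number is to record the configuration with every score and refuse to compare results directly when it differs. The sketch below is an illustration of that habit, not an established reporting format. The `BenchmarkResult` type, the `directly_comparable` check, and the system and model names are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkResult:
    system: str
    accuracy: float
    answering_model: str  # should always be reported alongside the score
    prompt_style: str     # "uniform" or "category-specific"


def directly_comparable(a: BenchmarkResult, b: BenchmarkResult) -> bool:
    """Two scores are like-for-like only if model and prompt setup match."""
    return (
        a.answering_model == b.answering_model
        and a.prompt_style == b.prompt_style
    )


# The scenario from the text, with placeholder system and model names.
a = BenchmarkResult("system-a", 0.96, "smaller-model", "uniform")
b = BenchmarkResult("system-b", 0.94, "larger-model", "category-specific")

print(directly_comparable(a, b))  # False
```

The two-point difference in accuracy says little on its own here, because both the answering model and the prompt setup differ between the two runs.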

For the full methodology, results by category, and a discussion of ceiling effects, see the Exabase M-1 research paper. For context on what these categories mean for production agent memory, see what is agent memory.