Blog

Exabase M-1 achieves state of the art on LongMemEval

Jonathan Bree

M-1, our first-generation memory engine, just achieved the highest reported score on LongMemEval, the main public benchmark for conversational memory.

M-1 reached 96.4% accuracy at top-50 retrieval, outperforming every other reported system.

The more important result is that we achieved this with a smaller, cheaper model.

We used Gemini 3 Flash, while comparable systems on the leaderboard reported their best results with Gemini 3 Pro.

That distinction matters because larger models can often compensate for weaker retrieval. When the retrieved context is noisy or only partly relevant, a bigger model is more likely to extract the right answer anyway. That can make a system appear to have strong memory, even when expensive inference is doing much of the work.

M-1 doesn't need a bigger model: our encoding and retrieval approach tops the leaderboard without one.

All this makes the result more relevant for real-world deployment. This isn't a cost-prohibitive research setup; it's a memory system built to run at real, production scale.

LongMemEval is a difficult benchmark: roughly 115,000 tokens of conversational history, with relevant information spread sparsely across sessions and buried in noise. It tests six different memory capabilities: recalling user facts, recalling preferences, recalling things the assistant said, synthesizing across sessions, temporal reasoning, and recognizing when information has changed.
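To make "accuracy at top-k retrieval" concrete, here is a minimal, hypothetical sketch of how a retriever can be scored over conversational sessions. Toy bag-of-words vectors stand in for real learned embeddings, and all names (`embed`, `topk_sessions`, `recall_at_k`) are illustrative assumptions, not part of M-1's actual pipeline or the official LongMemEval harness:

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; real systems use learned vectors."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def topk_sessions(query: str, sessions: list[str], k: int) -> list[int]:
    """Indices of the k sessions most similar to the query."""
    q = embed(query)
    ranked = sorted(range(len(sessions)),
                    key=lambda i: cosine(q, embed(sessions[i])),
                    reverse=True)
    return ranked[:k]

def recall_at_k(queries: list[str], gold: list[int],
                sessions: list[str], k: int) -> float:
    """Fraction of queries whose evidence session lands in the top k."""
    hits = sum(gold[i] in topk_sessions(q, sessions, k)
               for i, q in enumerate(queries))
    return hits / len(queries)
```

In a benchmark-style harness, each question comes with the session that contains its evidence; retrieval quality is then the fraction of questions for which that session is surfaced within the top-k candidates handed to the answering model.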

For Exabase, this is an important milestone. M-1 shows that our approach to memory can deliver state-of-the-art results efficiently, in a system built for real-world use.

We’ll keep improving it and share results on additional benchmarks in the coming months.

Read our full research paper, results, and methodology here.