Blog

Exabase M-1 achieves state of the art on LongMemEval

Jonathan Bree

M-1, our first-generation memory engine, just achieved the highest reported score on LongMemEval, the main public benchmark for conversational memory.

M-1 reached 96.4% accuracy at top-50 retrieval, outperforming every other reported system.

The more important result is that we achieved this with a smaller, cheaper model.

We used Gemini 3 Flash, while comparable systems on the leaderboard reported their best results with Gemini 3 Pro.

That distinction matters because larger models can often compensate for weaker retrieval. When the retrieved context is noisy or only partly relevant, a bigger model is more likely to extract the right answer anyway. That can make a system appear to have strong memory, even when expensive inference is doing much of the work.

M-1 doesn't need a bigger model: our encoding and retrieval approach tops the leaderboard without one.

All this makes the result more relevant for real-world deployment. This isn't a cost-prohibitive research setup; it's a memory system built to run at real, production scale.

LongMemEval is a difficult benchmark: roughly 115,000 tokens of conversational history, with relevant information spread sparsely across sessions and buried in noise. It tests six different memory capabilities: recalling user facts, recalling preferences, recalling things the assistant said, synthesizing across sessions, temporal reasoning, and recognizing when information has changed.
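To make "accuracy at top-k retrieval" concrete, here is a minimal, hypothetical sketch of how a retriever can be scored over conversational sessions. Toy bag-of-words vectors stand in for real learned embeddings, and all names (`embed`, `topk_sessions`, `recall_at_k`) are illustrative assumptions, not part of M-1's actual pipeline or the official LongMemEval harness:

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; real systems use learned vectors."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def topk_sessions(query: str, sessions: list[str], k: int) -> list[int]:
    """Indices of the k sessions most similar to the query."""
    q = embed(query)
    ranked = sorted(range(len(sessions)),
                    key=lambda i: cosine(q, embed(sessions[i])),
                    reverse=True)
    return ranked[:k]

def recall_at_k(queries: list[str], gold: list[int],
                sessions: list[str], k: int) -> float:
    """Fraction of queries whose evidence session lands in the top k."""
    hits = sum(gold[i] in topk_sessions(q, sessions, k)
               for i, q in enumerate(queries))
    return hits / len(queries)
```

In a benchmark-style harness, each question comes with the session that contains its evidence; retrieval quality is then the fraction of questions for which that session is surfaced within the top-k candidates handed to the answering model.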

For Exabase, this is an important milestone. M-1 shows that our approach to memory can deliver state-of-the-art results efficiently, in a system built for real-world use.

We’ll keep improving it and share results on additional benchmarks in the coming months.

Read our full research paper, results, and methodology here.