
Semantic search vs keyword search for AI agents

Semantic search and keyword search fail in opposite ways. That is not a coincidence. It is an argument for combining them.

Jonathan Bree

Every retrieval system makes a choice about how to find relevant information. Most make the wrong one. Here is why neither semantic search nor keyword search is enough on its own, and what happens when you combine them.

When developers build retrieval into an AI agent, they usually reach for one of two approaches. Semantic search, powered by embeddings and vector similarity, finds things by meaning. Keyword search, powered by lexical matching, finds things by exact terms. Both are useful. Both fail in ways that matter for production agents. And the failure modes are almost perfectly complementary.

Where semantic search breaks down

Semantic search is good at finding things that mean similar things, even when the exact words differ. A user who described their role as "product lead" in January will have that memory surface when they ask about "product management responsibilities" in March. The meaning connects even though the vocabulary does not.

The failure mode is precision. Semantic search optimises for conceptual similarity, which means it trades exactness for coverage. When a query contains specific identifiers, proper nouns, version numbers, error codes, or exact phrases that matter, semantic search will find things that are conceptually adjacent rather than literally matching.

Consider a concrete example. A developer tells their agent they are running into a specific error: "TypeError: Cannot read properties of undefined (reading 'map')". Later they ask the agent to recall what errors they flagged last week. A semantic search will surface memories about errors and debugging because those are conceptually similar. It may or may not surface the exact error message, depending on how crowded the embedding space is around that concept. A keyword search for the error text would find it immediately.
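
The lexical half of that example is easy to sketch. The snippet below is a minimal illustration, not how any particular product stores memories: the `memories` list and the `exact_match` helper are hypothetical names invented for this post, and the lookup is a plain case-insensitive substring test.

```python
# Hypothetical stored memories; the wording is invented for illustration.
memories = [
    "User hit TypeError: Cannot read properties of undefined (reading 'map') in the dashboard",
    "User spent an afternoon debugging a flaky test",
    "User asked how to log errors in production",
]


def exact_match(query: str, items: list[str]) -> list[str]:
    """Return every item that contains the query as a literal substring."""
    needle = query.casefold()
    return [item for item in items if needle in item.casefold()]


hits = exact_match("Cannot read properties of undefined", memories)
```

Only the first memory comes back, however many other debugging-related memories sit nearby. The literal string either appears or it does not.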

The same problem appears with names, dates, product identifiers, API endpoint names, and any other information where exact string matching matters more than conceptual similarity. Semantic search was not designed for precision retrieval. At scale, in a knowledge base full of related memories, this becomes a meaningful source of retrieval failure.

Where keyword search breaks down

Keyword search is precise. It finds exactly what you ask for, provided you use the same words. That is also its failure mode.

Language is not consistent across time or across people. A user who described a preference as "I like concise answers" in one session may ask for "brief responses" in another. A keyword search finds nothing. The words do not match. The memory is there but the retrieval system cannot connect the query to it.
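
A toy version of that miss, assuming the simplest possible lexical scorer: the fraction of query words that also appear in the memory. The `tokens` and `keyword_overlap` helpers are hypothetical, and real keyword engines use richer scoring, but the vocabulary problem is the same.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_overlap(query: str, memory: str) -> float:
    """Fraction of query words that also occur in the memory (0.0 to 1.0)."""
    query_words = tokens(query)
    if not query_words:
        return 0.0
    return len(query_words & tokens(memory)) / len(query_words)


memory = "I like concise answers"
same_words = keyword_overlap("concise answers", memory)
paraphrase = keyword_overlap("brief responses", memory)
```

The query that reuses the stored words matches fully. The paraphrase shares no words with the memory, so it scores zero even though it means the same thing.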

This problem compounds in agent memory specifically because the system is storing things users said, not things the system prompted them to say. Users are inconsistent: they paraphrase themselves, use synonyms, and describe the same concept in different ways across different conversations. A retrieval system that depends on exact term matching will miss a significant fraction of relevant memories simply because the vocabulary drifted.

Keyword search also struggles with intent. A user asking "what did we decide about the timeline" is not searching for the word "timeline." They are searching for a decision made in a particular context. Keyword search finds documents containing the word. It does not find the concept.

Why the failure modes are complementary

Semantic search fails on precision. Keyword search fails on vocabulary variation. These are almost perfectly opposite weaknesses, which means combining them covers the ground that either alone misses.

M-1's retrieval scoring combines semantic similarity (Sₛₑₘ) and lexical precision (Sₗₑₓ) as separate, weighted signals rather than relying on either alone. A memory can score highly on semantic similarity, highly on lexical precision, or both. The combination means that exact matches are not drowned out by conceptually similar but less relevant results, and that paraphrased or semantically related memories are not missed because the vocabulary does not align.
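
To make the idea concrete, here is a minimal sketch of one common way to blend two signals: a weighted sum. This is an assumption for illustration, not M-1's actual formula, which this post does not spell out. The `hybrid_score` function, the equal default weights, and the candidate scores are all invented.

```python
def hybrid_score(s_sem: float, s_lex: float,
                 w_sem: float = 0.5, w_lex: float = 0.5) -> float:
    """Blend a semantic and a lexical score with a simple weighted sum."""
    return w_sem * s_sem + w_lex * s_lex


# Illustrative scores only, not measured values.
candidates = {
    "exact TypeError memory": {"s_sem": 0.62, "s_lex": 1.0},
    "general debugging memory": {"s_sem": 0.70, "s_lex": 0.0},
}

by_semantic = sorted(
    candidates, key=lambda name: candidates[name]["s_sem"], reverse=True
)
by_hybrid = sorted(
    candidates, key=lambda name: hybrid_score(**candidates[name]), reverse=True
)
```

With these made-up numbers, ranking by semantic similarity alone puts the general debugging memory first. The blended score puts the exact match first, because its lexical signal is no longer ignored.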

The practical effect is visible in LongMemEval's results. Questions that require retrieving specific facts, names, decisions, or identifiers reward lexical precision. Questions that require connecting information across sessions and vocabulary reward semantic similarity. A system that optimises for one at the expense of the other will underperform on whichever category it deprioritises. M-1's multi-signal approach performs strongly across both.

What this means in practice

For developers building agent memory, the choice between semantic and keyword search is a false one. The question is not which to use. It is how to weight and combine them for a given retrieval context.

A user identifier should weight lexical precision heavily. A preference expressed in natural language should weight semantic similarity heavily. A question about a past decision may need both: semantic similarity to find the right context, lexical precision to surface the exact phrasing of the decision.
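
One way to picture that per-query weighting is a small rule-based chooser. Everything here is a hypothetical heuristic: the regular expressions, the `choose_weights` name, and the specific weight values are assumptions made for this sketch, and a production system would likely need something more robust.

```python
import re

# Hints that the query wants an exact string: quoted phrases,
# tokens containing digits (IDs, versions), exception-style names.
EXACT_HINTS = [
    re.compile(r'"[^"]+"'),
    re.compile(r"\b\w*\d\w*\b"),
    re.compile(r"\b\w+Error\b"),
]

# Hints that the query is about a past decision, which may need both signals.
DECISION_HINT = re.compile(r"\b(decide|decided|decision|agreed)\b", re.IGNORECASE)


def choose_weights(query: str) -> tuple[float, float]:
    """Return (semantic_weight, lexical_weight) for a query."""
    if any(pattern.search(query) for pattern in EXACT_HINTS):
        return (0.3, 0.7)  # identifiers: lean on lexical precision
    if DECISION_HINT.search(query):
        return (0.5, 0.5)  # decisions: balance context and exact phrasing
    return (0.7, 0.3)      # natural-language preferences: lean on meaning
```

A query such as "user 48213" would lean lexical under these rules, "what did we decide about the timeline" would be balanced, and a free-form preference question would lean semantic.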

Building this weighting and combination layer on top of a vector database or a keyword search index is non-trivial. It requires understanding the query type, adjusting retrieval strategy accordingly, and assembling results from multiple retrieval passes into a coherent context. It is also the kind of infrastructure that does not differentiate your product. It is table stakes for reliable agent memory.
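
As a taste of the assembly step, here is a minimal sketch of reciprocal rank fusion, a widely used technique for merging ranked lists from separate retrieval passes. It is named here as a stand-in, not as the method any specific product uses. The memory IDs are invented, and `k=60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists; each appearance adds 1 / (k + rank) to an ID's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, memory_id in enumerate(ranking, start=1):
            scores[memory_id] += 1.0 / (k + rank)
    # Highest fused score first; break ties by ID so the order is deterministic.
    return sorted(scores, key=lambda memory_id: (-scores[memory_id], memory_id))


semantic_hits = ["m2", "m1", "m3"]  # hypothetical result of a vector search pass
keyword_hits = ["m1", "m4"]         # hypothetical result of a keyword search pass
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
```

The memory that appears in both passes rises to the top of the fused list, ahead of memories that only one pass found. Rank fusion sidesteps the problem of comparing raw scores from two different scoring systems, but it is only one piece of the layer described above.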

Exabase's Memory API handles hybrid retrieval out of the box. Semantic and lexical signals are combined at the retrieval layer, weighted appropriately for the query context, without any configuration required. See the docs for more.