What is context window overflow and how to solve it

Context window overflow is when your agent's history no longer fits in a single prompt and information starts getting cut. Here is why it happens and how to avoid it.

Jonathan Bree

Every agent developer hits the wall where context fills up and the agent starts losing information. The instinct is to get a bigger context window. That is the wrong solution to the right problem.

It starts subtly. Your agent gives an answer that contradicts something the user told it two sessions ago. It asks for information it already has. It loses track of a decision made earlier in a long conversation. Token limit errors start appearing in your logs. The agent that worked well in development is becoming unreliable in production.

This is context window overflow. And the way most developers try to fix it makes the underlying problem worse.

What is actually happening

Every language model has a context window: a limit on how many tokens it can process in a single inference call. Everything the model knows about the current situation (the system prompt, conversation history, retrieved documents, and user instructions) must fit within that window. When it does not fit, something gets cut.
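As a rough sketch of that budget check, the snippet below sums the token counts of each prompt component and reports whether they fit. The whitespace-based `count_tokens` is a crude stand-in for a real tokenizer, and the 8,000-token limit and the component names are illustrative assumptions, not values from any particular model.

```python
def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def fits_in_window(components: dict[str, str], window_limit: int) -> tuple[bool, int]:
    """Return (fits, total_tokens) for the combined prompt components."""
    total = sum(count_tokens(text) for text in components.values())
    return total <= window_limit, total


# Hypothetical prompt components; the history and documents are padded to simulate a long session.
prompt_parts = {
    "system_prompt": "You are a helpful assistant.",
    "history": "user: hello " * 3000,
    "retrieved_docs": "doc text " * 1500,
    "user_message": "What did we decide last month?",
}

fits, total = fits_in_window(prompt_parts, window_limit=8000)
print(f"total tokens: {total}, fits: {fits}")
```

In this toy example the history alone takes most of the budget, so the combined prompt does not fit and something would have to be cut.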

The context window is not a bug. It is a fundamental constraint of how transformer models work. The cost of standard self-attention scales quadratically with sequence length, so processing very long contexts is computationally expensive, slower, and at a certain point actively counterproductive.

This last point is less well understood than the token limit itself. Most developers treat the context window as a container: bigger is better, fill it as much as possible. The research suggests otherwise.

Why more context is not the answer

A well-documented finding in NLP research, sometimes called the "lost in the middle" problem, shows that language models systematically underperform on information placed in the middle of long contexts. Information at the beginning and end of a context is recalled more reliably than information buried in the middle. A longer context does not mean better recall of everything in it. It means more opportunities for relevant information to end up in the wrong place.

Needle-in-a-haystack evaluations, which test whether models can retrieve specific facts from within long contexts, consistently show accuracy degrading as context length increases. The model has access to the information. It still misses it more often when that information is surrounded by more noise.

Inference cost and latency compound the problem. Context length drives token usage, which drives cost and response time. An agent that tries to maintain a full conversation history in every prompt becomes progressively slower and more expensive to run as sessions accumulate, without becoming more accurate.
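A back-of-the-envelope calculation makes the growth concrete. The sketch below assumes every turn adds a fixed number of tokens and that the full history is resent on each call. It compares that with a hypothetical agent that sends a fixed-size context per call. The figures (100 turns, 200 tokens per turn, a 2,000-token budget) are invented for illustration.

```python
def cumulative_tokens_full_history(turns: int, tokens_per_turn: int) -> int:
    # Turn k resends all k turns so far, so its prompt holds k * tokens_per_turn tokens.
    return sum(k * tokens_per_turn for k in range(1, turns + 1))


def cumulative_tokens_fixed_budget(turns: int, budget_per_call: int) -> int:
    # Every call sends the same bounded context, regardless of session length.
    return turns * budget_per_call


full = cumulative_tokens_full_history(turns=100, tokens_per_turn=200)
fixed = cumulative_tokens_fixed_budget(turns=100, budget_per_call=2000)
print(f"full history: {full:,} tokens, fixed budget: {fixed:,} tokens")
```

Under these assumptions the full-history agent processes 1,010,000 prompt tokens over the session against 200,000 for the fixed budget, and the gap widens with every additional turn because the first total grows with the square of the turn count.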

The finding from M-1's retrieval depth analysis on LongMemEval illustrates the same principle from the retrieval side: performance peaks at top-50 retrieved memories and slightly declines at top-200. Surfacing more context does not improve the answer when the additional context introduces noise alongside the relevant signal. This is not a specific weakness of M-1. It is a property of how language models process context generally. More is not better. Precise is better.

Why the naive solutions fail

When developers hit context limits, they reach for one of two fixes.

Truncation cuts the oldest content to make room. It is simple and deterministic. It is also wrong in a specific way: relevance and recency are not the same thing. A decision made six weeks ago may be more relevant to the current question than anything said in the last five minutes. Truncating by age discards information the model actually needs and keeps information it does not.
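The failure mode is easy to reproduce. The following is a minimal sketch of age-based truncation, with a word count standing in for a tokenizer. The `Message` type, the sample history, and the 15-token budget are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Message:
    age_days: int
    text: str


def truncate_by_age(messages: list[Message], max_tokens: int) -> list[Message]:
    """Keep the newest messages that fit in max_tokens; older ones are dropped."""
    kept = []
    used = 0
    for msg in sorted(messages, key=lambda m: m.age_days):  # newest first
        cost = len(msg.text.split())  # word count as a rough token estimate
        if used + cost > max_tokens:
            break  # everything older than this point is discarded
        kept.append(msg)
        used += cost
    return kept


history = [
    Message(42, "Decision: we will bill customers annually, not monthly."),
    Message(1, "Thanks, that makes sense."),
    Message(0, "Great, talk soon."),
    Message(0, "One more thing before I go."),
]

kept = truncate_by_age(history, max_tokens=15)
for msg in kept:
    print(msg.age_days, msg.text)
```

With this budget the three recent pleasantries survive and the six-week-old billing decision is dropped, even though it is the only message likely to matter for a later question about billing.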

Summarisation condenses conversation history into a shorter representation. LangChain's ConversationSummaryMemory is the canonical example. Summarisation is better than truncation because it attempts to preserve meaning rather than just cutting. The problem is that summarisation is lossy and irreversible. The summariser decides what matters at the time of summarisation, before knowing what future questions will be asked. Detail that seemed incidental when summarised cannot be recovered when it turns out to be exactly what was needed. Summaries also flatten nuance: the specific wording of a preference, the caveats attached to a decision, the context that made a fact meaningful.

Both approaches treat the symptom. The agent has too much history to fit in the context window, so they reduce the history. Neither asks whether the history being included is the right history in the first place.

The correct framing

Context window overflow is not a size problem. It is a retrieval precision problem.

The question is not "how do we fit more into the context window?" It is "how do we ensure the right information is in the context window?" A small context containing precisely the relevant memories outperforms a large context containing everything, because the model can attend to what matters without noise degrading its accuracy.

This reframe has a practical consequence. The solution to context overflow is not a larger model, a longer context window, or a more aggressive summarisation strategy. It is a retrieval layer that selects what goes into the context with enough precision that you never need to fill the window to get a good answer.
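One way to picture such a selection step is sketched below. It assumes some upstream retrieval layer has already assigned each candidate memory a relevance score between 0 and 1. The function name, the `min_score` threshold, and the sample scores are hypothetical. Candidates are taken in descending score order, low scorers are dropped, and the token budget acts as a ceiling rather than a target.

```python
def select_for_context(
    candidates: list[tuple[float, str]],
    token_budget: int,
    min_score: float = 0.5,
) -> list[str]:
    """Pick the highest-scoring texts that fit the budget, skipping weak matches."""
    selected = []
    used = 0
    for score, text in sorted(candidates, key=lambda c: c[0], reverse=True):
        if score < min_score:
            break  # remaining candidates score even lower, so stop here
        cost = len(text.split())  # word count as a rough token estimate
        if used + cost <= token_budget:
            selected.append(text)
            used += cost
    return selected


# Hypothetical (relevance score, memory text) pairs for a question about billing.
candidates = [
    (0.92, "Decision: we will bill customers annually, not monthly."),
    (0.15, "Great, talk soon."),
    (0.61, "Customer prefers invoices by email."),
    (0.10, "Thanks, that makes sense."),
]

selected = select_for_context(candidates, token_budget=50)
print(selected)
```

Only the two relevant memories are returned even though the budget could hold all four. The window is left mostly empty on purpose.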

What precise retrieval looks like

Precise retrieval means surfacing the memories that are relevant to the current query, and only those. Not the most recent. Not the nearest by cosine distance. The most relevant, accounting for semantic meaning, lexical precision, temporal position, importance, and coherence with other retrieved memories.
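To make the idea of combining signals concrete, here is a minimal sketch of a weighted relevance score. It is not a description of how M-1 or any particular product scores memories. The weights, the 30-day half-life, the two-dimensional toy embeddings, and the `Memory` fields are assumptions chosen for illustration, and the coherence signal is left out for brevity.

```python
import math
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    embedding: list[float]  # toy vector; a real system would use a learned embedding
    age_days: float
    importance: float  # assumed to be in [0, 1]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def lexical_overlap(query: str, text: str) -> float:
    # Fraction of query words that appear verbatim in the memory text.
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def recency(age_days: float, half_life_days: float = 30.0) -> float:
    # Exponential decay: the score halves every half_life_days.
    return 0.5 ** (age_days / half_life_days)


# Hypothetical weights; in practice these would be tuned on evaluation data.
WEIGHTS = {"semantic": 0.4, "lexical": 0.3, "recency": 0.1, "importance": 0.2}


def relevance(query: str, query_embedding: list[float], memory: Memory) -> float:
    return (
        WEIGHTS["semantic"] * cosine_similarity(query_embedding, memory.embedding)
        + WEIGHTS["lexical"] * lexical_overlap(query, memory.text)
        + WEIGHTS["recency"] * recency(memory.age_days)
        + WEIGHTS["importance"] * memory.importance
    )


old_decision = Memory("we made the decision to switch to annual billing", [0.9, 0.1], 42, 0.9)
recent_chat = Memory("great, talk soon", [0.1, 0.9], 0, 0.1)

query, query_embedding = "annual billing decision", [1.0, 0.0]
ranked = sorted(
    [recent_chat, old_decision],
    key=lambda m: relevance(query, query_embedding, m),
    reverse=True,
)
print([m.text for m in ranked])
```

In this toy example the six-week-old decision ranks above the message from today, because recency is only one signal among several rather than the deciding one.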

This is what Exabase's Memory API does. Rather than returning everything and letting the model sort it out, M-1 retrieves a compact, precisely targeted context that a smaller model can act on directly. This is why a system using Gemini 3 Flash with precise retrieval outperforms systems using Gemini 3 Pro with noisier context. The model is not doing less work. It is doing the right work, on the right information, without noise getting in the way.

Context window overflow is a symptom. Imprecise retrieval is the cause. See the docs to get started with retrieval that solves the right problem.