
How to cut your token costs by up to 90%

Three patterns to reduce your token costs

Jonathan Bree


Most teams building AI agents hit their first shocking LLM cost spike and blame traffic. More users means more requests means more money, and that makes some sense. But the invoices tell a different story: the growth tracks context rather than users. Longer conversations, bigger knowledge bases, more history injected per call.


This guide walks through three patterns that reduce token spend using structured memory and precise retrieval. We use Exabase for the examples, but the patterns apply regardless of tooling.


Where your tokens actually go

A typical AI agent request has four parts: a system prompt, the conversation history, retrieved documents, and the user message.

The system prompt and user message are mostly fixed. The two things that grow without bound are conversation history and retrieved documents, and those are also the two things most teams handle naively by stuffing everything in and hoping the model figures it out.

Every token you send that does not help the model answer this specific query is waste, and not just in cost terms. Models degrade with irrelevant context, so you are paying more to get worse answers.
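To see where your own tokens go, it helps to log a per-component breakdown for each request. The sketch below is one way to do that. The `PromptParts` shape and the function names are hypothetical, and the 4-characters-per-token estimate is only a rough heuristic for English text, so swap in your model's tokenizer if you need exact counts.

```typescript
// Rough heuristic (assumption): about 4 characters per token for English text.
// Use your model's real tokenizer when you need exact numbers.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Hypothetical shape for the four parts of a request.
interface PromptParts {
  systemPrompt: string;
  conversationHistory: string[];
  retrievedDocuments: string[];
  userMessage: string;
}

interface TokenBreakdown {
  systemPrompt: number;
  conversationHistory: number;
  retrievedDocuments: number;
  userMessage: number;
  total: number;
}

// Sum the estimated tokens of a list of strings.
function sumTokens(texts: string[]): number {
  let total = 0;
  for (const text of texts) {
    total += estimateTokens(text);
  }
  return total;
}

// Estimate how many tokens each part of the prompt contributes.
function tokenBreakdown(parts: PromptParts): TokenBreakdown {
  const systemPrompt = estimateTokens(parts.systemPrompt);
  const conversationHistory = sumTokens(parts.conversationHistory);
  const retrievedDocuments = sumTokens(parts.retrievedDocuments);
  const userMessage = estimateTokens(parts.userMessage);
  return {
    systemPrompt,
    conversationHistory,
    retrievedDocuments,
    userMessage,
    total: systemPrompt + conversationHistory + retrievedDocuments + userMessage,
  };
}
```

Logging this breakdown per request for a few days is usually enough to show whether history or retrieved documents dominate your spend.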


Pattern 1: Replace conversation history with extracted memory

Raw conversation history is the largest single source of token waste in most agent architectures.

The common approach: append the full conversation history to every request. A 15-message thread might be 8,000 tokens, and by message 30 you are at 15,000+. If the user asks "what's my account status," you are sending all of that context so the model can find the one exchange where they mentioned their plan.

The structural fix: extract facts from conversations into memory, then retrieve only what is relevant.

// After a conversation, extract what matters
await exabase.memories.create({
  source: 'text',
  content: conversationTranscript, // 8,000 tokens of conversation
});
// Exabase extracts: "User is on Enterprise plan. Migrated from Free on June 3.
// Prefers email notifications. Primary use case is contract analysis."
// → ~50 tokens of structured facts

On the next request, instead of injecting the full history:

// Retrieve only what's relevant to this query
const memories = await exabase.memories.search({
  query: userMessage, // "what's my account status?"
  rerankThreshold: 0.5,
});

// Returns 2-3 relevant memories ≈ 100-200 tokens
// vs. 8,000+ tokens of raw conversation history

Why this works better than conversation chunking: some tools chunk your conversation history and retrieve relevant chunks, which helps, but a conversation chunk still contains filler like greetings, the user rephrasing their question, and the assistant's preamble. A memory is the distilled fact. "User is on Enterprise plan, migrated June 3" is 10 tokens, while the original exchange where they discussed upgrading might be 400.

There is also a correctness advantage. If a user was on the free plan in message 5 and upgraded in message 22, a chunk retriever might surface both, and the model then has to reconcile conflicting information. Exabase's memory engine resolves contradictions at write time, so the old fact is updated rather than duplicated. The model gets one clean answer instead of two conflicting chunks.
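To make the idea of resolving contradictions at write time concrete, here is a deliberately simplified sketch. It is not how Exabase's memory engine works internally (the source does not describe that). It assumes facts have already been extracted into a hypothetical `subject` / `attribute` / `value` shape, and it keeps only the most recently observed value for each pair.

```typescript
// Hypothetical fact shape; a real extraction step would produce something richer.
interface Fact {
  subject: string;    // e.g. "user:42"
  attribute: string;  // e.g. "plan"
  value: string;      // e.g. "Enterprise"
  observedAt: number; // message index or timestamp
}

class FactStore {
  private facts = new Map<string, Fact>();

  private keyFor(subject: string, attribute: string): string {
    return subject + "|" + attribute;
  }

  // Write-time resolution: a newer observation replaces the older one
  // instead of being stored alongside it.
  upsert(fact: Fact): void {
    const key = this.keyFor(fact.subject, fact.attribute);
    const existing = this.facts.get(key);
    if (existing === undefined || fact.observedAt >= existing.observedAt) {
      this.facts.set(key, fact);
    }
  }

  get(subject: string, attribute: string): Fact | undefined {
    return this.facts.get(this.keyFor(subject, attribute));
  }

  get size(): number {
    return this.facts.size;
  }
}

const store = new FactStore();
store.upsert({ subject: "user:42", attribute: "plan", value: "Free", observedAt: 5 });
store.upsert({ subject: "user:42", attribute: "plan", value: "Enterprise", observedAt: 22 });
```

After both writes the store holds a single plan fact, so a reader never sees the stale "Free" value. A chunk retriever has no equivalent step: both exchanges stay in the index and both can be returned.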


Rough token math:


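| Conversation length     | Conversation stuffing    | Memory retrieval      |
|-------------------------|--------------------------|-----------------------|
| 10-message conversation | ~6,000 tokens injected   | ~200 tokens retrieved |
| 30-message conversation | ~18,000 tokens injected  | ~300 tokens retrieved |
| 50-message conversation | ~30,000+ tokens injected | ~350 tokens retrieved |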


Memory retrieval stays nearly flat as conversations grow because you retrieve only the facts relevant to the current query, not a history that gets longer with every message. The gap between the two approaches widens with each additional message.

The tradeoff to be honest about: extraction is not free. When Exabase processes a conversation into memories, that processing has a cost. But you pay it once per conversation and you save on every subsequent request that would have injected that history. If a user sends 20 messages in a session, you have paid the extraction cost once and avoided inflated context on 19 requests. It amortises fast.
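A quick way to sanity-check the amortisation claim is to compute the break-even point. The sketch below treats every token as costing the same, which is a simplification (extraction also produces output tokens, which are usually priced higher). The 10,000-token extraction cost is an assumed figure for illustration, not a measured one.

```typescript
interface AmortisationInput {
  extractionTokens: number;        // one-off tokens spent extracting memories (assumed)
  historyTokensPerRequest: number; // what you would inject with raw history
  memoryTokensPerRequest: number;  // what you inject with retrieved memories
}

// Net tokens saved after a given number of follow-up requests.
function netTokensSaved(input: AmortisationInput, requests: number): number {
  const savedPerRequest = input.historyTokensPerRequest - input.memoryTokensPerRequest;
  return savedPerRequest * requests - input.extractionTokens;
}

// Number of follow-up requests needed before extraction has paid for itself.
function breakEvenRequests(input: AmortisationInput): number {
  const savedPerRequest = input.historyTokensPerRequest - input.memoryTokensPerRequest;
  if (savedPerRequest <= 0) {
    return Infinity; // memory retrieval is not smaller than history, so it never pays off
  }
  return Math.ceil(input.extractionTokens / savedPerRequest);
}

const example: AmortisationInput = {
  extractionTokens: 10000,
  historyTokensPerRequest: 8000,
  memoryTokensPerRequest: 200,
};

console.log(breakEvenRequests(example));  // 2
console.log(netTokensSaved(example, 19)); // 138200
```

With these assumed numbers extraction pays for itself by the second follow-up request. The exact break-even will differ for your workload, but it stays small as long as the history you would have injected is much larger than the memories you retrieve.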


Pattern 2: Replace document dumps with sub-document retrieval

The second major source of context bloat is RAG. The standard pattern is to chunk documents into roughly 500-token blocks, embed and store them, retrieve the top-K chunks by similarity, and inject all of them into the prompt.

The problem is that the top 5 chunks often add up to 2,500 tokens, and maybe one of them actually contains the answer. The rest are there because vector similarity is not the same as relevance.

The fix: search at the sub-document level and retrieve specific passages rather than fixed-size chunks.

const results = await exabase.search.query({
  query: { text: "what were the Q3 revenue projections?" },
  precision: 0.7,  // higher = fewer, more relevant results
});

// Returns:
// {
//   score: 0.94,
//   chunks: [{
//     text: "Q3 revenue reached $4.2M, up 18% from Q2...",
//     pageNumber: 7,
//     score: 0.94
//   }],
//   name: "Q3 Financial Report.pdf"
// }

Two things matter here. The precision parameter lets you explicitly trade recall for relevance: set it high when the user asks a specific question and you will get fewer results that are more likely to be the right ones, or set it low for research-style queries where breadth matters. This is a direct lever on token spend per request.

Results also come back with page numbers and timestamps, so you can cite sources without injecting entire documents for the model to search through.
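If you are not using a retrieval service with a precision control, you can approximate the same trade-off on your own scored chunks. The sketch below contrasts fixed top-K selection with a score threshold plus a token budget. All names here are hypothetical, and this is not a description of how Exabase implements `precision`.

```typescript
interface ScoredChunk {
  text: string;
  score: number;  // relevance score, higher is better
  tokens: number; // token count of the chunk
}

function byScoreDescending(chunks: ScoredChunk[]): ScoredChunk[] {
  return chunks.slice().sort((a, b) => b.score - a.score);
}

// Naive approach: always take the K best-scoring chunks.
function selectTopK(chunks: ScoredChunk[], k: number): ScoredChunk[] {
  return byScoreDescending(chunks).slice(0, k);
}

// Threshold approach: keep only chunks that clear minScore and fit the budget.
function selectByThreshold(
  chunks: ScoredChunk[],
  minScore: number,
  maxTokens: number,
): ScoredChunk[] {
  const selected: ScoredChunk[] = [];
  let used = 0;
  for (const chunk of byScoreDescending(chunks)) {
    if (chunk.score < minScore) break;             // everything after this scores lower
    if (used + chunk.tokens > maxTokens) continue; // skip chunks that would overflow the budget
    selected.push(chunk);
    used += chunk.tokens;
  }
  return selected;
}

function totalTokens(chunks: ScoredChunk[]): number {
  let total = 0;
  for (const chunk of chunks) {
    total += chunk.tokens;
  }
  return total;
}

const candidates: ScoredChunk[] = [
  { text: "chunk A", score: 0.94, tokens: 500 },
  { text: "chunk B", score: 0.71, tokens: 500 },
  { text: "chunk C", score: 0.55, tokens: 500 },
  { text: "chunk D", score: 0.52, tokens: 500 },
  { text: "chunk E", score: 0.48, tokens: 500 },
];

console.log(totalTokens(selectTopK(candidates, 5)));               // 2500
console.log(totalTokens(selectByThreshold(candidates, 0.7, 2000))); // 1000
```

Raising the threshold trims the low-scoring tail that top-K would have injected anyway. The right cut-off depends on how your scores are distributed, so treat 0.7 as a starting point to tune rather than a recommendation.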


Rough token math:


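| Query type           | Standard top-5 RAG      | Precise sub-document retrieval       |
|----------------------|-------------------------|--------------------------------------|
| Specific question    | ~2,500 tokens (5 × 500) | ~400-800 tokens (1-2 precise chunks) |
| Broad research query | ~2,500 tokens           | ~1,500 tokens (3-4 relevant chunks)  |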

The savings are smaller than Pattern 1 but they stack. The quality improvement is also worth noting: giving the model 800 tokens of exactly the right context instead of 2,500 tokens of roughly related context often means shorter and more accurate outputs, which saves on output tokens as well.


Pattern 3: Replace LLM calls with memory lookups

Some questions your agent handles do not need an LLM at all. What plan is this user on, have they contacted support before, what is their preferred language — these are lookups, not generation tasks. But if the only place that information lives is in past conversations, you either need an LLM to go find it or you stuff the history and hope the model picks it up.

With structured memory, these become API calls:

// Instead of an LLM call with 5,000 tokens of context:
const result = await exabase.memories.search({
  query: "user plan tier",
});
// → { content: "User is on Enterprise plan", score: 0.89 }
// 200ms, zero LLM tokens

This will not replace your core agent logic, but in a typical session there are several moments where the agent needs a fact it already knows, and each one you handle with a memory lookup instead of an LLM inference is an entire call you are not paying for.
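In practice you need a rule for deciding when a lookup is good enough to skip the model. One option is to isolate that decision as a small pure function, sketched below. The `MemoryHit` shape mirrors the `content` and `score` fields shown in the result above. The `Route` type, the function name, and the 0.8 cut-off are assumptions for illustration.

```typescript
interface MemoryHit {
  content: string;
  score: number;
}

// Either answer directly from memory, or fall back to an LLM call
// and pass along whatever memories were found as context.
type Route =
  | { kind: "memory"; answer: string }
  | { kind: "llm"; context: string[] };

function chooseRoute(hits: MemoryHit[], minScore: number): Route {
  // Find the best-scoring hit, if any.
  let best: MemoryHit | null = null;
  for (const hit of hits) {
    if (best === null || hit.score > best.score) {
      best = hit;
    }
  }
  // Only skip the LLM when the best hit is confident enough.
  if (best !== null && best.score >= minScore) {
    return { kind: "memory", answer: best.content };
  }
  return { kind: "llm", context: hits.map((h) => h.content) };
}

const confident = chooseRoute([{ content: "User is on Enterprise plan", score: 0.89 }], 0.8);
const uncertain = chooseRoute([{ content: "User asked about pricing", score: 0.41 }], 0.8);
console.log(confident.kind); // "memory"
console.log(uncertain.kind); // "llm"
```

Keeping the decision separate from the network calls makes the threshold easy to test and tune. Start strict, and loosen it only once your logs show the memory answers match what the model would have said.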


Putting it together

Here is what a single request might look like before and after:

Before — naive context construction:




After — memory + precise retrieval:




That is roughly a 90% reduction in input tokens, from about 15,100 to about 1,450 in this example, on a request that, to the user, works the same or better.

To put a number on it: at GPT-4o's input pricing ($2.50/M tokens), 1,000 requests per day at 15,100 tokens each costs approximately $1,133 per 30-day month on input alone, while the same volume at 1,450 tokens each costs approximately $109 per month.

These numbers are illustrative and your actual savings depend on conversation length, document corpus size, and query patterns. But the structural argument holds: extracted memory and precise retrieval scale flat, while conversation stuffing and naive RAG scale linearly with usage.
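To rerun this arithmetic with your own figures, a small helper is enough. The sketch below uses the per-request token counts and price from the example above, a 30-day month, and a baseline volume of 1,000 requests per day that you can scale linearly to your own traffic. The function names are hypothetical.

```typescript
// Monthly input-token cost in dollars.
function monthlyInputCost(
  requestsPerDay: number,
  tokensPerRequest: number,
  pricePerMillionTokens: number,
  daysPerMonth: number = 30,
): number {
  const tokensPerMonth = requestsPerDay * tokensPerRequest * daysPerMonth;
  return (tokensPerMonth / 1e6) * pricePerMillionTokens;
}

// Percentage reduction in tokens per request.
function percentReduction(before: number, after: number): number {
  return (1 - after / before) * 100;
}

const costBefore = monthlyInputCost(1000, 15100, 2.5); // 1132.5
const costAfter = monthlyInputCost(1000, 1450, 2.5);   // 108.75
const reduction = percentReduction(15100, 1450);       // about 90.4

console.log(costBefore, costAfter, reduction.toFixed(1));
```

Because cost is linear in request volume, ten times the traffic means ten times both figures, while the percentage reduction stays the same.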


Build vs. buy

You can build this yourself. You would need a memory extraction pipeline, with LLM calls to extract facts from conversations, deduplication logic, and contradiction handling. You would need a storage layer with semantic search, an embedding pipeline, and reranking, plus document extraction and chunking for PDFs, OCR, and audio transcription. If you have multiple users, you would need multi-tenant isolation via something like Bases. And you would need maintenance automation with Workers to keep memories current and prune stale data.

Teams that build this typically spend two to four months on infrastructure before shipping a feature, and then they maintain it.

Exabase gives you Memory, Deep Search, Extract, and per-tenant Bases through one SDK. The token savings come from the architecture: extracted facts instead of raw history, sub-document retrieval instead of chunk dumps. That architecture is what takes months to build well.


What this does not cover

This is about the context layer. There are other real ways to reduce LLM costs, including prompt caching (supported natively by OpenAI, Anthropic, and Google), moving static instructions to system messages, using structured output schemas, and routing simple tasks to smaller models. Those are all worth doing, but they are developer practices rather than infrastructure problems. Use them alongside better context management, not instead of it.


Start here

If you want to test this against your own workload:

import { Exabase } from "@exabase/sdk";

const exabase = new Exabase({ apiKey: process.env.EXABASE_API_KEY });

// 1. Feed in a real conversation
await exabase.memories.create({
  source: 'text',
  content: yourConversationHistory,
});

// 2. On the next request, retrieve instead of stuffing
const memories = await exabase.memories.search({
  query: incomingUserMessage,
  rerankThreshold: 0.5,
});

// 3. Count the tokens — compare what you were injecting
//    vs. what you're injecting now
console.log(memories.hits.map(m => m.content).join('\n'));


The clearest way to validate this is your own token logs before and after. That is also where you will find the patterns specific to your product: which queries benefit most, where memory retrieval is sufficient, and where you still need full context.


Exabase docs · Memory API · Deep Search