Use cases

Contract analysis and search

Extract text from hundreds of contracts, store them as searchable resources, and find specific clauses, dates, obligations, and parties across the whole corpus by meaning.

A contract corpus is full of answers nobody can get to quickly. Somewhere across hundreds of agreements is every non-compete clause, every auto-renewal date, every indemnity obligation, every counterparty, but finding them means opening contracts one by one, because the corpus isn't searchable in any way that understands what's in it. The question "show me every non-compete clause across all our vendor agreements" is simple to ask and, without the right infrastructure, a manual slog to answer.

This page is for developers building contract analysis and search: a contract review tool, a clause-search feature, a due-diligence assistant. Extract pulls clean text from the PDFs, Resources store them as a searchable corpus, and Deep Search finds clauses, dates, obligations, and parties by meaning rather than exact words, so that question returns actual results instead of a reading list.

A scope note: this page is about infrastructure for searching and analysing contracts, not legal advice or any judgement about a contract's effect or enforceability. Those are matters for qualified review.

The problem

The first obstacle is getting clean text out of the contracts. Legal documents arrive as PDFs and scans in inconsistent formats, often long, and before anything can be searched the text has to come out reliably with its structure intact. Across hundreds of agreements that's an extraction problem in its own right.

The second is that contract language defeats keyword search. The clause you want rarely contains the words you'd search for: a non-compete might never use the phrase "non-compete," an indemnity might be worded six different ways across six agreements, an obligation might be buried in defined terms. Keyword search over a contract corpus misses what's phrased differently from your query, which in legal drafting is most of it. You need search that matches on meaning, so a query for non-compete provisions surfaces them however they're worded.
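The gap is easy to reproduce. Below is a minimal sketch, using three invented clauses and a plain literal-phrase search, of how a keyword query for "non-compete" finds the clause that uses the label and misses the one that expresses the same restriction in different words. The file names and clause text are made up for illustration.

```python
import re

# Illustrative clauses written for this example, not taken from any real agreement.
clauses = {
    "vendor_a.pdf": "Non-Compete. Vendor shall not compete with Client during the Term.",
    "vendor_b.pdf": (
        "Restrictive Covenant. For twelve months after termination, Supplier "
        "will not engage in any business that is substantially similar to "
        "the business of Customer."
    ),
    "vendor_c.pdf": "Governing Law. This Agreement is governed by the laws of Ireland.",
}


def keyword_search(query: str, corpus: dict[str, str]) -> list[str]:
    """Return the documents whose text contains the query as a literal phrase."""
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    return [name for name, text in corpus.items() if pattern.search(text)]


hits = keyword_search("non-compete", clauses)
print(hits)  # ['vendor_a.pdf']
```

The clause in vendor_b.pdf is a non-compete in substance, but it never uses the phrase, so the literal search does not return it.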

The third is scale. A real corpus is large, and search quality has to hold across all of it, because the whole point is to find every instance of something, not the handful that happen to match your phrasing. Naive vector search degrades as the corpus grows (the semantic collapse problem), which is exactly the wrong failure mode when completeness is what matters. Building all three (clean extraction, meaning-based search, and quality that holds at scale) is the work that stands between a pile of contracts and a searchable corpus.

What Exabase unlocks

With clean extraction and search that works on meaning across the whole corpus, contract questions get answered by query rather than by reading.

Hundreds of contracts become clean, searchable text. Whatever format they arrive in, including scans, the text comes out with structure preserved and page numbers attached, so a result can point back to the exact page in the exact agreement. The corpus stops being a folder of PDFs and becomes something you can search.

Clauses, dates, obligations, and parties become findable by meaning. "Every non-compete clause across all vendor agreements" returns the actual clauses, including the ones that never use the word, because the search matches intent rather than phrasing. A query for a type of obligation finds each instance across the corpus, not just the documents that share your wording.

And the search is complete across the whole corpus. Because retrieval quality holds at scale, a query runs across the entire body of contracts and returns the matches wherever they are, which is what makes "show me every instance" a question you can actually trust the answer to.

How it works

Three primitives carry contract search: Extract gets the text out, Deep Search finds what matters in it, and Resources hold the corpus that search runs over.

Extract

Extract turns the contracts into clean, structured text through one API. You submit PDFs or scans and get back text chunked with page numbers, so a found clause can be traced to its exact location in the agreement. It handles long documents, which contracts often are, and processes a large corpus with retries and webhooks so a batch of hundreds can run without manual babysitting. The general case is covered in document extraction at scale.
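To make "text chunked with page numbers" concrete, here is a small sketch of what such chunks could look like and how a page number lets a result point back to its source. The `Chunk` fields and the `chunks_from_pages` helper are assumptions made for illustration; they are not Extract's actual response format, which the API reference defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical shape of one extracted chunk; field names are assumptions."""
    document: str  # source file name
    page: int      # 1-based page the text came from
    text: str


def chunks_from_pages(document: str, pages: list[str]) -> list[Chunk]:
    """Split each page into paragraph chunks, keeping the page number on each."""
    chunks = []
    for page_number, page_text in enumerate(pages, start=1):
        for paragraph in page_text.split("\n\n"):
            paragraph = paragraph.strip()
            if paragraph:
                chunks.append(Chunk(document, page_number, paragraph))
    return chunks


def cite(chunk: Chunk) -> str:
    """Format a chunk's location so a result can point back to its page."""
    return f"{chunk.document}, p. {chunk.page}"


# Two pages of invented contract text.
pages = [
    "1. Term.\n\n2. Fees.",
    "3. Indemnity. Supplier shall hold Customer harmless.",
]
chunks = chunks_from_pages("msa.pdf", pages)
print(cite(chunks[2]))  # msa.pdf, p. 2
```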

Deep Search

Deep Search is what finds clauses and terms across the corpus. It searches at the paragraph level by meaning, so a provision surfaces regardless of how it's worded, and it's hybrid, so a specific defined term, party name, or section reference still matches exactly when you need a precise hit. It holds retrieval quality as the corpus grows, avoiding the semantic collapse that would make a large-corpus search return fuzzy or incomplete results, which matters most when completeness is the goal.
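The idea of hybrid retrieval can be shown with a toy example. The sketch below blends a meaning signal with an exact-term signal so that a paragraph matching both ranks first. It is not how Deep Search is implemented: the `CONCEPTS` table is a hand-written stand-in for a real semantic model, and the function names, the `weight` parameter, and the sample paragraphs are all assumptions made for illustration.

```python
# Toy stand-in for a semantic model: a concept mapped to phrasings that express it.
CONCEPTS = {
    "non-compete": ["not compete", "substantially similar", "competing business"],
    "indemnity": ["indemnify", "hold harmless"],
}


def semantic_score(query: str, text: str) -> float:
    """1.0 if the text expresses the query's concept in any known phrasing, else 0.0."""
    concept = query.lower()
    phrasings = [concept] + CONCEPTS.get(concept, [])
    lowered = text.lower()
    return 1.0 if any(phrase in lowered for phrase in phrasings) else 0.0


def exact_score(term: str, text: str) -> float:
    """1.0 if the exact term (case-sensitive) appears in the text, else 0.0."""
    return 1.0 if term in text else 0.0


def hybrid_search(query, paragraphs, exact_term=None, weight=0.5):
    """Rank paragraphs by a weighted blend of meaning and exact-term signals."""
    results = []
    for paragraph in paragraphs:
        score = semantic_score(query, paragraph)
        if exact_term is not None:
            score = weight * score + (1 - weight) * exact_score(exact_term, paragraph)
        if score > 0:
            results.append((score, paragraph))
    # sorted() is stable, so equal scores keep their original order.
    return sorted(results, key=lambda item: item[0], reverse=True)


paragraphs = [
    "Supplier will not engage in any business substantially similar to that of Acme Ltd.",
    "Vendor shall not compete with Client during the Term.",
    "Acme Ltd shall pay all fees within 30 days.",
    "This Agreement is governed by the laws of Ireland.",
]

by_meaning = hybrid_search("non-compete", paragraphs)
with_party = hybrid_search("non-compete", paragraphs, exact_term="Acme Ltd")
```

Searching by meaning alone returns the first two paragraphs, neither of which contains the word "non-compete". Adding the exact party name moves the paragraph that matches both signals to the top.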

Resources

Resources are the searchable contract corpus. Each extracted contract becomes a Resource, indexed for Deep Search, so the agreements live as one searchable body rather than loose files. Storing them as Resources is what connects extraction to search and lets a query run across the entire corpus at once.

Example architecture

The flow is extract, store, search.

Build the corpus. Run each contract through Extract to get clean text with page numbers, using webhooks to track completion across a large batch, and store the results as Resources, indexed for Deep Search.

Search across everything. An agent or user runs a Deep Search over the whole corpus for a clause type, obligation, date, or party, and gets back the matching passages with their location, found by meaning.

Scope by matter or client if needed. For a tool serving multiple clients or separating matters, put each corpus in its own Base for isolation.

Keep it current. As new contracts are added, extract and index them the same way, or use a Worker to process new agreements automatically.

Contracts flow in through extraction into a searchable corpus, and a query finds clauses and terms across all of them by meaning. The pile of PDFs becomes a corpus you can interrogate.
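The shape of that flow can be sketched with in-memory stand-ins. Everything named here (`Resource`, `Base`, `extract`, the client keys) is hypothetical and exists only to show the order of operations and the per-client isolation; it is not the Exabase SDK, and the placeholder search is a literal phrase match where the real search works by meaning.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Hypothetical stand-in for one stored contract: a name plus its page texts."""
    name: str
    pages: list[str]


@dataclass
class Base:
    """Hypothetical stand-in for an isolated corpus, one per client or matter."""
    name: str
    resources: list[Resource] = field(default_factory=list)

    def add(self, resource: Resource) -> None:
        self.resources.append(resource)

    def search(self, phrase: str) -> list[tuple[str, int]]:
        """Placeholder search: literal phrase match returning (document, page)."""
        needle = phrase.lower()
        return [
            (resource.name, page_number)
            for resource in self.resources
            for page_number, text in enumerate(resource.pages, start=1)
            if needle in text.lower()
        ]


def extract(name: str, raw_pages: list[str]) -> Resource:
    """Stand-in for extraction: normalise whitespace and keep page order."""
    return Resource(name, [" ".join(page.split()) for page in raw_pages])


# 1. Build one corpus per client. 2. Extract and store. 3. Search within a corpus.
bases = {"client_a": Base("client_a"), "client_b": Base("client_b")}
bases["client_a"].add(
    extract("msa.pdf", ["Term  and\nrenewal.", "This Agreement renews automatically each year."])
)
bases["client_b"].add(extract("nda.pdf", ["Confidential information stays confidential."]))

hits = bases["client_a"].search("renews automatically")
print(hits)  # [('msa.pdf', 2)]
```

Adding a new contract later is the same two calls, `extract` then `add`, and a search against one client's corpus never sees another client's documents.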

What compounds over time

A contract search corpus gets more valuable as it grows, whereas a pile of PDFs only gets harder to navigate.

Every contract added enlarges the searchable corpus, and because each is extracted once, the cost of making it searchable is paid a single time while the value lasts. A growing body of agreements becomes a more complete answer to questions like "where do we have this obligation" or "which contracts renew automatically," precisely because more of the corpus is covered. And because Deep Search holds quality at scale, a corpus of thousands of contracts stays as searchable, and as complete in its results, as a small one.

Building this yourself means maintaining an extraction pipeline and a search index that both get heavier as the corpus grows. Search quality, the thing that determines whether "show me every instance" is trustworthy, is exactly what tends to degrade at scale when retrieval isn't built for it. Infrastructure that holds search quality as the corpus grows is the right bet when the corpus only ever gets larger and completeness is the point.

Who's building this

Teams building contract review tools, clause libraries, due-diligence assistants, and contract-lifecycle features, anywhere a body of agreements needs to be searched for specific clauses, terms, obligations, or parties.

This is closely related to legal AI agents for case research, which covers the broader case-research workflow including matter memory across sessions; this page focuses specifically on searching a contract corpus. Document extraction at scale covers the ingestion side, and multi-tenant memory for SaaS covers per-client isolation if you're building this as a product.

Get started

Start with the getting started guide, then read about extraction and submitting jobs to build the corpus, and searching resources to query it. There's a free tier to build against. As with any legal use, judgements about a contract's effect are for qualified review.

FAQs

Can it handle scanned contracts, not just digital PDFs?

Yes. Extract reads scans as well as native PDFs and returns clean text with page numbers, so a found clause can be traced back to its exact page in the agreement.

How does it find a clause that doesn't use the obvious term?

Deep Search matches by meaning, so a non-compete or indemnity surfaces however it's worded. It's also hybrid, so a specific defined term, party name, or section reference still matches exactly when you need a precise hit.

Can I really search every contract at once?

Yes. Stored as Resources, the whole corpus is searched together, and because Deep Search holds quality at scale, a query like "every non-compete across all vendor agreements" returns matches from across the entire body rather than just the closest-phrased few.

Will results stay complete as the corpus grows into thousands of contracts?

That scale is where naive vector search tends to degrade through semantic collapse, returning fuzzier and less complete results. Deep Search is built to hold retrieval quality as the corpus grows, which matters most when completeness is the goal.

How do I keep contract corpora separate for different clients?

If you're building contract search as a product for multiple clients, put each corpus in its own Base for full isolation, following the multi-tenant SaaS pattern.

How is this different from the legal AI agents use case?

Legal AI agents cover the broader case-research workflow, including remembering matter context across sessions with Memory. This page focuses specifically on extracting and searching a contract corpus for clauses and terms. If your need is primarily clause search across agreements, this is the focused fit; if it's ongoing case research, that page is.