Document extraction at scale

One API for PDFs, images, audio, video, and web pages. Submit anything, get structured text chunks back with page numbers and timestamps.


Every application that does anything useful with documents starts with the same unglamorous problem: getting clean text out of them. PDFs, scanned images, audio, video, web pages: each format wants a different parser, each parser has its own failure modes, and keeping the lot working as formats drift is a maintenance job that never ends. It's the part of the build nobody wants to own, and it sits in front of everything else.

Exabase Extract is one API for all of it. Submit a PDF, an image, an audio file, a video, or a web page, and get back structured text chunks with page numbers and timestamps, ready to feed a RAG pipeline, build a search index, or store for later search. This page is about using extraction as infrastructure instead of building and maintaining it yourself.


The problem

Extraction looks like a solved problem until you try to do it across real-world inputs at volume. A clean digital PDF is easy; a scanned one needs OCR. An image of a form needs layout understanding. Audio and video need transcription with timing. A web page needs the content pulled out of the markup. That's already several different systems, and each one is a dependency you now own, patch, and debug when an edge case breaks it in production.

Then there's scale. Processing one document is a function call; processing thousands is an infrastructure problem. You need to run jobs in parallel without falling over, handle the ones that fail without losing them, know when long-running jobs finish, and keep the throughput up when the queue is deep. Building that (the queueing, the retry logic, the failure handling, the notifications) is real backend work that has nothing to do with whatever your product actually does.

And the output has to be usable downstream. Raw extracted text isn't much good if it's an undifferentiated wall with no structure. To feed retrieval or search, you want it chunked sensibly, with page numbers so an answer can cite its source and timestamps so a moment in a recording can be located. Producing that consistently across every input format is yet more to build. None of this is your product; all of it is in the way of it.


What Exabase unlocks

With extraction as a managed API, the whole category of work disappears behind a single endpoint.

You submit anything and get back structured text. One interface covers PDFs, images, audio, video, and web pages, so you stop maintaining a parser per format and stop caring whether a given PDF happens to be scanned or digital. The output comes chunked with page numbers and timestamps, so it's ready to index, search, or hand to a model without further wrangling, and a downstream answer can point back to the exact page or moment it came from.

It runs at scale without you building the scaffolding. Submit thousands of documents and they process in parallel; jobs that fail are retried rather than silently dropped; webhooks tell you when work finishes so you're not polling. The queueing and reliability machinery that you'd otherwise build and operate is part of the service.

And the output drops straight into the rest of your system. Extracted content becomes Resources, indexed for Deep Search and ready to feed memory or retrieval, so extraction isn't a standalone step you bolt on but the front of a pipeline that's already connected.


How it works

Three primitives: Extract does the work, Resources hold the output, and Workers handle extraction that needs to run on a schedule.

Extract

Extract is the core: one API that turns any supported input into clean, structured text. You submit a job with a document and retrieve the results as chunked text with page numbers and timestamps. It handles PDFs and images including scans, audio and video with transcription, and web pages, and it's built for the large files real workloads involve. Production concerns are part of it rather than your problem: jobs retry on failure, webhooks notify you on completion, and thumbnails are generated along the way. The chunked, source-referenced output is shaped for exactly what comes next: indexing, retrieval, or feeding a model.
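To make "chunked text with page numbers and timestamps" concrete, here is a minimal Python sketch of how a consumer might model a chunk and turn it into a citation. The Chunk class and its field names (text, page, start_seconds) are assumptions made for illustration, not the documented response schema; check the retrieving results reference for the real shape.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Chunk:
    """One extracted chunk. Field names are hypothetical, not the real schema."""
    text: str
    page: Optional[int] = None             # set for paged documents such as PDFs
    start_seconds: Optional[float] = None  # set for audio and video


def cite(chunk: Chunk) -> str:
    """Return a short, human-readable source reference for a chunk."""
    if chunk.page is not None:
        return f"p. {chunk.page}"
    if chunk.start_seconds is not None:
        minutes, seconds = divmod(int(chunk.start_seconds), 60)
        return f"{minutes:02d}:{seconds:02d}"
    return "source"  # no locator available, e.g. a web page


chunks = [
    Chunk(text="Payment is due within 30 days.", page=4),
    Chunk(text="Let's move on to the budget.", start_seconds=754.2),
]
for chunk in chunks:
    print(f"[{cite(chunk)}] {chunk.text}")
```

Carrying the locator on every chunk is what lets a downstream answer point back to a page or a moment without a second lookup.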

Resources

Resources are where extracted content lives once it's processed. Storing output as Resources means it's immediately indexed for Deep Search, so extraction and search are one continuous flow rather than two systems you integrate. This is what makes extraction the front door to a RAG pipeline or a self-maintaining knowledge base rather than an isolated utility.

Workers

Workers handle extraction that needs to happen on a schedule or continuously rather than on a single submission. A Worker can process new documents as they arrive, re-extract sources that have been updated, and keep a corpus current without anyone triggering each run. For pipelines where documents keep coming (invoices arriving, sources changing), this is what turns extraction from a one-time batch into a standing process. The autonomous-maintenance angle is covered in depth in self-maintaining knowledge bases.
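As a rough illustration of the bookkeeping a scheduled run involves, the sketch below decides which sources are new or have changed since the previous run by comparing content fingerprints. This is not how Workers are implemented, and the function names are made up for the example; it only shows the kind of logic you would otherwise write and operate yourself.

```python
import hashlib
from typing import Dict, List


def fingerprint(content: bytes) -> str:
    """Stable fingerprint of a source's bytes."""
    return hashlib.sha256(content).hexdigest()


def needs_extraction(sources: Dict[str, bytes], seen: Dict[str, str]) -> List[str]:
    """Return names of sources that are new or changed since the last run.

    `seen` maps source name -> fingerprint from the previous run and is
    updated in place, so the next run starts from the current state.
    """
    pending = []
    for name, content in sorted(sources.items()):
        digest = fingerprint(content)
        if seen.get(name) != digest:
            pending.append(name)
            seen[name] = digest
    return pending


seen: Dict[str, str] = {}
first = needs_extraction({"invoice-1.pdf": b"v1", "terms.html": b"v1"}, seen)
second = needs_extraction({"invoice-1.pdf": b"v1", "terms.html": b"v2"}, seen)
print(first)   # both sources are new on the first run
print(second)  # only the changed source on the second run
```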


Example architecture

The flow is a front end you submit to and a back end that drops results into your pipeline.

Submit. Send documents to Extract, one at a time or in volume. For large batches, submit them in parallel and let the service handle the throughput.

Get notified. Rather than polling, register a webhook so you're told when each job completes. Failed jobs are retried, so you're not building that yourself.

Store and index. Write the results as Resources, where they're indexed for Deep Search and available to feed retrieval or memory.

Automate the recurring case. If documents arrive continuously or sources change, set up a Worker to process them on a schedule so the corpus stays current on its own.

Documents go in through one endpoint, process in parallel with retries and webhooks handled, and land as searchable Resources. Workers cover the recurring case. The extraction infrastructure you'd have built is the service.
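The submit step can be sketched in Python as follows. The submit_job callable is a placeholder for whatever call creates an extraction job; it is passed in as a parameter so the sketch makes no claim about the real endpoint, client library, or job id format. fake_submit exists only so the example runs without a network.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def submit_all(
    paths: Iterable[str],
    submit_job: Callable[[str], str],
    max_workers: int = 8,
) -> Dict[str, str]:
    """Submit every path concurrently and return {path: job_id}.

    `submit_job` is a hypothetical stand-in for the real submission call.
    """
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so ids line up with paths
        job_ids = list(pool.map(submit_job, paths))
    return dict(zip(paths, job_ids))


def fake_submit(path: str) -> str:
    """Stand-in for a real submission call; returns a made-up job id."""
    return f"job-{path}"


jobs = submit_all(["a.pdf", "b.mp4", "c.png"], fake_submit)
print(jobs)
```

Keeping the path-to-job mapping on your side is a design choice: when a completion notification arrives, it lets you tie the job back to the document that produced it.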


What compounds over time

The value of treating extraction as infrastructure shows up most clearly as volume and variety grow.

Every document processed becomes part of a searchable corpus that stays useful indefinitely, so the cost of extracting a source is paid once and drawn on forever. As the range of formats you handle widens, the service absorbs that range; you don't add and maintain a parser for each. With Workers keeping things current, the corpus grows and stays fresh at the same time, without a corresponding growth in operational work.

The do-it-yourself path runs the other way. A parser-per-format setup gets more brittle as formats multiply, the queueing and retry machinery needs more attention as volume climbs, and the whole thing demands ongoing engineering that scales with your throughput rather than staying flat. Extraction as a managed API means the hard operational parts (parallelism, retries, notifications, format coverage) stay the service's problem while your throughput grows.


Who's building this

Anyone whose product depends on getting text out of documents at volume: teams feeding RAG pipelines, building search over large corpora, processing contracts, invoices, or CVs, or handling audio and video libraries. Wherever extraction is the unglamorous step in front of the real work, this is the layer that removes it.

For the autonomous, keep-it-current side, self-maintaining knowledge bases is the close neighbour. For a concrete build, the simple document extraction tool example walks through the submit-and-retrieve loop end to end, and the document diff viewer example builds on extraction output.


Get started

Start with the getting started guide, then read about extraction, submitting jobs, and retrieving results for the core flow, plus extraction webhooks for completion notifications. There's a free tier to build against.


FAQs

What formats can I submit?

PDFs and images including scans, audio and video with transcription, and web pages, all through one API. You don't maintain a separate parser per format or branch on whether a PDF is scanned or digital.


What does the output look like?

Structured text chunks with page numbers for documents and timestamps for audio and video, so a downstream answer can cite the exact page or moment a passage came from. The chunking is shaped for indexing, retrieval, and feeding a model.


How does it handle thousands of documents?

Submit them in parallel and the service handles the throughput. Failed jobs are retried rather than dropped, and webhooks notify you on completion, so you don't build queueing, retry logic, or polling yourself.


How do I know when a job finishes?

Register a webhook and you're notified on completion, including for long-running jobs like video transcription, so there's no need to poll for status.
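On the receiving side, a handler for completion notifications might look like the Python sketch below. The payload fields job_id and status are assumptions for illustration; the extraction webhooks reference defines the real payload. The handler ignores events for jobs it is not waiting on, which also makes a repeated delivery harmless.

```python
import json
from typing import Dict, Set


def handle_webhook(body: bytes, pending: Set[str], done: Dict[str, str]) -> bool:
    """Record a completion event; return True if it changed state.

    The payload fields `job_id` and `status` are hypothetical.
    """
    event = json.loads(body)
    job_id = event["job_id"]
    if job_id not in pending:
        return False  # unknown job or a repeated delivery; ignore it
    pending.remove(job_id)
    done[job_id] = event["status"]
    return True


pending = {"job-1", "job-2"}
done: Dict[str, str] = {}
payload = json.dumps({"job_id": "job-1", "status": "completed"}).encode()
print(handle_webhook(payload, pending, done))  # first delivery is recorded
print(handle_webhook(payload, pending, done))  # repeated delivery is ignored
```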


Can extracted content go straight into search?

Yes. Store the output as Resources and it's indexed for Deep Search automatically, so extraction is the front of a connected pipeline rather than a separate step. This is the usual setup for a RAG pipeline.


What if documents keep arriving or sources change?

Use a Worker to process new or updated documents on a schedule, so the corpus stays current without manual runs. The self-maintaining knowledge bases use case covers this pattern.

Ship your first app in minutes.
