About extraction

The Extraction API lets you submit files or URLs for automated processing and receive structured metadata back: text content, thumbnails, document properties, web page previews, media attributes, and more. Each submission becomes an *extraction job* that progresses through a pipeline and exposes its results once complete.

How it works

1. Submit a file or a URL with `POST /v2/extract`.
2. Exabase processes the item asynchronously. The `state` field on the job reflects the current progress.
3. Once the job reaches `completed`, the `extraction` object is populated with all available data.
4. You can poll for completion or configure a webhook to be notified automatically.

Job states

State	Meaning
`pending`	Queued, not yet picked up for processing
`processing`	Actively being processed
`completed`	Processing finished, extraction data available
`failed`	Processing failed; the job can be reprocessed
`cleaned`	Job and its stored files have been deleted
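
The states above suggest a simple polling loop: keep fetching the job until it leaves `pending` or `processing`. The following is a minimal sketch, not an official client. `wait_for_job` and its `fetch_job` argument are hypothetical names; `fetch_job` stands for whatever function you write to call `GET /v2/extract/{jobId}` and return the parsed JSON body. The interval and timeout defaults are arbitrary assumptions.

```python
import time
from typing import Any, Callable, Dict

# States after which the job will not change on its own (from the table above).
TERMINAL_STATES = {"completed", "failed", "cleaned"}


def wait_for_job(
    fetch_job: Callable[[], Dict[str, Any]],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll until the job reaches a terminal state or the timeout elapses.

    `fetch_job` is a caller-supplied (hypothetical) function that performs
    GET /v2/extract/{jobId} and returns the decoded JSON object.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_job()
        if job.get("state") in TERMINAL_STATES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still in state {job.get('state')!r}")
        sleep(interval_s)
```

Injecting `fetch_job` keeps the loop independent of any particular HTTP library and makes it easy to exercise with canned responses. A webhook, where available, avoids the polling altogether.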

Extraction data

The extraction object on a completed job is split into four sections. Not all sections are present for every kind of item.

Section	Available for	Key fields
`common`	All kinds	`mimeType`, `size`, `thumbnail`, `chunkCount`
`media`	Images, audio, video	`width`, `height`, `duration`, `ocr`, `transcript`
`document`	PDFs and documents	`pages`, `author`, `title`, `pdfRender`, `documentType`, `structured`
`web`	Bookmarks and web pages	`title`, `siteName`
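
Because a section may be missing for a given kind of item, it can help to read fields defensively. The helper below is a sketch under the assumption that the job JSON nests the sections under an `extraction` key, as the section names above suggest; `get_extraction_field` is a hypothetical name, and the sample values are illustrative only.

```python
from typing import Any, Dict, Optional


def get_extraction_field(job: Dict[str, Any], section: str, key: str) -> Optional[Any]:
    """Return job["extraction"][section][key], or None if any level is absent."""
    extraction = job.get("extraction") or {}
    section_data = extraction.get(section) or {}
    return section_data.get(key)


# Illustrative job payload; real responses may carry more fields.
sample_job = {
    "state": "completed",
    "extraction": {
        "common": {"mimeType": "application/pdf"},
        "document": {"title": "Quarterly report"},
    },
}

mime_type = get_extraction_field(sample_job, "common", "mimeType")
site_name = get_extraction_field(sample_job, "web", "siteName")  # section absent
```

Here `mime_type` is `"application/pdf"` and `site_name` is `None`, since the sample has no `web` section.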

Structured document data

When Exabase recognises the type of a document, it sets `extraction.document.documentType` to one of `invoice`, `resume`, `contract`, or `other`. For recognised types the pipeline also populates `extraction.document.structured` with type-specific fields extracted from the document content.

`documentType`	Example `structured` fields
`invoice`	`vendor`, `invoiceNumber`, `issueDate`, `dueDate`, `lineItems`, `total`
`resume`	`name`, `email`, `phone`, `summary`, `experience`, `education`, `skills`
`contract`	`parties`, `effectiveDate`, `expirationDate`, `governingLaw`, `clauses`

The exact shape of `structured` depends on the document content, so treat it as a free-form object and access only the keys you need.
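
One way to follow that advice is to branch on `documentType` and pick out individual keys with safe lookups. This is a sketch only: `invoice_summary` is a hypothetical helper, and the value types in the sample (for example, `total` as a string) are assumptions, since the source does not specify them.

```python
from typing import Any, Dict, Optional


def invoice_summary(job: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the vendor and total of an invoice job, or None for other documents.

    Keys missing from `structured` come back as None rather than raising.
    """
    document = (job.get("extraction") or {}).get("document") or {}
    if document.get("documentType") != "invoice":
        return None
    structured = document.get("structured") or {}
    return {
        "vendor": structured.get("vendor"),
        "total": structured.get("total"),
    }


# Illustrative payload; field values and types are assumed.
invoice_job = {
    "extraction": {
        "document": {
            "documentType": "invoice",
            "structured": {"vendor": "Acme Ltd", "invoiceNumber": "INV-001"},
        }
    }
}

summary = invoice_summary(invoice_job)
```

For this sample, `summary` holds the vendor and a `total` of `None`, because the sample omits that key.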

Output formats

`GET /v2/extract/{jobId}` supports a `format` query parameter that controls the response representation:

`format`	`Content-Type`	Description
`json`	`application/json`	Default. Returns the full `ExtractJob` JSON object.
`markdown`	`text/markdown`	Returns a human-readable Markdown document built from the extraction data. Structured document data is rendered as a nested property list under a heading named after the document type.
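
As a small illustration of the `format` parameter, the sketch below builds the request URL for a job. The base URL `https://api.example.com` is a placeholder and `job_url` is a hypothetical helper; only the path and the two format values come from the table above.

```python
from urllib.parse import quote, urlencode

# Values documented in the table above.
ALLOWED_FORMATS = {"json", "markdown"}


def job_url(base_url: str, job_id: str, fmt: str = "json") -> str:
    """Build the GET /v2/extract/{jobId} URL with a validated `format` parameter."""
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    # Percent-encode the job ID so unusual characters cannot alter the path.
    path = f"/v2/extract/{quote(job_id, safe='')}"
    return f"{base_url.rstrip('/')}{path}?{urlencode({'format': fmt})}"


# Placeholder host and job ID, for illustration only.
markdown_url = job_url("https://api.example.com", "job_123", "markdown")
```

Since `json` is the default, the query parameter could be omitted entirely for JSON responses; the helper includes it anyway to keep the request explicit.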

File retention

Storage files attached to extraction jobs are retained for 1 day from job creation. After that, the job is marked `cleaned` and its associated files are permanently deleted. Download or copy any files you need to keep before the retention window expires.
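
To plan downloads around the retention window, you can compute the deadline from the job's creation time. The sketch below assumes you already have that creation time as a timezone-aware `datetime`; how the API exposes it (field name and format) is not stated here, so parsing is left out. `retention_deadline` and `time_left` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Retention period stated above: 1 day from job creation.
RETENTION = timedelta(days=1)


def retention_deadline(created_at: datetime) -> datetime:
    """Return the moment after which the job's files may be deleted."""
    return created_at + RETENTION


def time_left(created_at: datetime, now: Optional[datetime] = None) -> timedelta:
    """Return the remaining retention time, never less than zero."""
    current = now if now is not None else datetime.now(timezone.utc)
    return max(retention_deadline(created_at) - current, timedelta(0))


# Illustrative timestamps only.
created = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
checked = datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc)
remaining = time_left(created, now=checked)
```

With these sample timestamps, six hours of the window remain.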
