What is Extract?
Exabase Extract takes any file or URL and returns structured JSON: metadata, text chunks, and thumbnails. One endpoint handles PDFs, images, audio, video, and web pages.
Text is split into chunks with page numbers or timestamps. Processing is asynchronous. Poll for results or get notified via webhook.
Problem
Parsing is a full-time job.
Every source is different.
PDFs have tables that break across pages. Web pages hide content behind JavaScript. Audio needs transcription. Images need OCR.
You write a parser for each one, then have to maintain it forever.
Solution
One endpoint. Any source.
Send Exabase Extract a URL or a file.
It handles the parsing, the processing, and the edge cases.
You get structured JSON back with metadata, text chunks, and thumbnails. Ready for your pipeline or your agent.
Supported input types
All types go through the same POST /v2/extract endpoint. The response structure adapts to the content type.
PDFs and documents
Page count, author, title, creation date, rendered PDF, thumbnail, text chunks with page numbers.
Images
Width, height, OCR text, thumbnail.
Audio
Transcript with timestamps, duration.
Video
Transcript with timestamps, duration, dimensions.
Web pages and bookmarks
Title, site name, page content as text chunks.
What you get back
The fields below are grouped by category. The list is not exhaustive.
Extracted content
| Field | Type | Description |
|---|---|---|
| text | string | The extracted text |
| sequence | number | Chunk order (1, 2, 3...) |
| pageNumber | number | Source page (documents) |
| timeStart | number | Start timestamp in seconds (audio/video) |
| timeEnd | number | End timestamp in seconds (audio/video) |
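As a rough illustration of working with these chunk fields, the sketch below types a chunk and rebuilds the full text in reading order. The field names come from the list above; which location fields are present for a given content type, and the helper name `joinChunks`, are assumptions.

```typescript
// Chunk shape mirroring the "Extracted content" fields.
// Marking the location fields optional is an assumption.
interface Chunk {
  text: string;
  sequence: number;    // chunk order: 1, 2, 3...
  pageNumber?: number; // documents
  timeStart?: number;  // seconds, audio/video
  timeEnd?: number;    // seconds, audio/video
}

// Hypothetical helper: sort a copy by `sequence`, then join with blank lines.
function joinChunks(chunks: Chunk[]): string {
  return [...chunks]
    .sort((a, b) => a.sequence - b.sequence)
    .map((c) => c.text)
    .join("\n\n");
}

const sampleChunks: Chunk[] = [
  { text: "Second paragraph.", sequence: 2, pageNumber: 1 },
  { text: "First paragraph.", sequence: 1, pageNumber: 1 },
];

console.log(joinChunks(sampleChunks));
```

Sorting on `sequence` rather than trusting arrival order keeps the text stable even when chunks are fetched in separate pages.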
Document properties
| Field | Type | Description |
|---|---|---|
| pages | number | Page count |
| author | string | Document author |
| title | string | Document title |
| creationDate | string | Creation date (ISO 8601) |
| pdfRender.url | string | CDN URL for rendered version |
Media properties
| Field | Type | Description |
|---|---|---|
| width | number | Width in pixels |
| height | number | Height in pixels |
| duration | number | Duration in seconds (audio/video) |
| ocr | string | OCR text (images) |
| transcript | string | Transcript (audio/video) |
Web properties
| Field | Type | Description |
|---|---|---|
| title | string | Page title |
| siteName | string | Site name |
Common metadata
| Field | Type | Description |
|---|---|---|
| mimeType | string | Detected MIME type |
| size | number | File size in bytes |
| thumbnail | string | CDN-hosted thumbnail URL |
| chunkCount | number | Number of text chunks extracted |
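Because the response adapts to the content type, client code usually needs to branch on which fields are present. The sketch below is one way to do that. It assumes a flat object holding the fields listed above; the real response may nest them differently, and `ExtractResult` and `kindOf` are hypothetical names.

```typescript
// Flat shape assembled from the field lists above. Nesting is an assumption.
interface ExtractResult {
  mimeType: string;
  size: number;
  thumbnail?: string;
  chunkCount?: number;
  pages?: number;      // documents
  author?: string;     // documents
  title?: string;      // documents and web pages
  width?: number;      // images and video
  height?: number;     // images and video
  duration?: number;   // audio and video
  ocr?: string;        // images
  transcript?: string; // audio and video
  siteName?: string;   // web pages
}

type Kind = "document" | "image" | "audio" | "video" | "web" | "unknown";

// Hypothetical helper: infer the content kind from which fields are set.
function kindOf(r: ExtractResult): Kind {
  if (r.pages !== undefined) return "document";
  if (r.duration !== undefined) {
    // Video carries dimensions in addition to duration; audio does not.
    return r.width !== undefined ? "video" : "audio";
  }
  if (r.width !== undefined) return "image";
  if (r.siteName !== undefined) return "web";
  return "unknown";
}

console.log(kindOf({ mimeType: "application/pdf", size: 52340, pages: 12 }));
```

Branching on `mimeType` is an alternative if field presence turns out to be unreliable for a given source.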
How do I use it?
Submit a file or URL
Using the SDK:
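For readers calling the REST API directly rather than the SDK, a minimal sketch of the submit step might look like this. Only the `POST /v2/extract` path and the `url` body field come from this page. The base URL, the bearer-token header, and the `Send` and `submitUrl` names are placeholders and assumptions, so check the API reference before relying on them.

```typescript
// Placeholder host: NOT the real Exabase base URL.
const BASE_URL = "https://api.example.com";

interface HttpRequest {
  method: string;
  headers: Record<string, string>;
  body: string;
}

interface HttpResponse {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}

// Minimal fetch-like function type, so the transport can be swapped in tests.
type Send = (url: string, init: HttpRequest) => Promise<HttpResponse>;

// Build the request for a URL submission. Bearer auth is an assumption.
function buildExtractRequest(sourceUrl: string, apiKey: string): { url: string; init: HttpRequest } {
  return {
    url: `${BASE_URL}/v2/extract`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({ url: sourceUrl }),
    },
  };
}

// Send the request and return the parsed JSON body (shape not specified here).
async function submitUrl(send: Send, sourceUrl: string, apiKey: string): Promise<unknown> {
  const { url, init } = buildExtractRequest(sourceUrl, apiKey);
  const res = await send(url, init);
  if (!res.ok) throw new Error(`Extract request failed with status ${res.status}`);
  return res.json();
}
```

In Node 18+ or a browser, the global `fetch` can be passed as `send`. Injecting it keeps the function easy to stub.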
Get the result
Poll until state reaches completed, or use webhooks to skip polling entirely.
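The polling loop can be sketched roughly as follows. The `completed` and `failed` states are named on this page. The `Job` shape, the `getJob` callback (however you fetch job status), and the interval and attempt defaults are assumptions.

```typescript
// Hypothetical job shape: only the `state` values "completed" and "failed"
// are taken from this page.
interface Job {
  id: string;
  state: string;
}

type GetJob = (id: string) => Promise<Job>;
type Sleep = (ms: number) => Promise<void>;

interface PollOptions {
  intervalMs?: number;  // delay between polls
  maxAttempts?: number; // give up after this many polls
  sleep?: Sleep;        // injectable for tests
}

// Poll until the job completes, fails, or the attempt budget runs out.
async function waitForJob(getJob: GetJob, id: string, opts: PollOptions = {}): Promise<Job> {
  const intervalMs = opts.intervalMs ?? 2000;
  const maxAttempts = opts.maxAttempts ?? 30;
  const sleep: Sleep =
    opts.sleep ?? ((ms) => new Promise<void>((resolve) => setTimeout(resolve, ms)));

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const job = await getJob(id);
    if (job.state === "completed") return job;
    if (job.state === "failed") throw new Error(`Job ${id} failed`);
    // Any other state is treated as "still processing".
    if (attempt < maxAttempts) await sleep(intervalMs);
  }
  throw new Error(`Job ${id} did not finish after ${maxAttempts} polls`);
}
```

A bounded attempt count avoids polling forever if a job stalls. Webhooks remove the loop entirely.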
Read the text content
Extracted text is split into chunks. Fetch them in pages:
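A paging loop for chunks might look like the sketch below. How pages are addressed is not stated on this page, so the `FetchPage` callback and the `nextPage` cursor are hypothetical stand-ins for whatever the SDK or REST API actually exposes.

```typescript
interface Chunk {
  text: string;
  sequence: number;
}

// Hypothetical page shape: `nextPage` is null when there is nothing more.
interface ChunkPage {
  chunks: Chunk[];
  nextPage: number | null;
}

type FetchPage = (jobId: string, page: number) => Promise<ChunkPage>;

// Walk every page and collect the chunks in the order they arrive.
async function fetchAllChunks(fetchPage: FetchPage, jobId: string): Promise<Chunk[]> {
  const all: Chunk[] = [];
  let page: number | null = 1;
  while (page !== null) {
    const result: ChunkPage = await fetchPage(jobId, page);
    all.push(...result.chunks);
    page = result.nextPage;
  }
  return all;
}
```

For very large documents, processing each page as it arrives instead of collecting everything keeps memory use flat.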
Or skip polling with webhooks
Set up a webhook URL once. Exabase sends a POST request with the full result whenever a job completes. Failed deliveries are retried with exponential backoff.
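On the receiving side, a framework-free handler could look roughly like this. The payload fields `id` and `state` are assumptions about what the "full result" contains, and `createWebhookHandler` is a hypothetical name. Because failed deliveries are retried, the sketch ignores a job id it has already processed.

```typescript
// Assumed minimal payload: the real webhook body likely carries more fields.
interface WebhookPayload {
  id: string;
  state: string;
}

// Validate the raw body. Returns null for malformed or unexpected input.
function parsePayload(raw: string): WebhookPayload | null {
  try {
    const value: unknown = JSON.parse(raw);
    if (typeof value === "object" && value !== null) {
      const obj = value as Record<string, unknown>;
      const id = obj["id"];
      const state = obj["state"];
      if (typeof id === "string" && typeof state === "string") {
        return { id, state };
      }
    }
  } catch {
    // Malformed JSON falls through to null.
  }
  return null;
}

// Returns a handler mapping a raw request body to an HTTP status code.
function createWebhookHandler(onCompleted: (job: WebhookPayload) => void): (rawBody: string) => number {
  const seen = new Set<string>(); // in-memory only: use a durable store in production
  return (rawBody) => {
    const payload = parsePayload(rawBody);
    if (payload === null) return 400;
    if (seen.has(payload.id)) return 200; // duplicate delivery, already handled
    seen.add(payload.id);
    if (payload.state === "completed") onCompleted(payload);
    return 200;
  };
}
```

Wire the returned function into whatever HTTP server you use, and respond with its status code quickly so the delivery is not retried.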
Use cases
Why Exabase
Part of a full platform
FAQs
What is Exabase?
Exabase is infrastructure for AI agents. It gives your agents memory, versioned file storage, AI deep search, and context automation through a set of APIs. Store what your agent learns, search inside any content type, and keep knowledge bases current automatically. Built for production use.
Who uses Exabase?
Developers and teams building AI agents, copilots, and RAG applications. If your agent needs to remember things between sessions, store and retrieve files, search across documents and media, or stay up to date without manual maintenance, Exabase handles that infrastructure so you can focus on your product.
How long does extraction take?
Most files complete in a few seconds. Very large or complex documents may take longer. Processing is asynchronous, so your application isn't blocked while it runs.
What file types are supported?
PDFs, Word documents, images (JPEG, PNG, WebP, GIF), audio files, video files, and web pages/URLs. The API auto-detects the content type and processes it accordingly.
Does it handle scanned PDFs?
Yes. OCR is built in. Submit a scanned PDF the same way you'd submit a digital one. The API detects the type and processes it automatically.
Do I have to poll for results?
No. You can configure a webhook URL and Exabase will POST the full result to your server when processing completes. Delivery is retried with exponential backoff if your server doesn't respond with a 2xx.
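To give a feel for what exponential backoff means in practice, here is a generic sketch of a doubling retry schedule. The base delay, cap, and retry count are illustrative numbers only. Exabase's actual schedule is not stated on this page.

```typescript
// Delay before retry n (1-based) is baseMs * 2^(n-1), capped at maxMs.
// All numbers are illustrative, not Exabase's real configuration.
function backoffDelays(retries: number, baseMs: number = 1000, maxMs: number = 60000): number[] {
  const delays: number[] = [];
  let delay = baseMs;
  for (let i = 0; i < retries; i++) {
    delays.push(Math.min(delay, maxMs));
    delay *= 2;
  }
  return delays;
}

console.log(backoffDelays(5)); // [ 1000, 2000, 4000, 8000, 16000 ]
```

The practical takeaway for a receiver: a missed delivery may arrive much later than the job finished, and the same result can arrive more than once.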
What happens if extraction fails?
The job state moves to failed. You can call the reprocess endpoint to retry without re-uploading the original file.
How long are files retained?
Stored files are retained for 1 day from job creation, then permanently deleted. Download or copy any files you need before they expire.
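The retention window is simple arithmetic on the job's creation time. The sketch below assumes you have that time as an ISO 8601 string. The helper names are hypothetical, and the page does not say which field carries the timestamp.

```typescript
// 1 day in milliseconds, per the retention policy described above.
const RETENTION_MS = 24 * 60 * 60 * 1000;

// When stored files for a job created at `createdAt` are due for deletion.
function expiresAt(createdAt: string): Date {
  return new Date(new Date(createdAt).getTime() + RETENTION_MS);
}

// True once the retention window has passed.
function isExpired(createdAt: string, now: Date = new Date()): boolean {
  return now.getTime() >= expiresAt(createdAt).getTime();
}

console.log(expiresAt("2024-01-01T12:00:00Z").toISOString()); // 2024-01-02T12:00:00.000Z
```

Copying thumbnails and rendered PDFs to your own storage well before this deadline avoids broken links later.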
Can I submit a URL instead of uploading a file?
Yes. Pass a url field in the JSON request body pointing to a publicly accessible file or web page. Exabase fetches and processes it.
What do text chunks look like?
Each chunk contains the extracted text, a sequence number, and a location reference. For documents, that's the page number. For audio and video, that's the start and end timestamps in seconds.
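A small sketch of turning that location reference into a citation-style label. The field names match the chunk fields listed earlier on this page. The label format and the helper names are assumptions.

```typescript
interface Chunk {
  text: string;
  sequence: number;
  pageNumber?: number; // documents
  timeStart?: number;  // seconds, audio/video
  timeEnd?: number;    // seconds, audio/video
}

// Format whole seconds as m:ss. Fractions are dropped.
function formatSeconds(total: number): string {
  const whole = Math.floor(total);
  const minutes = Math.floor(whole / 60);
  const seconds = whole % 60;
  return `${minutes}:${seconds < 10 ? "0" : ""}${seconds}`;
}

// Prefer the page number, then the time range, then fall back to the sequence.
function describeLocation(chunk: Chunk): string {
  if (chunk.pageNumber !== undefined) return `page ${chunk.pageNumber}`;
  if (chunk.timeStart !== undefined && chunk.timeEnd !== undefined) {
    return `${formatSeconds(chunk.timeStart)}-${formatSeconds(chunk.timeEnd)}`;
  }
  return `chunk ${chunk.sequence}`;
}

console.log(describeLocation({ text: "Intro", sequence: 1, pageNumber: 3 }));             // page 3
console.log(describeLocation({ text: "Hello", sequence: 4, timeStart: 65, timeEnd: 92.5 })); // 1:05-1:32
```

Labels like these let an agent cite where in the source a passage came from.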
Does it work with web pages that use JavaScript?
Yes. Web pages are rendered with JavaScript before extraction. Content that loads dynamically is captured.
Can I use the extracted content with other Exabase features?
Yes. Store extracted content as a Resource in Exabase and it becomes searchable through Deep Search. Workers can re-extract updated documents on a schedule. Extract is one entry point into the full platform.
Is there an SDK?
Yes. The @exabase/sdk package for Node.js/TypeScript handles file streaming, job polling, and chunk retrieval. Install with npm install @exabase/sdk. Or call the REST API directly from any language.
What's the difference between Extract and Deep Search?
Extract processes files and returns structured data. Deep Search searches across content you've already stored. They work together: extract a document, store it as a Resource, and it becomes searchable through Deep Search.
What's the difference between Extract and the link preview API?
The link preview API returns metadata only (title, description, image, URL). Extract returns the full page content as structured text chunks. Link preview is a lightweight GET request. Extract is async processing for deeper extraction.