PDF to JSON API
Submit any PDF. Get back structured metadata, text content, thumbnails, and document properties as JSON. Digital and scanned.
What is Exabase's PDF to JSON API?
Submit a PDF file or URL and get structured JSON back. The Exabase Extract API returns the document's metadata, text content, and a CDN-hosted thumbnail and rendered version, all in one response. It handles both digital and scanned PDFs.
What you get back
For a PDF, the response includes page count, author, title, creation date, MIME type, file size, a thumbnail of the first page, and a rendered version of the document. Text content is split into searchable chunks you can retrieve page by page. The API reference has the complete response schema.
The same POST /v2/extract endpoint works for images (with OCR), audio and video (with transcripts and timestamps), and web pages. You learn one API and use it for everything. The Extract docs cover each content type.
What you can build with it
The structured chunks this tool returns are the same format used across the Exabase platform. You can feed them into a RAG pipeline for retrieval-augmented generation, run contract analysis by searching chunks for clauses and dates, or process invoices by pulling out line items and totals. They also work well for compliance audit workflows where every document needs to be indexed.
Store the extracted content as Resources in a Base and it becomes searchable through Deep Search at the paragraph level. Extract key facts into Memory so your agent retains what it read between sessions. Workers can re-extract updated documents on a schedule to keep everything current.
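As a rough illustration of searching chunks for clauses and dates, the sketch below scans chunk text with a regular expression and reports the page each match came from. The pageNumber field is named on this page; the text field name is an assumption, so check it against the response schema.

```typescript
// Sketch only: `text` is an assumed field name for a chunk's content.
type Chunk = { pageNumber: number; text: string };
type Hit = { pageNumber: number; match: string };

// Return every match of `pattern` across all chunks, tagged with its page.
function findInChunks(chunks: Chunk[], pattern: RegExp): Hit[] {
  // Make sure the global flag is set so exec() walks through all matches.
  const flags = pattern.flags.includes("g") ? pattern.flags : pattern.flags + "g";
  const re = new RegExp(pattern.source, flags);
  const hits: Hit[] = [];
  for (const chunk of chunks) {
    re.lastIndex = 0;
    let m: RegExpExecArray | null;
    while ((m = re.exec(chunk.text)) !== null) {
      hits.push({ pageNumber: chunk.pageNumber, match: m[0] });
      if (m[0] === "") re.lastIndex++; // guard against empty-match loops
    }
  }
  return hits;
}

// Example pattern: ISO-style dates such as 2024-01-15.
const ISO_DATE = /\b\d{4}-\d{2}-\d{2}\b/;
```

Calling findInChunks(chunks, ISO_DATE) on a contract's chunks would list each date alongside its page number, which is usually enough to jump back to the source clause.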
Use the API
Submit jobs with the SDK or REST API. Processing is asynchronous. You submit the job and either poll for completion or set up a webhook to get notified automatically. For most PDFs, processing completes in a few seconds.
For working implementations, see Build a Simple Document Extraction Tool or Build a Document Diff Viewer. API keys are available on the free plan.
How do I use it?
Get your API key
Free, no credit card:
Sign up at exabase.io → copy your API key from the dashboard
Submit a PDF
Submit the job with the SDK (Node.js) or with a POST to /v2/extract, passing your key in the X-Api-Key header.
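The submit step can be sketched over the REST API with a fetch-style function. The endpoint path, the X-Api-Key header, and the url body field are taken from this page. The base URL and the id field read from the response are assumptions, so check them against the API reference.

```typescript
// Sketch only: `baseUrl` is a placeholder you supply, and the `id` field on
// the response is an assumed name. Path, header, and `url` field are from this page.
type HttpRequest = {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
};

// Minimal fetch-compatible signature, so the built-in fetch or a stub fits.
type Fetcher = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Build the POST /v2/extract request for a publicly accessible PDF URL.
function buildExtractRequest(baseUrl: string, apiKey: string, pdfUrl: string): HttpRequest {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/v2/extract`,
    method: "POST",
    headers: { "X-Api-Key": apiKey, "Content-Type": "application/json" },
    body: JSON.stringify({ url: pdfUrl }),
  };
}

// Send the request and return the job id for later polling.
async function submitPdf(fetcher: Fetcher, req: HttpRequest): Promise<string> {
  const res = await fetcher(req.url, {
    method: req.method,
    headers: req.headers,
    body: req.body,
  });
  if (!res.ok) throw new Error(`Extract request failed with HTTP ${res.status}`);
  const job = (await res.json()) as { id?: unknown }; // assumed field name
  if (typeof job.id !== "string") throw new Error("Response did not contain a job id");
  return job.id;
}
```

In Node.js 18 or later, the built-in fetch can be passed as the fetcher: await submitPdf(fetch, buildExtractRequest(baseUrl, apiKey, pdfUrl)).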
Get the result
Poll GET /v2/extract/{jobId} until state reaches completed. A state of failed means the job did not finish.
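One way to write the polling loop is sketched below. The path, the X-Api-Key header, and the completed and failed states come from this page. The base URL, the polling interval, and the attempt limit are assumptions chosen for illustration.

```typescript
// Sketch only: `baseUrl` is a placeholder you supply; interval and attempt
// limit are illustrative defaults, not documented values.
type Job = { state: string; [key: string]: unknown };

// Minimal fetch-compatible signature, so the built-in fetch or a stub fits.
type GetJson = (
  url: string,
  init: { headers: Record<string, string> },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Poll GET /v2/extract/{jobId} until the job completes, fails, or we give up.
async function pollJob(
  fetcher: GetJson,
  baseUrl: string,
  apiKey: string,
  jobId: string,
  intervalMs = 2000,
  maxAttempts = 30,
): Promise<Job> {
  const url = `${baseUrl.replace(/\/+$/, "")}/v2/extract/${encodeURIComponent(jobId)}`;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await fetcher(url, { headers: { "X-Api-Key": apiKey } });
    if (!res.ok) throw new Error(`Polling failed with HTTP ${res.status}`);
    const job = (await res.json()) as Job;
    if (job.state === "completed") return job;
    if (job.state === "failed") throw new Error(`Job ${jobId} failed`);
    await sleep(intervalMs); // any other state: wait, then ask again
  }
  throw new Error(`Job ${jobId} did not finish after ${maxAttempts} attempts`);
}
```

Since most PDFs finish in a few seconds, a short fixed interval is usually enough. For long-running jobs a webhook avoids the repeated requests altogether.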
API quick reference
| Endpoint | Method | Description |
| --- | --- | --- |
| /v2/extract | POST | Submit extraction job |
| /v2/extract | GET | List extraction jobs |
| /v2/extract/{jobId} | GET | Poll for result |
| /v2/extract-settings | PUT | Configure webhook |
Auth: X-Api-Key header on all requests.
File retention: Stored files are retained for 1 day from job creation. Download or copy anything you need before they expire.
FAQs
What is Exabase?
Exabase is infrastructure for AI agents. It gives your agents memory, versioned file storage, AI deep search, and context automation through a set of APIs. Store what your agent learns, search inside any content type, and keep knowledge bases current automatically. Built for production use.
Who uses Exabase?
Developers and teams building AI agents, copilots, and RAG applications. If your agent needs to remember things between sessions, store and retrieve files, search across documents and media, or stay up to date without manual maintenance, Exabase handles that infrastructure so you can focus on your product.
How long does extraction take?
Most PDFs complete in a few seconds. Very large or complex documents may take longer. Processing is asynchronous, so your application isn't blocked while it runs.
Do I have to poll for results?
No. You can configure a webhook URL and Exabase will POST the full result to your server when processing completes. Delivery is retried with exponential backoff if your server doesn't respond with a 2xx.
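Because delivery is retried until your server answers with a 2xx, the same result may plausibly arrive more than once, for example when a response is lost in transit. The sketch below shows one way to make a receiver tolerate that. The payload shape (a JSON body with id and state fields) is an assumption, not the documented schema.

```typescript
// Sketch only: the `id` and `state` payload fields are assumed names.
type Delivery = { id: string; state: string };

// Returns a handler that maps a raw request body to the HTTP status to send.
function createWebhookHandler(
  onResult: (delivery: Delivery) => void,
): (rawBody: string) => number {
  const seen = new Set<string>(); // in-memory only; use a shared store in production
  return (rawBody) => {
    let parsed: unknown;
    try {
      parsed = JSON.parse(rawBody);
    } catch {
      return 400; // not JSON
    }
    if (typeof parsed !== "object" || parsed === null) return 400;
    const { id, state } = parsed as { id?: unknown; state?: unknown };
    if (typeof id !== "string" || typeof state !== "string") return 400;
    const key = `${id}:${state}`;
    if (!seen.has(key)) {
      seen.add(key);
      onResult({ id, state }); // first time we see this result
    }
    return 200; // acknowledge duplicates too, so retries stop
  };
}
```

The handler is independent of any web framework: pass it the request body and send back the status code it returns.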
Does it handle scanned PDFs?
Yes. OCR is built in. Submit a scanned PDF the same way you'd submit a digital one. The API detects the type and processes it automatically.
What happens if extraction fails?
The job state moves to failed. You can call the reprocess endpoint to retry without re-uploading the original file.
How long are files retained?
Stored files are retained for 1 day from job creation, then permanently deleted. Download or copy any files you need before they expire.
Can I submit a URL instead of uploading a file?
Yes. Pass a url field in the JSON request body pointing to a publicly accessible PDF. Exabase fetches and processes it.
Can I extract specific pages only?
You can fetch specific chunk ranges using the start and end parameters on the chunks endpoint. Each chunk includes a pageNumber field so you can filter by page on your end.
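Filtering by page on your end can be as small as the sketch below. It assumes you have already fetched the chunks into an array. pageNumber is the field named above, while the text field name is an assumption.

```typescript
// Sketch only: `text` is an assumed field name for a chunk's content.
type Chunk = { pageNumber: number; text: string };

// Keep only the chunks whose page falls within [firstPage, lastPage], inclusive.
function chunksForPages(chunks: Chunk[], firstPage: number, lastPage: number): Chunk[] {
  return chunks.filter((c) => c.pageNumber >= firstPage && c.pageNumber <= lastPage);
}

// Join the text of a page range back into a single string, in original order.
function textForPages(chunks: Chunk[], firstPage: number, lastPage: number): string {
  return chunksForPages(chunks, firstPage, lastPage)
    .map((c) => c.text)
    .join("\n");
}
```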
Does it work with password-protected PDFs?
No. Password protection must be removed before you submit the file.
What other file types does Extract support?
The same API handles images (with OCR), audio (with transcription), video (with transcription), and web pages (with metadata extraction). The response structure adapts to the content type.
What's the difference between this and the PDF thumbnail generator?
The thumbnail generator returns a preview image of a PDF page. Extract returns the actual content: metadata, text chunks, and document properties. One shows what the page looks like. The other tells you what it says. See PDF Thumbnail Generator.
Can I use the extracted content with other Exabase features?
Yes. Store extracted content as a Resource in Exabase and it becomes searchable through Deep Search. Workers can re-extract updated documents on a schedule. Extract is one entry point into the full platform.
Is there an SDK?
Yes. The @exabase/sdk package for Node.js/TypeScript handles file streaming, job polling, and chunk retrieval. Install with npm install @exabase/sdk.