
PDF to JSON API

Submit any PDF and get back structured metadata, text content, thumbnails, and document properties as JSON. Works with both digital and scanned PDFs.

Production-ready

Private by design

Security-first

Scalable

Turn PDFs into structured JSON

Upload a PDF or submit a PDF URL, get structured JSON back.

{
  "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "kind": "document",
  "name": "Q1 Report",
  "state": "completed",
  "extraction": {
    "common": {
      "mimeType": "application/pdf",
      "size": 204800,
      "thumbnail": "https://cdn.exabase.io/...",
      "chunkCount": 42
    },
    "document": {
      "pages": 18,
      "author": "Jane Smith",
      "title": "Q1 Report",
      "creationDate": "2025-03-15T08:00:00.000Z",
      "pdfRender": { "url": "https://cdn.exabase.io/..." }
    }
  }
}

What is Exabase's PDF to JSON API?

Submit a PDF file or URL and get structured JSON back. The Exabase Extract API returns the document's metadata, text content, and a CDN-hosted thumbnail and rendered version, all in one response. It handles both digital and scanned PDFs.

What you get back

For a PDF, the response includes page count, author, title, creation date, MIME type, file size, a thumbnail of the first page, and a rendered version of the document. Text content is split into searchable chunks you can retrieve page by page. The API reference has the complete response schema.
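
The sample response shown earlier on this page can be described with a TypeScript type. This is a sketch inferred only from that example payload, not the official schema: which fields are optional is an assumption, the helper name `summarizePdfJob` is made up for illustration, and the API reference remains the source of truth.

```typescript
// Shape inferred from the sample response on this page (an assumption,
// not the official schema). Extraction fields are optional because a job
// that has not completed may not carry extraction data yet.
interface ExtractionJob {
  id: string;
  kind: string;
  name: string;
  state: string; // this page mentions "pending", "completed", and "failed"
  extraction?: {
    common?: {
      mimeType?: string;
      size?: number; // presumably bytes
      thumbnail?: string; // CDN URL
      chunkCount?: number;
    };
    document?: {
      pages?: number;
      author?: string;
      title?: string;
      creationDate?: string; // ISO 8601 timestamp
      pdfRender?: { url: string };
    };
  };
}

// Hypothetical helper: one-line summary of a completed PDF job,
// or null when the job is not in the "completed" state.
function summarizePdfJob(job: ExtractionJob): string | null {
  if (job.state !== "completed") return null;
  const doc = job.extraction?.document;
  const common = job.extraction?.common;
  const title = doc?.title ?? job.name;
  const pages = doc?.pages ?? 0;
  const chunks = common?.chunkCount ?? 0;
  return `${title}: ${pages} pages, ${chunks} chunks`;
}
```

Typing the response this way lets the compiler flag missing optional-chaining on fields that may be absent before the job completes.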

The same POST /v2/extract endpoint works for images (with OCR), audio and video (with transcripts and timestamps), and web pages. You learn one API and use it for everything. The Extract docs cover each content type.

What you can build with it

The structured chunks this tool returns are the same format used across the Exabase platform. You can feed them into a RAG pipeline for retrieval-augmented generation, run contract analysis by searching chunks for clauses and dates, or process invoices by pulling out line items and totals. They also work well for compliance audit workflows where every document needs to be indexed.

Store the extracted content as Resources in a Base and it becomes searchable through Deep Search at the paragraph level. Extract key facts into Memory so your agent retains what it read between sessions. Workers can re-extract updated documents on a schedule to keep everything current.

Use the API

Submit jobs with the SDK or REST API. Processing is asynchronous. You submit the job and either poll for completion or set up a webhook to get notified automatically. For most PDFs, processing completes in a few seconds.

For working implementations, see Build a Simple Document Extraction Tool or Build a Document Diff Viewer. API keys are available on the free plan.

How do I use it?

  1. Get your API key

Free, no credit card:

Sign up at exabase.io → copy your API key from the dashboard

  2. Submit a PDF

Using the SDK (Node.js):

import { Exabase } from "@exabase/sdk";
import { createReadStream } from "fs";

const api = new Exabase({
  apiKey: process.env.EXABASE_API_KEY,
});

const job = await api.extract.createFromFile({
  file: createReadStream("./report.pdf"),
  name: "Q1 Report",
});

console.log(job.id);    // extraction job id
console.log(job.state); // "pending"

  3. Get the result

Poll the job until state reaches completed:

const job = await api.extract.get({ jobId: "job-id" });

if (job.state === "completed") {
  console.log(job.extraction?.common?.mimeType);   // "application/pdf"
  console.log(job.extraction?.document?.pages);     // 18
  console.log(job.extraction?.document?.author);    // "Jane Smith"
  console.log(job.extraction?.common?.thumbnail);   // CDN thumbnail URL
  console.log(job.extraction?.common?.chunkCount);  // 42
}
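
The single check above can be wrapped in a loop. The following is a minimal polling sketch, not part of the SDK: `pollUntilDone` and its options are hypothetical names, the delays are arbitrary defaults, and any state other than "completed" or "failed" is assumed to mean the job is still in progress.

```typescript
// Minimal job shape for polling; only `state` is needed here.
interface JobLike {
  state: string;
}

// Poll `fetchJob` until the job reaches "completed" or "failed".
// `fetchJob` is whatever retrieves the job, for example
//   () => api.extract.get({ jobId })
// The delay doubles after each attempt, capped at `maxDelayMs`.
async function pollUntilDone<T extends JobLike>(
  fetchJob: () => Promise<T>,
  opts: { initialDelayMs?: number; maxDelayMs?: number; maxAttempts?: number } = {},
): Promise<T> {
  const { initialDelayMs = 500, maxDelayMs = 5000, maxAttempts = 20 } = opts;
  let delay = initialDelayMs;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const job = await fetchJob();
    if (job.state === "completed" || job.state === "failed") return job;
    if (attempt < maxAttempts) {
      // Wait before the next attempt, then back off.
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
      delay = Math.min(delay * 2, maxDelayMs);
    }
  }
  throw new Error(`Job not finished after ${maxAttempts} attempts`);
}
```

Passing the fetch step in as a function keeps the loop independent of the SDK, so it can be exercised with a stub. The caller still needs to check whether the returned job is "completed" or "failed".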

API quick reference

Endpoint               Method   Description
/v2/extract            POST     Submit extraction job
/v2/extract            GET      List extraction jobs
/v2/extract/{jobId}    GET      Poll for result
/v2/extract-settings   PUT      Configure webhook

Auth: X-Api-Key header on all requests.

File retention: Stored files are retained for 1 day from job creation. Download or copy anything you need before they expire.
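
To plan downloads around the retention window, you can compute the expected expiry time yourself. This is a small sketch with assumptions: "1 day" is taken to mean exactly 24 hours, the job creation time is supplied by the caller (the sample response on this page does not show a creation timestamp for the job), and both function names are hypothetical.

```typescript
const RETENTION_MS = 24 * 60 * 60 * 1000; // "1 day", assumed to mean 24 hours

// Time at which stored files for a job are expected to expire.
function retentionDeadline(jobCreatedAt: Date): Date {
  return new Date(jobCreatedAt.getTime() + RETENTION_MS);
}

// True when fewer than `marginMs` milliseconds remain before expiry
// (default margin: one hour).
function expiresSoon(jobCreatedAt: Date, now: Date, marginMs = 60 * 60 * 1000): boolean {
  return retentionDeadline(jobCreatedAt).getTime() - now.getTime() < marginMs;
}
```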

Need file extraction or full page content?

Exabase offers a full set of tools to extract content from any media, document or website.

All your data extraction needs in one place.

FAQs

What is Exabase?

Exabase is infrastructure for AI agents. It gives your agents memory, versioned file storage, AI deep search, and context automation through a set of APIs. Store what your agent learns, search inside any content type, and keep knowledge bases current automatically. Built for production use.

Who uses Exabase?

Developers and teams building AI agents, copilots, and RAG applications. If your agent needs to remember things between sessions, store and retrieve files, search across documents and media, or stay up to date without manual maintenance, Exabase handles that infrastructure so you can focus on your product.

How long does extraction take?

Most PDFs complete in a few seconds. Very large or complex documents may take longer. Processing is asynchronous, so your application isn't blocked while it runs.

Do I have to poll for results?

No. You can configure a webhook URL and Exabase will POST the full result to your server when processing completes. Delivery is retried with exponential backoff if your server doesn't respond with a 2xx.
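
Because deliveries are retried, a webhook receiver should acknowledge quickly and tolerate duplicates. The sketch below shows only the decision logic, independent of any web framework. It assumes the POSTed body has the same `id` and `state` fields as the sample response on this page; `handleExtractWebhook` and `WebhookOutcome` are hypothetical names, and no request signing is shown because this page does not describe any.

```typescript
interface WebhookOutcome {
  status: number; // HTTP status code to send back
  jobId?: string;
  state?: string;
  duplicate?: boolean; // true when this job id was already handled
}

// Decide how to answer a webhook delivery. A 2xx status acknowledges it;
// per the text above, anything else leads to a retry.
function handleExtractWebhook(rawBody: string, seenJobIds: Set<string>): WebhookOutcome {
  let payload: unknown;
  try {
    payload = JSON.parse(rawBody);
  } catch {
    return { status: 400 }; // malformed JSON
  }
  if (typeof payload !== "object" || payload === null) return { status: 400 };

  // Assumed payload fields, taken from the sample response.
  const { id, state } = payload as { id?: unknown; state?: unknown };
  if (typeof id !== "string" || typeof state !== "string") return { status: 400 };

  // A retried delivery can arrive twice, so acknowledge repeats
  // without handling them again.
  const duplicate = seenJobIds.has(id);
  seenJobIds.add(id);
  return { status: 200, jobId: id, state, duplicate };
}
```

In a real server, `seenJobIds` would live in durable storage, not in memory, and the slow work would be queued after the 200 response is sent.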

Does it handle scanned PDFs?

Yes. OCR is built in. Submit a scanned PDF the same way you'd submit a digital one. The API detects the type and processes it automatically.

What happens if extraction fails?

The job state moves to failed. You can call the reprocess endpoint to retry without re-uploading the original file.

How long are files retained?

Stored files are retained for 1 day from job creation, then permanently deleted. Download or copy any files you need before they expire.

Can I submit a URL instead of uploading a file?

Yes. Pass a url field in the JSON request body pointing to a publicly accessible PDF. Exabase fetches and processes it.
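
A URL submission over REST might be assembled as follows. This is a sketch under stated assumptions: the API host is not given on this page, so `baseUrl` must be supplied by the caller; only the `url` body field and the `X-Api-Key` header come from the text; `buildExtractRequest` is a hypothetical helper.

```typescript
interface ExtractRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Build the arguments for fetch() to submit a PDF URL for extraction.
// `baseUrl` is an assumption supplied by the caller (not stated on this page).
function buildExtractRequest(baseUrl: string, apiKey: string, pdfUrl: string): ExtractRequest {
  if (!/^https?:\/\//i.test(pdfUrl)) {
    throw new Error("pdfUrl must be an absolute http(s) URL");
  }
  return {
    // Strip trailing slashes so the path is not doubled.
    url: `${baseUrl.replace(/\/+$/, "")}/v2/extract`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json", "X-Api-Key": apiKey },
      body: JSON.stringify({ url: pdfUrl }),
    },
  };
}
```

The result can be passed straight to fetch, for example `const req = buildExtractRequest(base, key, pdfUrl); await fetch(req.url, req.init);`.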

Can I extract specific pages only?

You can fetch specific chunk ranges using the start and end parameters on the chunks endpoint. Each chunk includes a pageNumber field so you can filter by page on your end.
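
Client-side page filtering could look like the sketch below. Only the `pageNumber` field is confirmed by the text above; the `text` field and the helper name `chunksForPages` are hypothetical, so adjust them to the actual chunk schema in the API reference.

```typescript
// Chunk shape assumed for illustration: `pageNumber` comes from the text
// above, `text` is a hypothetical field name.
interface Chunk {
  pageNumber: number;
  text: string;
}

// Keep only chunks whose page falls within [firstPage, lastPage], inclusive.
function chunksForPages(chunks: Chunk[], firstPage: number, lastPage: number): Chunk[] {
  return chunks.filter((c) => c.pageNumber >= firstPage && c.pageNumber <= lastPage);
}
```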

Does it work with password-protected PDFs?

No, password protection must be removed first.

What other file types does Extract support?

The same API handles images (with OCR), audio (with transcription), video (with transcription), and web pages (with metadata extraction). The response structure adapts to the content type.

What's the difference between this and the PDF thumbnail generator?

The thumbnail generator returns a preview image of a PDF page. Extract returns the actual content: metadata, text chunks, and document properties. One shows what the page looks like. The other tells you what it says. See PDF Thumbnail Generator.

Can I use the extracted content with other Exabase features?

Yes. Store extracted content as a Resource in Exabase and it becomes searchable through Deep Search. Workers can re-extract updated documents on a schedule. Extract is one entry point into the full platform.

Is there an SDK?

Yes. The @exabase/sdk package for Node.js/TypeScript handles file streaming, job polling, and chunk retrieval. Install with npm install @exabase/sdk.

Ship your first app in minutes.