`PDF to JSON API`

Submit any PDF. Get back structured metadata, text content, thumbnails, and document properties as JSON. Digital and scanned.

Production-ready

Private by design

Security-first

Scalable

Turn PDFs into structured JSON

Upload a PDF or submit a PDF URL, get structured JSON back.

{
  "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "kind": "document",
  "name": "Q1 Report",
  "state": "completed",
  "extraction": {
    "common": {
      "mimeType": "application/pdf",
      "size": 204800,
      "thumbnail": "https://cdn.exabase.io/...",
      "chunkCount": 42
    },
    "document": {
      "pages": 18,
      "author": "Jane Smith",
      "title": "Q1 Report",
      "creationDate": "2025-03-15T08:00:00.000Z",
      "pdfRender": { "url": "https://cdn.exabase.io/..." }
    }
  }
}

Start for free

Read the docs

What is Exabase's PDF to JSON API?

Submit a PDF file or URL and get structured JSON back. The Exabase Extract API returns the document's metadata, text content, and a CDN-hosted thumbnail and rendered version, all in one response. It handles both digital and scanned PDFs.

What you get back

For a PDF, the response includes page count, author, title, creation date, MIME type, file size, a thumbnail of the first page, and a rendered version of the document. Text content is split into searchable chunks you can retrieve page by page. The API reference has the complete response schema.
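As a rough sketch, the sample response shown earlier can be described with TypeScript types. The type names below (`ExtractJob`, `CommonInfo`, `DocumentInfo`) and the `summarize` helper are illustrative assumptions, not SDK exports. Only the field names come from the sample JSON, and the API reference remains the source for the full schema.

```typescript
// Illustrative types based only on the sample response on this page.
// The type names are assumptions; the real schema may contain more fields.
interface CommonInfo {
  mimeType: string;
  size: number; // assumed to be the file size in bytes
  thumbnail: string;
  chunkCount: number;
}

interface DocumentInfo {
  pages: number;
  author?: string;
  title?: string;
  creationDate?: string; // ISO 8601 timestamp in the sample
  pdfRender?: { url: string };
}

interface ExtractJob {
  id: string;
  kind: string;
  name: string;
  state: string;
  extraction?: { common?: CommonInfo; document?: DocumentInfo };
}

// Build a one-line summary from a parsed job, tolerating missing fields.
function summarize(job: ExtractJob): string {
  const doc = job.extraction?.document;
  const title = doc?.title ?? job.name;
  const pages = doc?.pages ?? 0;
  const author = doc?.author ?? "unknown author";
  return `${title}: ${pages} pages, by ${author}`;
}

// The sample response from this page, with the elided CDN URL kept as-is.
const sample: ExtractJob = JSON.parse(`{
  "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "kind": "document",
  "name": "Q1 Report",
  "state": "completed",
  "extraction": {
    "common": {
      "mimeType": "application/pdf",
      "size": 204800,
      "thumbnail": "https://cdn.exabase.io/...",
      "chunkCount": 42
    },
    "document": {
      "pages": 18,
      "author": "Jane Smith",
      "title": "Q1 Report",
      "creationDate": "2025-03-15T08:00:00.000Z"
    }
  }
}`);

console.log(summarize(sample)); // Q1 Report: 18 pages, by Jane Smith
```

Marking the nested fields optional mirrors the optional chaining used in the SDK example on this page, where `job.extraction` may not be populated until the job completes.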

The same POST /v2/extract endpoint works for images (with OCR), audio and video (with transcripts and timestamps), and web pages. You learn one API and use it for everything. The Extract docs cover each content type.

What you can build with it

The structured chunks this tool returns are the same format used across the Exabase platform. You can feed them into a RAG pipeline for retrieval-augmented generation, run contract analysis by searching chunks for clauses and dates, or process invoices by pulling out line items and totals. They also work well for compliance audit workflows where every document needs to be indexed.

Store the extracted content as Resources in a Base and it becomes searchable through Deep Search at the paragraph level. Extract key facts into Memory so your agent retains what it read between sessions. Workers can re-extract updated documents on a schedule to keep everything current.

Use the API

Submit jobs with the SDK or REST API. Processing is asynchronous. You submit the job and either poll for completion or set up a webhook to get notified automatically. For most PDFs, processing completes in a few seconds.

For working implementations, see Build a Simple Document Extraction Tool or Build a Document Diff Viewer. API keys are available on the free plan.

How do I use it?

Get your API key

Free, no credit card required.

Submit a PDF

Using the SDK (Node.js):

import { Exabase } from "@exabase/sdk";
import { createReadStream } from "fs";

const api = new Exabase({
  apiKey: process.env.EXABASE_API_KEY,
});

const job = await api.extract.createFromFile({
  file: createReadStream("./report.pdf"),
  name: "Q1 Report",
});

console.log(job.id);    // extraction job id
console.log(job.state); // "pending"

Get the result

Fetch the job and check its state. Repeat until state reaches completed (or failed):

const job = await api.extract.get({ jobId: "job-id" });

if (job.state === "completed") {
  console.log(job.extraction?.common?.mimeType);   // "application/pdf"
  console.log(job.extraction?.document?.pages);     // 18
  console.log(job.extraction?.document?.author);    // "Jane Smith"
  console.log(job.extraction?.common?.thumbnail);   // CDN thumbnail URL
  console.log(job.extraction?.common?.chunkCount);  // 42
}
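The snippet above fetches the job once. A polling loop around it might look like the sketch below. `waitForJob` is a hypothetical helper, not part of the SDK, and the interval and timeout defaults are arbitrary. The fetch function is injected so the helper does not depend on any particular client.

```typescript
// Minimal job shape for polling; only `state` is needed here.
type JobLike = { state: string };

// Poll `fetchJob` until the job is "completed" or "failed",
// or throw once the timeout would be exceeded.
async function waitForJob<T extends JobLike>(
  fetchJob: () => Promise<T>,
  intervalMs = 1000,
  timeoutMs = 60000,
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const job = await fetchJob();
    if (job.state === "completed" || job.state === "failed") return job;
    // Give up if the next wait would run past the deadline.
    if (Date.now() + intervalMs > deadline) {
      throw new Error(`Job still "${job.state}" after ${timeoutMs} ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

With the SDK client from the earlier example, usage would be along the lines of `const done = await waitForJob(() => api.extract.get({ jobId: job.id }));`. For high volumes, a webhook avoids the repeated requests entirely.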

API quick reference

| Endpoint | Method | Description |
| --- | --- | --- |
| /v2/extract | POST | Submit extraction job |
| /v2/extract | GET | List extraction jobs |
| /v2/extract/{jobId} | GET | Poll for result |
| /v2/extract-settings | PUT | Configure webhook |

Auth: X-Api-Key header on all requests.
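For raw HTTP, a request to submit a PDF by URL could be assembled as in this sketch. The host is a placeholder because this page does not state the API base URL, the `Content-Type` header is an assumption based on the JSON body, and `buildExtractRequest` is a hypothetical helper. The `X-Api-Key` header, the `/v2/extract` path, and the `url` body field come from this page.

```typescript
// Shape of a request that could be passed to fetch(url, init).
interface ExtractRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Build a POST /v2/extract request that submits a PDF by URL.
// `baseUrl` is supplied by the caller; the real host is not given on this page.
function buildExtractRequest(baseUrl: string, apiKey: string, pdfUrl: string): ExtractRequest {
  return {
    // Strip trailing slashes so the path is joined exactly once.
    url: `${baseUrl.replace(/\/+$/, "")}/v2/extract`,
    init: {
      method: "POST",
      headers: { "X-Api-Key": apiKey, "Content-Type": "application/json" },
      body: JSON.stringify({ url: pdfUrl }),
    },
  };
}

const req = buildExtractRequest(
  "https://api.example.invalid/", // placeholder host
  "my-key",
  "https://example.com/report.pdf",
);
console.log(req.url); // https://api.example.invalid/v2/extract
```

Building the request separately from sending it keeps the helper easy to test. The result can be handed to `fetch(req.url, req.init)` in any runtime that provides `fetch`.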

File retention: Stored files are retained for 1 day from job creation. Download or copy anything you need before they expire.
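To track the retention window, the expiry time can be derived from the job's creation time. This is a minimal sketch: `expiresAt` and `isRetained` are hypothetical helpers, and the caller is assumed to have recorded when the job was created, since the sample response on this page does not show such a field.

```typescript
// 1 day in milliseconds, per the retention policy stated above.
const RETENTION_MS = 24 * 60 * 60 * 1000;

// Return the moment a job's stored files expire, given its creation time
// as an ISO 8601 string.
function expiresAt(createdAt: string): Date {
  const created = new Date(createdAt);
  if (isNaN(created.getTime())) throw new Error("Invalid creation time");
  return new Date(created.getTime() + RETENTION_MS);
}

// True while the stored files should still be available.
function isRetained(createdAt: string, now: Date = new Date()): boolean {
  return now.getTime() < expiresAt(createdAt).getTime();
}

console.log(expiresAt("2025-03-15T08:00:00.000Z").toISOString()); // 2025-03-16T08:00:00.000Z
```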

Need file extraction or full page content?

Exabase offers a full set of tools to extract content from any media, document or website.

All your data extraction needs in one place.

Learn about Exabase Extract →

Why Exabase

Async processing that scales

Submit thousands of PDFs and let the pipeline handle them. No timeouts, no 30-second request limits. Poll for results or set up webhooks and process completions as they arrive.

Text as searchable chunks

Extracted text is split into structured chunks with page numbers. Build search, feed chunks into RAG pipelines, or index them directly. Not a single giant text blob.

Thumbnails and renders included

Every extraction job generates a CDN-hosted thumbnail and a rendered version of the document. Use them for previews in your UI without running a separate thumbnail service.

Scanned PDFs handled automatically

OCR is built in. Submit a scanned PDF the same way you'd submit a digital one. The API detects the document type and processes it accordingly.

One API for everything

The same POST /v2/extract endpoint handles PDFs, images, audio, video, and web pages. The response structure adapts to the content type. Learn one API, extract from anything.

Webhooks with delivery logs

No need to poll. Configure a webhook URL and get notified on completion. Every delivery attempt is logged with status codes and response bodies. You can manually re-trigger a webhook if your server missed it.

Reprocess without re-uploading

If a job fails, call the reprocess endpoint. The API retries using the original file without requiring a new upload.

SDK or raw HTTP

Use the @exabase/sdk package for Node.js/TypeScript with streaming file uploads. Or call the REST API directly from any language.

Not just extraction

Extracted content can feed directly into the rest of the Exabase platform. Store it as a searchable Resource, index it with Deep Search, keep it current with Workers. Extract is one part of a complete data layer for agents.

vs pdfplumber, PyPDF, pdfminer

OpenGraph.io charges $15/mo for 1,000 requests. Exabase gives you 60/hour (up to ~43,000/month) free.

vs Reducto, Extend, Unstructured

They return results and stop. Exabase gives you extraction plus an optional platform to store, search, and maintain the extracted content. Works great standalone. Works better together.

Works with other Exabase features

Memory

Self-managing advanced memory system.

Bases

Isolated per-tenant instances with version rollback.

Resources

Files, notes, links. Your portable context server.

Deep Search

Sub-document multi-modal hybrid search.

Extract

Structured data from PDFs, websites, images, and more.

Workers

Autonomous agents and self-enriching knowledge bases.

Ready for scale

Fast

Deliver speed to your users. Infrastructure that won't slow your agent down.

Secure

Encrypted in transit (SSL) and at rest (AES-256). CASA certified.

Reliable

99.9% uptime. Built on Exabase's consumer-grade scaled infrastructure.

Use cases

RAG pipelines

Extract text chunks from PDFs and feed them into your retrieval-augmented generation pipeline. Each chunk includes a page number for source attribution.
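One way to carry page attribution into a prompt is to prefix each chunk with its page number, as in the sketch below. The `text` field name, the `Chunk` type, and the `buildContext` helper are assumptions for illustration. Only `pageNumber` is named on this page.

```typescript
// Assumed chunk shape: `pageNumber` is mentioned on this page; `text` is a guess.
interface Chunk {
  pageNumber: number;
  text: string;
}

// Join chunks into a prompt context, prefixing each with its page number
// so an answer can cite its source. Stops before exceeding `maxChars`.
function buildContext(chunks: Chunk[], maxChars = 4000): string {
  const parts: string[] = [];
  let used = 0;
  for (const chunk of chunks) {
    const part = `[p. ${chunk.pageNumber}] ${chunk.text.trim()}`;
    if (used + part.length > maxChars) break;
    parts.push(part);
    used += part.length + 1; // +1 for the newline separator
  }
  return parts.join("\n");
}

console.log(
  buildContext([
    { pageNumber: 1, text: "Revenue grew." },
    { pageNumber: 3, text: " Costs fell. " },
  ]),
);
// [p. 1] Revenue grew.
// [p. 3] Costs fell.
```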

Invoice processing

Submit invoices, extract text and metadata, then use the structured chunks to identify line items, totals, and vendor info.

Contract analysis

Extract text from legal PDFs, then search across chunks to find specific clauses, dates, or parties.

Research paper indexing

Submit academic PDFs, extract text chunks and metadata (author, title, creation date), and build a searchable research library.

Document management systems

Auto-extract metadata and generate thumbnails for every PDF uploaded to your platform. Display page count, author, and a visual preview without opening the file.

Medical records processing

Extract text from clinical documents. Each chunk maps to a page for audit traceability.

Insurance claims intake

Submit claims documents, extract structured data, and route based on content.

Compliance and audit

Extract and index policy documents. Search across all extracted content to verify compliance with specific requirements.

Knowledge base ingestion

Extract content from a library of PDFs and store it as searchable Resources in Exabase. Workers can re-extract updated documents on a schedule.

FAQs

What is Exabase?

Exabase is infrastructure for AI agents. It gives your agents memory, versioned file storage, AI deep search, and context automation through a set of APIs. Store what your agent learns, search inside any content type, and keep knowledge bases current automatically. Built for production use.

Who uses Exabase?

Developers and teams building AI agents, copilots, and RAG applications. If your agent needs to remember things between sessions, store and retrieve files, search across documents and media, or stay up to date without manual maintenance, Exabase handles that infrastructure so you can focus on your product.

How long does extraction take?

Most PDFs complete in a few seconds. Very large or complex documents may take longer. Processing is asynchronous, so your application isn't blocked while it runs.

Do I have to poll for results?

No. You can configure a webhook URL and Exabase will POST the full result to your server when processing completes. Delivery is retried with exponential backoff if your server doesn't respond with a 2xx.
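Because deliveries are retried, a receiver may see the same job more than once. The sketch below shows one way to acknowledge duplicates without reprocessing them. It assumes the webhook body has the same shape as the job JSON shown on this page, which is not confirmed here. `handleWebhook` is a hypothetical function, and the in-memory record would need to be a persistent store in a real deployment.

```typescript
// Assumed payload: the same job JSON as the sample response on this page.
interface WebhookJob {
  id: string;
  state: string;
}

// Job ids already handled, so a retried delivery is acknowledged but not reprocessed.
const seen: { [id: string]: boolean } = {};

// Returns the HTTP status the webhook endpoint should reply with.
function handleWebhook(rawBody: string, onCompleted: (job: WebhookJob) => void): number {
  let parsed: any;
  try {
    parsed = JSON.parse(rawBody);
  } catch {
    return 400; // malformed body
  }
  if (!parsed || typeof parsed.id !== "string" || typeof parsed.state !== "string") {
    return 400; // not a job payload
  }
  const job: WebhookJob = parsed;
  if (seen[job.id]) return 200; // duplicate delivery: acknowledge and skip
  if (job.state === "completed") onCompleted(job);
  seen[job.id] = true;
  return 200;
}
```

Replying with a 2xx only after the work has been recorded means a crash mid-processing results in a retry rather than a lost notification.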

Does it handle scanned PDFs?

Yes. OCR is built in. Submit a scanned PDF the same way you'd submit a digital one. The API detects the type and processes it automatically.

What happens if extraction fails?

The job state moves to failed. You can call the reprocess endpoint to retry without re-uploading the original file.
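A client might turn the job state into a next step as in the sketch below. `nextAction` and its retry limit are illustrative assumptions. The states `pending`, `completed`, and `failed` appear on this page, and any other state is treated here as still in progress.

```typescript
type Action = "wait" | "read" | "reprocess" | "give-up";

// Decide what to do with a job based on its state and how many
// reprocess attempts have already been made.
function nextAction(state: string, retriesUsed: number, maxRetries = 2): Action {
  if (state === "completed") return "read";
  if (state === "failed") {
    // Retry via the reprocess endpoint until the limit is reached.
    return retriesUsed < maxRetries ? "reprocess" : "give-up";
  }
  return "wait"; // "pending" or any other state (assumed in progress)
}

console.log(nextAction("failed", 0)); // reprocess
console.log(nextAction("failed", 2)); // give-up
```

Capping the retries avoids looping forever on a file that cannot be processed, such as a password-protected PDF.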

How long are files retained?

Stored files are retained for 1 day from job creation, then permanently deleted. Download or copy any files you need before they expire.

Can I submit a URL instead of uploading a file?

Yes. Pass a url field in the JSON request body pointing to a publicly accessible PDF. Exabase fetches and processes it.

Can I extract specific pages only?

You can fetch specific chunk ranges using the start and end parameters on the chunks endpoint. Each chunk includes a pageNumber field so you can filter by page on your end.
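Filtering by page on the client side could look like this sketch. `chunksForPages` is a hypothetical helper. It relies only on the `pageNumber` field mentioned above and uses whatever page numbering the API returns.

```typescript
// Keep only the chunks whose pageNumber falls within [firstPage, lastPage].
// With a single page argument, only that page is returned.
function chunksForPages<T extends { pageNumber: number }>(
  chunks: T[],
  firstPage: number,
  lastPage: number = firstPage,
): T[] {
  return chunks.filter((c) => c.pageNumber >= firstPage && c.pageNumber <= lastPage);
}

const allChunks = [
  { pageNumber: 1, text: "Intro" },
  { pageNumber: 2, text: "Methods" },
  { pageNumber: 2, text: "More methods" },
  { pageNumber: 3, text: "Results" },
];

console.log(chunksForPages(allChunks, 2).length); // 2
console.log(chunksForPages(allChunks, 2, 3).length); // 3
```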

Does it work with password-protected PDFs?

No, password protection must be removed first.

What other file types does Extract support?

The same API handles images (with OCR), audio (with transcription), video (with transcription), and web pages (with metadata extraction). The response structure adapts to the content type.

What's the difference between this and the PDF thumbnail generator?

The thumbnail generator returns a preview image of a PDF page. Extract returns the actual content: metadata, text chunks, and document properties. One shows what the page looks like. The other tells you what it says. See PDF Thumbnail Generator.

Can I use the extracted content with other Exabase features?

Yes. Store extracted content as a Resource in Exabase and it becomes searchable through Deep Search. Workers can re-extract updated documents on a schedule. Extract is one entry point into the full platform.

Is there an SDK?

Yes. The @exabase/sdk package for Node.js/TypeScript handles file streaming, job polling, and chunk retrieval. Install with npm install @exabase/sdk.

Other tools

Link preview API

PDF thumbnail generator

PDF to JSON

Ship your first app in minutes.

Start for free

Read the docs
