Extract

Turn PDFs, websites, images, audio, and video into structured data.

One API, multi-modal extraction.

// Submit
{ "url": "https://example.com/report.pdf" }

// Result
{
  "state": "completed",
  "kind": "document",
  "extraction": {
    "common": {
      "mimeType": "application/pdf",
      "size": 204800,
      "thumbnail": "https://cdn.exabase.io/...",
      "chunkCount": 42
    },
    "document": {
      "pages": 18,
      "author": "Jane Smith",
      "title": "Q1 Report"
    }
  }
}

What is Extract?

Exabase Extract takes any file or URL and returns structured JSON. Metadata, text chunks, and thumbnails. One endpoint for PDFs, images, audio, video, and web pages.

Text is split into chunks with page numbers or timestamps. Processing is asynchronous. Poll for results or get notified via webhook.

Problem

Parsing is a full-time job.

Every source is different.

PDFs have tables that break across pages. Web pages hide content behind JavaScript. Audio needs transcription. Images need OCR.

You write a parser for each one, then have to maintain it forever.

Solution

One endpoint. Any source.

Send Exabase Extract a URL or a file.

It handles the parsing, the processing, and the edge cases.

You get structured JSON back with metadata, text chunks, and thumbnails. Ready for your pipeline or your agent.

Production-ready

Private by design

Security-first

Scalable

Supported input types

All types go through the same POST /v2/extract endpoint. The response structure adapts to the content type.

PDFs and documents

Page count, author, title, creation date, rendered PDF, thumbnail, text chunks with page numbers.

Images

Width, height, OCR text, thumbnail.

Audio

Transcript with timestamps, duration.

Video

Transcript with timestamps, duration, dimensions.

Web pages and bookmarks

Title, site name, page content as text chunks.

What you get back

The response structure adapts to the content type. The fields listed below are the most common ones; the list is not exhaustive.

Extracted content

| Field | Type | Description |
| --- | --- | --- |
| text | string | The extracted text |
| sequence | number | Chunk order (1, 2, 3...) |
| pageNumber | number | Source page (documents) |
| timeStart | number | Start timestamp in seconds (audio/video) |
| timeEnd | number | End timestamp in seconds (audio/video) |

Document properties

| Field | Type | Description |
| --- | --- | --- |
| pages | number | Page count |
| author | string | Document author |
| title | string | Document title |
| creationDate | string | Creation date (ISO 8601) |
| pdfRender.url | string | CDN URL for rendered version |

Media properties

| Field | Type | Description |
| --- | --- | --- |
| width | number | Width in pixels |
| height | number | Height in pixels |
| duration | number | Duration in seconds (audio/video) |
| ocr | string | OCR text (images) |
| transcript | string | Transcript (audio/video) |

Web properties

| Field | Type | Description |
| --- | --- | --- |
| title | string | Page title |
| siteName | string | Site name |

Common metadata

| Field | Type | Description |
| --- | --- | --- |
| mimeType | string | Detected MIME type |
| size | number | File size in bytes |
| thumbnail | string | CDN-hosted thumbnail URL |
| chunkCount | number | Number of text chunks extracted |
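
If you work in TypeScript, the fields above can be mirrored as types. The sketch below is inferred from the sample response and the field lists on this page, not copied from the SDK's published typings. The names ExtractJob, CommonMetadata, DocumentProperties, and summarize are hypothetical, and the SDK may name or nest things differently.

```typescript
// Hypothetical types inferred from the sample response on this page.
interface CommonMetadata {
  mimeType: string;
  size: number; // file size in bytes
  thumbnail?: string; // CDN-hosted thumbnail URL
  chunkCount: number;
}

interface DocumentProperties {
  pages?: number;
  author?: string;
  title?: string;
  creationDate?: string; // ISO 8601
}

interface ExtractJob {
  state: string; // this page mentions "pending", "completed", and "failed"
  kind?: string; // e.g. "document"
  extraction?: {
    common?: CommonMetadata;
    document?: DocumentProperties;
  };
}

// Build a one-line summary, tolerating missing optional fields.
function summarize(job: ExtractJob): string {
  if (job.state !== "completed") return `job is ${job.state}`;
  const title = job.extraction?.document?.title ?? "(untitled)";
  const pages = job.extraction?.document?.pages ?? 0;
  const chunks = job.extraction?.common?.chunkCount ?? 0;
  return `${title}: ${pages} pages, ${chunks} chunks`;
}

// The sample result shown at the top of this page.
const sampleJob: ExtractJob = {
  state: "completed",
  kind: "document",
  extraction: {
    common: {
      mimeType: "application/pdf",
      size: 204800,
      thumbnail: "https://cdn.exabase.io/...",
      chunkCount: 42,
    },
    document: { pages: 18, author: "Jane Smith", title: "Q1 Report" },
  },
};

const summary = summarize(sampleJob);
```

Every nested field is optional here on purpose: which blocks appear depends on the content type, so a consumer should not assume that document or any other block is present.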

How do I use it?

  1. Submit a file or URL

Using the SDK:

import { Exabase } from "@exabase/sdk";
import { createReadStream } from "fs";

const api = new Exabase({
  apiKey: process.env.EXABASE_API_KEY,
});

const job = await api.extract.createFromFile({
  file: createReadStream("./report.pdf"),
  name: "Q1 Report",
});

console.log(job.id);    // extraction job id
console.log(job.state); // "pending"

  2. Get the result

Poll until state reaches completed, or use webhooks to skip polling entirely.

const job = await api.extract.get({ jobId: "job-id" });

if (job.state === "completed") {
  console.log(job.extraction?.document?.pages);     // 18
  console.log(job.extraction?.document?.author);    // "Jane Smith"
  console.log(job.extraction?.common?.thumbnail);   // CDN thumbnail URL
  console.log(job.extraction?.common?.chunkCount);  // 42
}
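
If webhooks are not an option, a polling loop with a growing delay keeps request volume low. The helper below is a generic sketch rather than part of the SDK. The name pollUntilSettled, its options, and the default delays are all assumptions, and the getJob callback stands in for a call such as api.extract.get.

```typescript
interface JobLike {
  state: string;
}

interface PollOptions {
  initialDelayMs?: number; // wait before the second poll
  maxDelayMs?: number; // upper bound for the wait
  maxAttempts?: number; // give up after this many polls
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Poll until the job reaches "completed" or "failed", doubling the wait
// after every poll that finds the job still unfinished.
async function pollUntilSettled<T extends JobLike>(
  getJob: () => Promise<T>,
  options: PollOptions = {},
): Promise<T> {
  const { initialDelayMs = 500, maxDelayMs = 8000, maxAttempts = 20 } = options;
  let delay = initialDelayMs;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const job = await getJob();
    if (job.state === "completed" || job.state === "failed") return job;
    if (attempt < maxAttempts) {
      await sleep(delay);
      delay = Math.min(delay * 2, maxDelayMs);
    }
  }
  throw new Error(`job did not settle after ${maxAttempts} polls`);
}
```

With the client from step 1, this could be called as pollUntilSettled(() => api.extract.get({ jobId: job.id })).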

  3. Read the text content

Extracted text is split into chunks. Fetch them in pages:

const result = await api.extract.getChunks({
  jobId: "job-id",
  start: 1,
  end: 20,
});

for (const chunk of result.items) {
  console.log(chunk.text);
  console.log(chunk.pageNumber);  // documents
  console.log(chunk.timeStart);   // audio/video
}
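
To read every chunk, advance the start/end window until you reach chunkCount. The sketch below computes those windows. chunkWindows is a hypothetical helper, and it assumes that start and end are 1-based and inclusive, which the example above suggests but this page does not state outright.

```typescript
interface ChunkWindow {
  start: number;
  end: number;
}

// Split the range 1..chunkCount into consecutive windows of at most
// pageSize chunks each.
function chunkWindows(chunkCount: number, pageSize: number = 20): ChunkWindow[] {
  if (pageSize < 1) throw new RangeError("pageSize must be at least 1");
  const windows: ChunkWindow[] = [];
  for (let start = 1; start <= chunkCount; start += pageSize) {
    windows.push({ start, end: Math.min(start + pageSize - 1, chunkCount) });
  }
  return windows;
}

// For the 42-chunk sample document: 1-20, 21-40, 41-42.
const windowsForSample = chunkWindows(42, 20);
```

Each window can then be passed to the chunk request, for example api.extract.getChunks({ jobId, start: w.start, end: w.end }).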

Or skip polling with webhooks

Set up a webhook URL once. Exabase sends a POST request with the full result whenever a job completes. Failed deliveries are retried with exponential backoff.

curl https://api.exabase.io/v2/extract-settings \
  -X PUT \
  -H 'X-Api-Key: YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{ "webhookUrl": "https://your-server.com/webhooks/exabase" }'
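
On the receiving side, your handler needs to parse the POST body and answer with a 2xx status so the delivery is not retried. The sketch below keeps that logic independent of any web framework. handleExtractWebhook is a hypothetical name, and the payload shape (a job object with a string state) is an assumption based on the job examples on this page.

```typescript
interface WebhookResponse {
  status: number;
  body: string;
}

// Decide how to answer a webhook delivery, given the raw request body.
function handleExtractWebhook(rawBody: string): WebhookResponse {
  let payload: unknown;
  try {
    payload = JSON.parse(rawBody);
  } catch {
    return { status: 400, body: "invalid JSON" };
  }
  if (typeof payload !== "object" || payload === null) {
    return { status: 400, body: "expected a JSON object" };
  }
  // Assumed shape: the delivered result carries the job state.
  const job = payload as { state?: unknown };
  if (typeof job.state !== "string") {
    return { status: 400, body: "missing job state" };
  }
  // Acknowledge quickly; run slow work (indexing, storage) after responding.
  return { status: 200, body: `received job in state ${job.state}` };
}
```

Call this from whatever HTTP server you use and write the returned status and body to the response. Any non-2xx answer, including the 400s above, will be treated as a failed delivery.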

Extract API at a glance

| Endpoint | Method | Description |
| --- | --- | --- |
| /v2/extract | POST | Submit extraction job |
| /v2/extract | GET | List extraction jobs |
| /v2/extract/{jobId} | GET | Get extraction result |
| /v2/extract/{jobId}/chunks | GET | Fetch text chunks |
| /v2/extract/{jobId}/download | GET | Download all attachments (ZIP) |
| /v2/extract/{jobId}/reprocess | POST | Retry a failed job |
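
If you call the REST API directly, a small set of path builders avoids hand-assembled URLs. This is a sketch: extractPaths and extractUrl are hypothetical names, and the base URL is taken from the curl example on this page.

```typescript
// Base URL as it appears in the curl example on this page.
const BASE_URL = "https://api.exabase.io";

// Encode the job id so unusual characters cannot break the path.
const jobPath = (jobId: string): string =>
  `/v2/extract/${encodeURIComponent(jobId)}`;

// One builder per endpoint listed above.
const extractPaths = {
  jobs: (): string => "/v2/extract",
  job: (jobId: string): string => jobPath(jobId),
  chunks: (jobId: string): string => `${jobPath(jobId)}/chunks`,
  download: (jobId: string): string => `${jobPath(jobId)}/download`,
  reprocess: (jobId: string): string => `${jobPath(jobId)}/reprocess`,
};

function extractUrl(path: string): string {
  return `${BASE_URL}${path}`;
}
```

For example, extractUrl(extractPaths.reprocess(jobId)) gives the URL to POST to when retrying a failed job.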

FAQs

What is Exabase?

Exabase is infrastructure for AI agents. It gives your agents memory, versioned file storage, AI deep search, and context automation through a set of APIs. Store what your agent learns, search inside any content type, and keep knowledge bases current automatically. Built for production use.

Who uses Exabase?

Developers and teams building AI agents, copilots, and RAG applications. If your agent needs to remember things between sessions, store and retrieve files, search across documents and media, or stay up to date without manual maintenance, Exabase handles that infrastructure so you can focus on your product.

How long does extraction take?

Most files complete in a few seconds. Very large or complex documents may take longer. Processing is asynchronous, so your application isn't blocked while it runs.

What file types are supported?

PDFs, Word documents, images (JPEG, PNG, WebP, GIF), audio files, video files, and web pages/URLs. The API auto-detects the content type and processes it accordingly.

Does it handle scanned PDFs?

Yes. OCR is built in. Submit a scanned PDF the same way you'd submit a digital one. The API detects the type and processes it automatically.

Do I have to poll for results?

No. You can configure a webhook URL and Exabase will POST the full result to your server when processing completes. Delivery is retried with exponential backoff if your server doesn't respond with a 2xx.

What happens if extraction fails?

The job state moves to failed. You can call the reprocess endpoint to retry without re-uploading the original file.

How long are files retained?

Stored files are retained for 1 day from job creation, then permanently deleted. Download or copy any files you need before they expire.
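
To avoid fetching files that are already gone, you can track the deadline yourself. The sketch below assumes that "1 day" means exactly 24 hours and that you record the job creation time on your side, since this page does not name a creation timestamp field. expiresAt and isExpired are hypothetical helpers.

```typescript
// 1 day in milliseconds, per the retention policy described above.
const RETENTION_MS = 24 * 60 * 60 * 1000;

// When the stored files for a job created at createdAt stop being available.
function expiresAt(createdAt: Date): Date {
  return new Date(createdAt.getTime() + RETENTION_MS);
}

// True once the retention window has passed.
function isExpired(createdAt: Date, now: Date = new Date()): boolean {
  return now.getTime() >= expiresAt(createdAt).getTime();
}
```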

Can I submit a URL instead of uploading a file?

Yes. Pass a url field in the JSON request body pointing to a publicly accessible file or web page. Exabase fetches and processes it.
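
As a rough sketch of that request without the SDK, the helper below prepares the arguments for a fetch call. buildSubmitRequest is a hypothetical name. The base URL and the X-Api-Key header are taken from the webhook curl example on this page, and it is an assumption that POST /v2/extract authenticates the same way.

```typescript
interface PreparedRequest {
  url: string;
  init: {
    method: "POST";
    headers: Record<string, string>;
    body: string;
  };
}

// Prepare a POST /v2/extract request for a publicly accessible URL.
function buildSubmitRequest(sourceUrl: string, apiKey: string): PreparedRequest {
  // Fail early on anything that is clearly not an http(s) URL.
  if (!/^https?:\/\//i.test(sourceUrl)) {
    throw new TypeError("sourceUrl must start with http:// or https://");
  }
  return {
    url: "https://api.exabase.io/v2/extract",
    init: {
      method: "POST",
      headers: {
        "X-Api-Key": apiKey,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ url: sourceUrl }),
    },
  };
}
```

Usage would look like: const req = buildSubmitRequest("https://example.com/report.pdf", apiKey); const res = await fetch(req.url, req.init);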

What do text chunks look like?

Each chunk contains the extracted text, a sequence number, and a location reference. For documents, that's the page number. For audio and video, that's the start and end timestamps in seconds.
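
A chunk can be modelled with one type that covers both kinds of location. The sketch below is built from the field descriptions on this page; TextChunk, formatSeconds, and describeLocation are hypothetical names, not SDK exports.

```typescript
interface TextChunk {
  text: string;
  sequence: number;
  pageNumber?: number; // documents
  timeStart?: number; // audio/video, in seconds
  timeEnd?: number; // audio/video, in seconds
}

// Format a number of seconds as m:ss, dropping fractions of a second.
function formatSeconds(total: number): string {
  const whole = Math.floor(total);
  const minutes = Math.floor(whole / 60);
  const seconds = whole % 60;
  return `${minutes}:${seconds < 10 ? "0" : ""}${seconds}`;
}

// Produce a human-readable citation for where a chunk came from.
function describeLocation(chunk: TextChunk): string {
  if (chunk.pageNumber !== undefined) return `page ${chunk.pageNumber}`;
  if (chunk.timeStart !== undefined && chunk.timeEnd !== undefined) {
    return `${formatSeconds(chunk.timeStart)}-${formatSeconds(chunk.timeEnd)}`;
  }
  // Fall back to the sequence number when no location is present.
  return `chunk ${chunk.sequence}`;
}
```

A helper like this is handy when an agent needs to cite its source alongside the extracted text.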

Does it work with web pages that use JavaScript?

Yes. Web pages are rendered with JavaScript before extraction. Content that loads dynamically is captured.

Can I use the extracted content with other Exabase features?

Yes. Store extracted content as a Resource in Exabase and it becomes searchable through Deep Search. Workers can re-extract updated documents on a schedule. Extract is one entry point into the full platform.

Is there an SDK?

Yes. The @exabase/sdk package for Node.js/TypeScript handles file streaming, job polling, and chunk retrieval. Install with npm install @exabase/sdk. Or call the REST API directly from any language.

What's the difference between Extract and Deep Search?

Extract processes files and returns structured data. Deep Search searches across content you've already stored. They work together: extract a document, store it as a Resource, and it becomes searchable through Deep Search.

What's the difference between Extract and the link preview API?

The link preview API returns metadata only (title, description, image, URL). Extract returns the full page content as structured text chunks. Link preview is a lightweight GET request. Extract is async processing for deeper extraction.

Ship your first app in minutes.