Extract

Turn PDFs, websites, images, audio, and video into structured data.

One API, multi-modal extraction.

Read the docs

// Submit
{ "url": "https://example.com/report.pdf" }

// Result
{
  "state": "completed",
  "kind": "document",
  "extraction": {
    "common": {
      "mimeType": "application/pdf",
      "size": 204800,
      "thumbnail": "https://cdn.exabase.io/...",
      "chunkCount": 42
    },
    "document": {
      "pages": 18,
      "author": "Jane Smith",
      "title": "Q1 Report"
    }
  }
}

What is Extract?

Exabase Extract takes any file or URL and returns structured JSON. Metadata, text chunks, and thumbnails. One endpoint for PDFs, images, audio, video, and web pages.

Text is split into chunks with page numbers or timestamps. Processing is asynchronous. Poll for results or get notified via webhook.

Problem

Parsing is a full-time job.

Every source is different.

PDFs have tables that break across pages. Web pages hide content behind JavaScript. Audio needs transcription. Images need OCR.

You write a parser for each one, then have to maintain it forever.

Solution

One endpoint. Any source.

Send Exabase Extract a URL or a file.

It handles the parsing, the processing, and the edge cases.

You get structured JSON back with metadata, text chunks, and thumbnails. Ready for your pipeline or your agent.

Production-ready

Private by design

Security-first

Scalable

Supported input types

All types go through the same POST /v2/extract endpoint. The response structure adapts to the content type.

PDFs and documents

Page count, author, title, creation date, rendered PDF, thumbnail, text chunks with page numbers.

Images

Width, height, OCR text, thumbnail.

Audio

Transcript with timestamps, duration.

Video

Transcript with timestamps, duration, dimensions.

Web pages and bookmarks

Title, site name, page content as text chunks.

What you get back

The response structure adapts to the content type. The main fields are listed below; the list is not exhaustive.

Extracted content

| Field | Type | Description |
| --- | --- | --- |
| text | string | The extracted text |
| sequence | number | Chunk order (1, 2, 3...) |
| pageNumber | number | Source page (documents) |
| timeStart | number | Start timestamp in seconds (audio/video) |
| timeEnd | number | End timestamp in seconds (audio/video) |
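
A minimal TypeScript sketch of how the chunk fields above might be typed and turned into a short source reference. The `ExtractChunk` interface and the `locationLabel` helper are illustrative names, not part of the SDK. The optional markers assume that `pageNumber` is present only for documents and `timeStart`/`timeEnd` only for audio and video.

```typescript
// Illustrative shape for one chunk, based on the field list above.
interface ExtractChunk {
  text: string;
  sequence: number;
  pageNumber?: number; // documents
  timeStart?: number;  // seconds, audio/video
  timeEnd?: number;    // seconds, audio/video
}

// Format a number of seconds as m:ss, dropping any fractional part.
function formatSeconds(totalSeconds: number): string {
  const whole = Math.floor(totalSeconds);
  const minutes = Math.floor(whole / 60);
  const seconds = whole % 60;
  return `${minutes}:${String(seconds).padStart(2, "0")}`;
}

// Build a short source reference for a chunk:
// a page number, a time range, or the sequence number as a fallback.
function locationLabel(chunk: ExtractChunk): string {
  if (chunk.pageNumber !== undefined) {
    return `page ${chunk.pageNumber}`;
  }
  if (chunk.timeStart !== undefined) {
    const start = formatSeconds(chunk.timeStart);
    return chunk.timeEnd !== undefined
      ? `${start}-${formatSeconds(chunk.timeEnd)}`
      : `from ${start}`;
  }
  return `chunk ${chunk.sequence}`;
}
```

A label like this can be stored next to each chunk in a search index so that results point back to the page or moment they came from.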

Document properties

| Field | Type | Description |
| --- | --- | --- |
| pages | number | Page count |
| author | string | Document author |
| title | string | Document title |
| creationDate | string | Creation date (ISO 8601) |
| pdfRender.url | string | CDN URL for rendered version |

Media properties

| Field | Type | Description |
| --- | --- | --- |
| width | number | Width in pixels |
| height | number | Height in pixels |
| duration | number | Duration in seconds (audio/video) |
| ocr | string | OCR text (images) |
| transcript | string | Transcript (audio/video) |

Web properties

| Field | Type | Description |
| --- | --- | --- |
| title | string | Page title |
| siteName | string | Site name |

Common metadata

| Field | Type | Description |
| --- | --- | --- |
| mimeType | string | Detected MIME type |
| size | number | File size in bytes |
| thumbnail | string | CDN-hosted thumbnail URL |
| chunkCount | number | Number of text chunks extracted |
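
As a rough sketch, a document result could be typed and summarized as shown here. Only the `common` and `document` groups are modelled, because their nesting is shown in the example response on this page; where the media and web fields sit in the response is not shown, so they are left out. The type names and the `summarize` helper are illustrative, not SDK exports.

```typescript
// Field names mirror the example response and field lists on this page.
interface CommonMetadata {
  mimeType: string;
  size: number;       // bytes
  thumbnail?: string; // CDN URL
  chunkCount: number;
}

interface DocumentProperties {
  pages?: number;
  author?: string;
  title?: string;
  creationDate?: string; // ISO 8601
}

interface ExtractResult {
  state: string; // "pending", "completed" and "failed" appear on this page
  kind?: string; // for example "document"
  extraction?: {
    common?: CommonMetadata;
    document?: DocumentProperties;
  };
}

// One-line summary for a file listing. Returns null until the job has
// completed and common metadata is available.
function summarize(result: ExtractResult): string | null {
  const common = result.extraction?.common;
  if (result.state !== "completed" || !common) return null;

  const doc = result.extraction?.document;
  const parts: string[] = [];
  if (doc?.title) parts.push(doc.title);
  if (doc?.author) parts.push(`by ${doc.author}`);
  if (doc?.pages !== undefined) parts.push(`${doc.pages} pages`);
  parts.push(`${(common.size / 1024).toFixed(0)} KB`);
  parts.push(`${common.chunkCount} chunks`);
  return parts.join(", ");
}
```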

How do I use it?

Submit a file or URL

Using the SDK:

import { Exabase } from "@exabase/sdk";
import { createReadStream } from "fs";

const api = new Exabase({
  apiKey: process.env.EXABASE_API_KEY,
});

const job = await api.extract.createFromFile({
  file: createReadStream("./report.pdf"),
  name: "Q1 Report",
});

console.log(job.id);    // extraction job id
console.log(job.state); // "pending"
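
The example above uploads a local file. To submit a URL instead, a sketch using the REST API directly might look like the following. It assumes the base URL and the `X-Api-Key` header from the webhook curl example also apply to `POST /v2/extract`; the helper names are hypothetical.

```typescript
// Assumed base URL, taken from the curl example on this page.
const EXABASE_API = "https://api.exabase.io";

// Build the request for submitting a URL. Kept separate from the network
// call so it can be inspected or tested without sending anything.
function buildSubmitRequest(sourceUrl: string, apiKey: string) {
  return {
    endpoint: `${EXABASE_API}/v2/extract`,
    init: {
      method: "POST",
      headers: {
        "X-Api-Key": apiKey, // assumed to match the extract-settings example
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ url: sourceUrl }),
    },
  };
}

// Send the request and return the parsed JSON response.
async function submitUrl(sourceUrl: string, apiKey: string): Promise<unknown> {
  const { endpoint, init } = buildSubmitRequest(sourceUrl, apiKey);
  const response = await fetch(endpoint, init);
  if (!response.ok) {
    throw new Error(`Submit failed with HTTP ${response.status}`);
  }
  return response.json();
}
```

The URL must point to a publicly accessible file or web page, since Exabase fetches it server-side.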

Get the result

Poll until state reaches completed or failed, or use webhooks to skip polling entirely.

const job = await api.extract.get({ jobId: "job-id" });

if (job.state === "completed") {
  console.log(job.extraction?.document?.pages);     // 18
  console.log(job.extraction?.document?.author);    // "Jane Smith"
  console.log(job.extraction?.common?.thumbnail);   // CDN thumbnail URL
  console.log(job.extraction?.common?.chunkCount);  // 42
}
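
A single call like the one above only checks the state once. One possible polling loop, with a delay that doubles up to a cap, is sketched here. The `waitForJob` helper and its option names are hypothetical; it treats any state other than `completed` or `failed` as still in progress, which is an assumption about the state model.

```typescript
interface JobLike {
  state: string;
}

// Poll `getJob` until the job reaches "completed" or "failed".
// Any other state (for example "pending") is treated as still in progress.
async function waitForJob<T extends JobLike>(
  getJob: () => Promise<T>,
  options: { initialDelayMs?: number; maxDelayMs?: number; maxAttempts?: number } = {},
): Promise<T> {
  const { initialDelayMs = 1000, maxDelayMs = 15000, maxAttempts = 30 } = options;
  let delayMs = initialDelayMs;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const job = await getJob();
    if (job.state === "completed") return job;
    if (job.state === "failed") {
      throw new Error("Extraction failed");
    }
    // Wait, then double the delay up to the cap.
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    delayMs = Math.min(delayMs * 2, maxDelayMs);
  }
  throw new Error(`Job did not finish after ${maxAttempts} attempts`);
}
```

With the SDK client from the earlier example, this could be called as `await waitForJob(() => api.extract.get({ jobId }))`.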

Read the text content

Extracted text is split into chunks. Fetch them a batch at a time:

const result = await api.extract.getChunks({
  jobId: "job-id",
  start: 1,
  end: 20,
});

for (const chunk of result.items) {
  console.log(chunk.text);
  console.log(chunk.pageNumber);  // documents
  console.log(chunk.timeStart);   // audio/video
}
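
To read every chunk of a large file, the `start`/`end` window has to be advanced until `chunkCount` is reached. The helper below is a sketch that computes those windows; it assumes `start` and `end` are 1-based, inclusive sequence numbers, as the `start: 1, end: 20` call above suggests. `chunkWindows` is a hypothetical name.

```typescript
interface ChunkWindow {
  start: number;
  end: number;
}

// Split 1..chunkCount into consecutive inclusive windows of at most
// `pageSize` chunks each.
function chunkWindows(chunkCount: number, pageSize: number = 20): ChunkWindow[] {
  if (!Number.isInteger(pageSize) || pageSize < 1) {
    throw new RangeError("pageSize must be a positive integer");
  }
  const windows: ChunkWindow[] = [];
  for (let start = 1; start <= chunkCount; start += pageSize) {
    windows.push({ start, end: Math.min(start + pageSize - 1, chunkCount) });
  }
  return windows;
}
```

Each window can then be passed to `api.extract.getChunks({ jobId, start, end })`, using `job.extraction?.common?.chunkCount` from the completed job as the total.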

Or skip polling with webhooks

Set up a webhook URL once. Exabase sends a POST request with the full result whenever a job completes. Failed deliveries are retried with exponential backoff.

curl https://api.exabase.io/v2/extract-settings \
  -X PUT \
  -H 'X-Api-Key: YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{ "webhookUrl": "https://your-server.com/webhooks/exabase" }'
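
On the receiving side, the handler mainly needs to parse the body and answer with a 2xx quickly. The sketch below keeps that logic as a plain function so it can be wired into any HTTP framework. It assumes the payload has the same JSON shape as the job result shown at the top of this page (a top-level `state` field); the function and type names are hypothetical.

```typescript
interface WebhookOutcome {
  status: number;    // HTTP status to send back
  jobState?: string; // state reported in the payload, if any
}

// Parse a webhook body and decide how to respond.
// Note that any non-2xx status will cause the delivery to be retried.
function handleExtractWebhook(rawBody: string): WebhookOutcome {
  let payload: unknown;
  try {
    payload = JSON.parse(rawBody);
  } catch {
    return { status: 400 };
  }
  if (typeof payload !== "object" || payload === null) {
    return { status: 400 };
  }
  const state = (payload as { state?: unknown }).state;
  if (typeof state !== "string") {
    return { status: 400 };
  }
  // Acknowledge with a 2xx and do any heavy work after responding.
  return { status: 200, jobState: state };
}
```

Because failed deliveries are retried, the same job may arrive more than once, so downstream processing should tolerate duplicates.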

Extract API at a glance

| Endpoint | Method | Description |
| --- | --- | --- |
| /v2/extract | POST | Submit extraction job |
| /v2/extract | GET | List extraction jobs |
| /v2/extract/{jobId} | GET | Get extraction result |
| /v2/extract/{jobId}/chunks | GET | Fetch text chunks |
| /v2/extract/{jobId}/download | GET | Download all attachments (ZIP) |
| /v2/extract/{jobId}/reprocess | POST | Retry a failed job |
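
For callers not using the SDK, the endpoints above can be collected in one small helper. This is a sketch: the base URL is taken from the curl example on this page, and `extractRoutes` is a hypothetical name.

```typescript
// Assumed base URL, taken from the curl example on this page.
const API_BASE = "https://api.exabase.io";

type RouteName = "submit" | "list" | "get" | "chunks" | "download" | "reprocess";

interface ExtractRoute {
  method: "GET" | "POST";
  url: string;
}

// Map each endpoint to its HTTP method and full URL.
// `submit` and `list` do not depend on the job id.
function extractRoutes(jobId: string): Record<RouteName, ExtractRoute> {
  const root = `${API_BASE}/v2/extract`;
  const job = `${root}/${encodeURIComponent(jobId)}`;
  return {
    submit: { method: "POST", url: root },
    list: { method: "GET", url: root },
    get: { method: "GET", url: job },
    chunks: { method: "GET", url: `${job}/chunks` },
    download: { method: "GET", url: `${job}/download` },
    reprocess: { method: "POST", url: `${job}/reprocess` },
  };
}
```

Retrying a failed job then comes down to sending a request to `extractRoutes(jobId).reprocess` with the API key header.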

View full API reference →

Use cases

RAG pipelines

Extract text chunks from documents and feed them into your retrieval-augmented generation pipeline. Each chunk includes a page number or timestamp for source attribution.
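
As an illustration of the attribution step, the sketch below joins already-selected chunks into a context string with a source tag on each line. Retrieval and ranking are out of scope here; the function simply keeps chunks in sequence order and stops at a character budget. `SourceChunk` and `buildContext` are hypothetical names.

```typescript
interface SourceChunk {
  text: string;
  sequence: number;
  pageNumber?: number; // documents
  timeStart?: number;  // seconds, audio/video
}

// Join chunks into one context string, each line prefixed with its source.
// Stops before the first chunk that would exceed `maxChars`.
function buildContext(chunks: SourceChunk[], maxChars: number): string {
  const ordered = [...chunks].sort((a, b) => a.sequence - b.sequence);
  const lines: string[] = [];
  let used = 0;

  for (const chunk of ordered) {
    const source =
      chunk.pageNumber !== undefined
        ? `p. ${chunk.pageNumber}`
        : chunk.timeStart !== undefined
          ? `t=${chunk.timeStart}s`
          : `#${chunk.sequence}`;
    const line = `[${source}] ${chunk.text}`;
    if (used + line.length > maxChars) break;
    lines.push(line);
    used += line.length + 1; // +1 for the newline separator
  }
  return lines.join("\n");
}
```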

Invoice and receipt processing

Submit invoices, extract text and metadata, then use the chunks to identify line items, totals, and vendor info.

Contract analysis

Extract text from legal PDFs, then search across chunks to find clauses, dates, or parties.

Document management

Auto-extract metadata and generate thumbnails for every file uploaded to your platform. Display page count, author, and a visual preview without opening the file.

Audio and video transcription

Submit recordings and get back transcripts with speaker timestamps. Index the chunks for searchable audio and video libraries.

Web page archiving

Submit URLs and get back structured content. Pages rendered with JavaScript are handled automatically.

Knowledge base ingestion

Extract content from a library of files and store it as searchable Resources in Exabase. Workers can re-extract updated documents on a schedule.

Research paper indexing

Submit academic PDFs, extract text and metadata (author, title, creation date), and build a searchable research library.

Image OCR

Submit images of receipts, business cards, whiteboards, or handwritten notes. Get back OCR text as structured chunks.

Ready for scale

Fast

Deliver speed to your users. Infrastructure that won't slow your agent down.

Secure

Encrypted in transit (SSL) and at rest (AES-256). CASA certified.

Reliable

99.9% uptime. Built on Exabase's consumer-grade scaled infrastructure.

Why Exabase

One endpoint for everything

PDFs, images, audio, video, web pages. The same POST /v2/extract handles all of them. The response structure adapts to the content type. Learn one API, extract from anything.

Text as searchable chunks

Extracted text is split into structured chunks with page numbers (documents) or timestamp ranges (audio/video). Feed them into RAG pipelines, build search indexes, or pass them as LLM context. Not a single giant text blob.

Thumbnails and renders included

Every extraction job generates a CDN-hosted thumbnail and (for PDFs) a rendered version of the document. Use them for previews in your UI without running a separate service.

Async processing that scales

Submit jobs and move on. Poll for results or configure webhooks. No 30-second timeout limits. Process thousands of files in parallel.

Webhooks with delivery logs

Configure a webhook URL and get notified on completion. Every delivery attempt is logged. You can manually re-trigger a webhook if your server missed it.

Reprocess without re-uploading

If a job fails, call the reprocess endpoint. The API retries using the original file. No need to upload again.

Part of the full platform

Extracted content connects directly to the rest of Exabase. Store it as a Resource, search it with Deep Search, maintain it with Workers. Extract a document, and it's immediately searchable. No glue code, no separate storage layer.

Part of a full platform

Memory

Self-managing advanced memory system.

Bases

Isolated per-tenant instances with version rollback.

Resources

Files, notes, links. Your portable context server.

Deep Search

Sub-document multi-modal hybrid search.

Extract

Structured data from PDFs, websites, images, and more.

Workers

Autonomous agents and self-enriching knowledge bases.

Works with everything

Model-agnostic. Framework-agnostic.

SDK

TypeScript SDK. A convenient toolkit to move faster.

API

Simple, clean API. Up and running in under a minute.

FAQs

What is Exabase?

Exabase is infrastructure for AI agents. It gives your agents memory, versioned file storage, AI deep search, and context automation through a set of APIs. Store what your agent learns, search inside any content type, and keep knowledge bases current automatically. Built for production use.

Who uses Exabase?

Developers and teams building AI agents, copilots, and RAG applications. If your agent needs to remember things between sessions, store and retrieve files, search across documents and media, or stay up to date without manual maintenance, Exabase handles that infrastructure so you can focus on your product.

How long does extraction take?

Most files complete in a few seconds. Very large or complex documents may take longer. Processing is asynchronous, so your application isn't blocked while it runs.

What file types are supported?

PDFs, Word documents, images (JPEG, PNG, WebP, GIF), audio files, video files, and web pages/URLs. The API auto-detects the content type and processes it accordingly.

Does it handle scanned PDFs?

Yes. OCR is built in. Submit a scanned PDF the same way you'd submit a digital one. The API detects the type and processes it automatically.

Do I have to poll for results?

No. You can configure a webhook URL and Exabase will POST the full result to your server when processing completes. Delivery is retried with exponential backoff if your server doesn't respond with a 2xx.

What happens if extraction fails?

The job state moves to failed. You can call the reprocess endpoint to retry without re-uploading the original file.

How long are files retained?

Stored files are retained for 1 day from job creation, then permanently deleted. Download or copy any files you need before they expire.
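
A small sketch of the retention arithmetic, useful for scheduling a copy job before files expire. It assumes "1 day" means exactly 24 hours from job creation; how the creation time is exposed on the job object is not shown on this page, so the helpers take a plain `Date`. The names are hypothetical.

```typescript
// Assumption: "1 day" is treated as exactly 24 hours.
const RETENTION_MS = 24 * 60 * 60 * 1000;

// Time at which stored files for a job are due to be deleted.
function retentionDeadline(createdAt: Date): Date {
  return new Date(createdAt.getTime() + RETENTION_MS);
}

// Milliseconds left before deletion, never negative.
function msUntilDeletion(createdAt: Date, now: Date = new Date()): number {
  return Math.max(0, retentionDeadline(createdAt).getTime() - now.getTime());
}
```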

Can I submit a URL instead of uploading a file?

Yes. Pass a url field in the JSON request body pointing to a publicly accessible file or web page. Exabase fetches and processes it.

What do text chunks look like?

Each chunk contains the extracted text, a sequence number, and a location reference. For documents, that's the page number. For audio and video, that's the start and end timestamps in seconds.

Does it work with web pages that use JavaScript?

Yes. Web pages are rendered with JavaScript before extraction. Content that loads dynamically is captured.

Can I use the extracted content with other Exabase features?

Yes. Store extracted content as a Resource in Exabase and it becomes searchable through Deep Search. Workers can re-extract updated documents on a schedule. Extract is one entry point into the full platform.

Is there an SDK?

Yes. The @exabase/sdk package for Node.js/TypeScript handles file streaming, job polling, and chunk retrieval. Install with npm install @exabase/sdk. Or call the REST API directly from any language.

What's the difference between Extract and Deep Search?

Extract processes files and returns structured data. Deep Search searches across content you've already stored. They work together: extract a document, store it as a Resource, and it becomes searchable through Deep Search.

What's the difference between Extract and the link preview API?

The link preview API returns metadata only (title, description, image, URL). Extract returns the full page content as structured text chunks. Link preview is a lightweight GET request. Extract is async processing for deeper extraction.

Ship your first app in minutes.

Start for free

Read the docs
