Build a Document Extraction Tool with Exabase

Extract structured data from any URL or file with a single API call.

This tutorial walks through how the Simple Extraction Demo example uses the Exabase Extraction API to turn documents, web pages, images, audio, and video into clean, structured text and metadata, all without writing a single parser.

What it does

You submit a URL or drop a file. The app sends it to the Exabase Extraction API, polls until the job finishes, then displays everything that came back: metadata, auto-generated captions and keywords, document details, paginated text chunks, and downloadable attachments. A sidebar lists recent jobs so you can revisit past extractions, and a webhook panel lets you configure a delivery URL and inspect or re-fire individual deliveries.

This is the simplest possible integration with Exabase Extract. One API call in, structured data out. The example exists to show what the extraction response looks like and how to work with each piece of it.

What is Exabase Extract?

Exabase Extract turns any source into structured data with a single API call. Send a URL, file, image, audio, or video and get clean JSON back with tables, metadata, and text hierarchy intact. No parsers to write, no edge cases to handle, no post-processing pipeline to maintain.

How Exabase fits in

Submitting a source

The app accepts two kinds of input. For URLs:

ts

const job = await getExabase().extract.create({ url, name });

For uploaded files:

ts

const buffer = Buffer.from(await file.arrayBuffer());
const job = await getExabase().extract.createFromFile({ file: buffer, name });

Both return a job object immediately. Extraction runs asynchronously, so the app needs to poll for completion.

Polling until done

The client uses React Query's refetchInterval to check the job state every 1.5 seconds, stopping once it hits a terminal state (completed, failed, or cleaned):

ts

const jobQuery = trpc.extract.get.useQuery(
  { jobId: activeJobId ?? "" },
  {
    enabled: activeJobId !== null,
    refetchInterval: (query) => {
      const state = query.state.data?.data.state ?? null;
      return isTerminal(state) ? false : 1500;
    },
  },
);
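
The query above calls `isTerminal` without showing it. A minimal sketch of such a helper, assuming the state arrives as a plain string; the three terminal names come from the text above, but the helper in the example repo may be written differently:

```typescript
// Terminal states named in this tutorial. Any other value (for example
// "pending") is treated as still in progress.
const TERMINAL_STATES = ["completed", "failed", "cleaned"] as const;

type TerminalState = (typeof TERMINAL_STATES)[number];

// Returns true when polling should stop. A null state means the first
// response has not arrived yet, so polling continues.
function isTerminal(state: string | null): state is TerminalState {
  return (
    state !== null &&
    (TERMINAL_STATES as readonly string[]).indexOf(state) !== -1
  );
}
```

Returning `false` for `null` keeps the refetch interval active until the first response lands.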

On the server side, each poll is a single SDK call:

ts

const job = await getExabase().extract.get({ jobId });
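
Outside React (a script or a background worker, say) the same wait can be written as a plain loop. The sketch below is not part of the Exabase SDK: `pollUntil` and its parameters are hypothetical names, and the fetch function is injected so the loop can wrap `extract.get` or any other call.

```typescript
// Generic polling loop: call fetchOnce until isDone accepts the result,
// sleeping intervalMs between attempts and giving up after maxAttempts.
async function pollUntil<T>(
  fetchOnce: () => Promise<T>,
  isDone: (value: T) => boolean,
  intervalMs: number = 1500,
  maxAttempts: number = 200,
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const value = await fetchOnce();
    if (isDone(value)) return value;
    if (attempt < maxAttempts) {
      await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
    }
  }
  throw new Error(`Gave up after ${maxAttempts} polls`);
}

// Possible usage (assumes the job state lives at job.data.state, as the
// client snippet in this tutorial suggests):
//   const job = await pollUntil(
//     () => getExabase().extract.get({ jobId }),
//     (j) => ["completed", "failed", "cleaned"].indexOf(j.data.state) !== -1,
//   );
```

The attempt cap is a design choice rather than anything the API requires; it keeps a stuck job from holding a worker forever.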

What comes back

Once a job completes, the extraction response includes several layers of data depending on the source type.

Common fields are always present: MIME type, file size, an auto-generated caption, keywords, a thumbnail, and a chunk count.

Document fields appear for PDFs and similar files: page count, title, author, subject, and a rendered PDF link.

Media fields appear for images, audio, and video: dimensions, duration, codec, OCR text, a transcript link, and a screenshot link.

Web fields appear for URLs: page title, site name, description, favicon, an image, and links to the raw HTML and a reader-mode view.

The app renders all of this in a single result card, organized by category. You don't have to use every field; just pick the ones relevant to your use case.

Paginating text chunks

Extracted text comes back in paginated chunks. The app fetches them in windows of 20 and lets the user load more on demand:

ts

const CHUNK_WINDOW = 20;

const res = await getExabase().extract.getChunks({
  jobId,
  start,
  end: start + CHUNK_WINDOW - 1,
});
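
The `end: start + CHUNK_WINDOW - 1` arithmetic means `end` is inclusive. A small sketch of the "load more" bookkeeping, assuming zero-based chunk indices and using the chunk count from the job's common fields; `nextWindow` is a hypothetical helper, and clamping the last window to the total is a choice made here, not a documented requirement:

```typescript
const CHUNK_WINDOW = 20;

interface ChunkRange {
  start: number; // index of the first chunk to fetch
  end: number; // index of the last chunk to fetch (inclusive)
}

// Given how many chunks are already loaded and the total chunk count,
// return the next inclusive range to request, or null when done.
function nextWindow(
  loaded: number,
  total: number,
  size: number = CHUNK_WINDOW,
): ChunkRange | null {
  if (loaded >= total) return null;
  return { start: loaded, end: Math.min(loaded + size, total) - 1 };
}
```

For a 45-chunk job this yields the ranges 0 to 19, 20 to 39, and 40 to 44, then `null`, which is the point at which a "load more" button would be hidden.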

Each chunk includes a sequence number and, depending on the source, a pageNumber (for documents) or timeStart/timeEnd (for media with transcripts).
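
One way to turn those position fields into a label for the UI is sketched below. `pageNumber`, `timeStart`, and `timeEnd` are the names used above; the `sequence` field name and the assumption that times are in seconds are guesses, so check the actual response before relying on them.

```typescript
interface ChunkPosition {
  sequence: number; // assumed field name for the sequence number
  pageNumber?: number; // present for documents
  timeStart?: number; // present for media; assumed to be seconds
  timeEnd?: number;
}

// Format a number of seconds as m:ss.
function formatTime(seconds: number): string {
  const total = Math.max(0, Math.floor(seconds));
  const mins = Math.floor(total / 60);
  const secs = total % 60;
  return mins + ":" + (secs < 10 ? "0" : "") + secs;
}

// Prefer the page number, then the time range, then the bare sequence.
function chunkLabel(chunk: ChunkPosition): string {
  if (chunk.pageNumber !== undefined) return "Page " + chunk.pageNumber;
  if (chunk.timeStart !== undefined && chunk.timeEnd !== undefined) {
    return formatTime(chunk.timeStart) + "-" + formatTime(chunk.timeEnd);
  }
  return "Chunk " + chunk.sequence;
}
```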

Downloading attachments

If the extraction produced attachments (embedded images, tables, etc.), the app offers a one-click ZIP download:

ts

const blob = await getExabase().extract.downloadAttachments({ jobId });

This is served through a Next.js route handler that streams the ZIP directly to the browser.
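
A rough sketch of what the response side of such a handler could look like, using only the standard `Response` and `Blob` web APIs. `zipResponse` is a hypothetical helper, not code from the example repo, and the filename scheme is an assumption:

```typescript
// Wrap a ZIP blob in a download response. The filename is sanitized so
// it cannot inject quotes or path separators into the header.
function zipResponse(blob: Blob, baseName: string): Response {
  const safeName = baseName.replace(/[^A-Za-z0-9_-]/g, "_");
  return new Response(blob, {
    headers: {
      "Content-Type": "application/zip",
      "Content-Disposition": 'attachment; filename="' + safeName + '.zip"',
    },
  });
}

// Inside a route handler this might be used roughly like so (untested
// against the real SDK):
//   const blob = await getExabase().extract.downloadAttachments({ jobId });
//   return zipResponse(blob, jobId);
```

The `Content-Disposition: attachment` header is what makes the browser save the file instead of trying to display it.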

Reprocessing failed jobs

If a job fails (bad URL, transient error), the app shows a Reprocess button that retries extraction on the same source:

ts

await getExabase().extract.reprocess({ jobId });

The job returns to a pending state and the polling loop picks it back up automatically.

Browsing recent jobs

The sidebar lists the 25 most recent extraction jobs using cursor-based pagination:

ts

const res = await getExabase().extract.list({ limit: 25, cursor });

Each entry shows the job name (or URL, or ID) and a state badge, and can be clicked to load the full result.
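
The name, URL, ID fallback can be captured in a few lines. This is a sketch only: the `JobSummary` shape and its field names are assumptions for illustration, not the SDK's actual list item type.

```typescript
// Hypothetical shape of one sidebar entry.
interface JobSummary {
  id: string;
  name?: string | null;
  url?: string | null;
}

// Pick the most readable label available: name, then URL, then ID.
// Blank or whitespace-only strings are skipped.
function jobLabel(job: JobSummary): string {
  const name = job.name ? job.name.trim() : "";
  if (name !== "") return name;
  const url = job.url ? job.url.trim() : "";
  if (url !== "") return url;
  return job.id;
}
```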

Webhooks

The app includes a full webhook integration. You configure a delivery URL through the settings panel:

ts

await getExabase().extractSettings.update({ webhookUrl });

Once configured, Exabase POSTs extraction events to that URL. The app can list delivery attempts for any job, inspect the request payload and response body for each attempt, and manually trigger a re-delivery:

ts

// List deliveries for a job
const logs = await getExabase().extract.listWebhookLogs({ jobId });

// Inspect a specific delivery
const detail = await getExabase().extract.getWebhookLog({ jobId, logId });

// Re-fire
const { traceId } = await getExabase().extract.triggerWebhook({ jobId });

The webhook logs panel auto-refreshes while any delivery is in a pending or retrying state.
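
That auto-refresh rule maps naturally onto the same `refetchInterval` pattern used for job polling. The sketch below assumes each log entry exposes its status in a `state` field and uses a 2-second interval; both are guesses made for illustration, and `webhookRefetchInterval` is a hypothetical name.

```typescript
// Hypothetical shape of one delivery log entry.
interface WebhookDelivery {
  state: string; // assumed field name
}

// States named in the text as still in flight.
const ACTIVE_DELIVERY_STATES = ["pending", "retrying"];

// Return a polling interval while any delivery is still in flight,
// or false to stop refetching once everything has settled.
function webhookRefetchInterval(
  logs: WebhookDelivery[] | undefined,
  intervalMs: number = 2000,
): number | false {
  const anyActive =
    logs !== undefined &&
    logs.some((log) => ACTIVE_DELIVERY_STATES.indexOf(log.state) !== -1);
  return anyActive ? intervalMs : false;
}
```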

Key Exabase APIs used in this example

API	Purpose
`extract.create`	Submit a URL for extraction
`extract.createFromFile`	Submit an uploaded file for extraction
`extract.get`	Poll job state until it reaches a terminal state (completed, failed, or cleaned)
`extract.getChunks`	Paginate through extracted text chunks
`extract.list`	List recent extraction jobs
`extract.reprocess`	Retry a failed extraction
`extract.downloadAttachments`	Download all attachments as a ZIP
`extractSettings.get`	Read the current webhook URL
`extractSettings.update`	Set or clear the webhook URL
`extract.listWebhookLogs`	List delivery attempts for a job
`extract.getWebhookLog`	Inspect a single delivery attempt
`extract.triggerWebhook`	Manually re-fire a webhook for a job

Technologies

Exabase: Extraction API for submitting URLs and files, polling job state, reading chunks, downloading attachments, and managing webhook settings
Next.js: Front-end and back-end
tRPC: Type-safe API layer between client and server
shadcn/ui: Accessible UI components
Tailwind CSS: Styling
Biome: Lint and format

Run it yourself

bash

git clone https://github.com/futurebrowser/exabase-examples.git
cd

Add EXABASE_API_KEY to .env.local and start the app. Then open http://localhost:3000 and submit a URL or drop a file.

FAQ

What can I extract from?

Anything Exabase Extract supports: PDFs, DOCX, images, audio, video, and web URLs. The demo handles all of them through the same two entry points (create for URLs, createFromFile for uploads).

How long does extraction take?

Small documents and simple web pages finish in a few seconds. Larger files, media with transcripts, or pages that need rendering take longer. The polling loop handles the wait automatically.

What are the attachments in the ZIP download?

These are embedded assets that Exabase extracts from the source: images, tables rendered as separate files, and similar artifacts. Not every source produces attachments.

Do I need webhooks?

No. Polling works fine for interactive use. Webhooks are useful when you want to trigger downstream processing (indexing, notifications, pipelines) without keeping a connection open. The demo includes both patterns so you can see how each works.

Could I use this as a starting point for something bigger?

That's the idea. The extraction pipeline here (submit, poll, read chunks, download attachments) is the same regardless of what you build on top of it. Swap the result viewer for a search indexer, a summarization step, a content migration tool, or anything else that needs clean text out of messy sources. The Exabase integration stays the same.

