
Build a Document Extraction Tool with Exabase

Extract structured data from any URL or file with a single API call.

This tutorial walks through how the Simple Extraction Demo uses the Exabase Extraction API to turn documents, web pages, images, audio, and video into clean, structured text and metadata without writing a single parser.


What it does

You submit a URL or drop a file. The app sends it to the Exabase Extraction API, polls until the job finishes, then displays everything that came back: metadata, auto-generated captions and keywords, document details, paginated text chunks, and downloadable attachments. A sidebar lists recent jobs so you can revisit past extractions, and a webhook panel lets you configure a delivery URL and inspect or re-fire individual deliveries.

This is the simplest possible integration with Exabase Extract. One API call in, structured data out. The example exists to show what the extraction response looks like and how to work with each piece of it.


What is Exabase Extract?

Exabase Extract turns any source into structured data with a single API call. Send a URL, file, image, audio, or video and get clean JSON back with tables, metadata, and text hierarchy intact. No parsers to write, no edge cases to handle, no post-processing pipeline to maintain.


How Exabase fits in

Submitting a source

The app accepts two kinds of input. For URLs:

ts

const job = await getExabase().extract.create({ url, name });


For uploaded files:

ts

const buffer = Buffer.from(await file.arrayBuffer());
const job = await getExabase().extract.createFromFile({ file: buffer, name });

Both return a job object immediately. Extraction runs asynchronously, so the app needs to poll for completion.
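
Because both entry points return the same kind of job, one small dispatcher can sit in front of them. The following is a minimal sketch, not the demo's actual code: the `ExtractApi` interface, the `Source` union, the `submitSource` name, and the `id`/`state` fields on the job are assumptions made for illustration. Only the `create` and `createFromFile` call shapes come from the snippets above.

```typescript
// Hypothetical, minimal view of the client: just the two calls shown above.
interface ExtractJob {
  id: string; // assumed field name
  state: string; // assumed field name
}

interface ExtractApi {
  create(input: { url: string; name?: string }): Promise<ExtractJob>;
  createFromFile(input: { file: Uint8Array; name?: string }): Promise<ExtractJob>;
}

// Hypothetical tagged union describing what the user submitted.
type Source =
  | { kind: "url"; url: string; name?: string }
  | { kind: "file"; bytes: Uint8Array; name?: string };

async function submitSource(api: ExtractApi, source: Source): Promise<ExtractJob> {
  if (source.kind === "url") {
    // Cheap sanity check so obviously unusable input fails before a job is created.
    if (!/^https?:\/\//i.test(source.url)) {
      throw new Error(`Expected an http(s) URL, got: ${source.url}`);
    }
    return api.create({ url: source.url, name: source.name });
  }
  // Buffer is a Uint8Array subclass, so a Node Buffer can be passed here as well.
  return api.createFromFile({ file: source.bytes, name: source.name });
}
```

Keeping the branch in one place means the rest of the app only deals with a job, whichever way it was created.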


Polling until done

The client uses React Query's refetchInterval to check the job state every 1.5 seconds, stopping once it hits a terminal state (completed, failed, or cleaned):

ts

const jobQuery = trpc.extract.get.useQuery(
  { jobId: activeJobId ?? "" },
  {
    enabled: activeJobId !== null,
    refetchInterval: (query) => {
      const state = query.state.data?.data.state ?? null;
      return isTerminal(state) ? false : 1500;
    },
  },
);
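
The `isTerminal` helper used in the query options is not shown in this walkthrough. A minimal sketch could look like the following, assuming job states are plain strings; the three terminal names come from the text above, while the `nextPollDelay` helper is a hypothetical name added here to show the decision in isolation.

```typescript
// Terminal states as named in the text: completed, failed, or cleaned.
type TerminalState = "completed" | "failed" | "cleaned";

function isTerminal(state: string | null | undefined): state is TerminalState {
  switch (state) {
    case "completed":
    case "failed":
    case "cleaned":
      return true;
    default:
      // null, undefined, and any in-progress state keep the poll alive.
      return false;
  }
}

// Mirrors the refetchInterval callback: false stops polling, a number is the delay in ms.
function nextPollDelay(state: string | null | undefined, intervalMs = 1500): number | false {
  return isTerminal(state) ? false : intervalMs;
}
```

Treating a missing state as non-terminal means the query keeps polling while the first response is still loading.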


On the server side, each poll is a single SDK call:

ts

const job = await getExabase().extract.get({ jobId });
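
Outside React (in a script or a background worker, say), the same call can drive a plain loop. This is a sketch under stated assumptions rather than part of the demo: `pollUntilTerminal`, its options, and the `maxAttempts` cap are hypothetical, and the job is assumed to expose a string `state` field.

```typescript
interface JobSnapshot {
  state: string; // assumed field name
}

// Terminal states named in the text above.
const TERMINAL_STATES = new Set(["completed", "failed", "cleaned"]);

async function pollUntilTerminal<T extends JobSnapshot>(
  fetchJob: () => Promise<T>,
  options: { intervalMs?: number; maxAttempts?: number } = {},
): Promise<T> {
  const { intervalMs = 1500, maxAttempts = 200 } = options;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const job = await fetchJob();
    if (TERMINAL_STATES.has(job.state)) return job;
    if (attempt < maxAttempts) {
      // Pause between checks so the API is not called in a tight loop.
      await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
    }
  }
  throw new Error(`Job did not reach a terminal state after ${maxAttempts} attempts`);
}

// Usage, with getExabase() as in the snippets above:
//   const job = await pollUntilTerminal(() => getExabase().extract.get({ jobId }));
```

The attempt cap is a design choice for this sketch: it turns a job that never finishes into an error instead of an endless loop.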


What comes back

Once a job completes, the extraction response includes several layers of data depending on the source type.

Common fields are always present: MIME type, file size, an auto-generated caption, keywords, a thumbnail, and a chunk count.

Document fields appear for PDFs and similar files: page count, title, author, subject, and a rendered PDF link.

Media fields appear for images, audio, and video: dimensions, duration, codec, OCR text, a transcript link, and a screenshot link.

Web fields appear for URLs: page title, site name, description, favicon, an image, and links to the raw HTML and a reader-mode view.

The app renders all of this in a single result card, organized by category. You don't have to use every field; just pick the ones relevant to your use case.
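
To make the layering concrete, here is one way the response could be modelled and summarized. The field names and the nesting under `document`, `media`, and `web` are assumptions for illustration; the text above lists what each group contains, not how the SDK keys it, so check the real response before relying on this shape.

```typescript
// Hypothetical shape: common fields at the top, one optional group per source type.
interface ExtractionResult {
  mimeType: string;
  fileSize: number;
  chunkCount: number;
  caption?: string;
  keywords?: string[];
  document?: { pageCount?: number; title?: string; author?: string };
  media?: { width?: number; height?: number; durationSeconds?: number; ocrText?: string };
  web?: { title?: string; siteName?: string; description?: string };
}

// Builds display lines from whichever groups are present.
function summarize(result: ExtractionResult): string[] {
  const lines = [`${result.mimeType}, ${result.fileSize} bytes, ${result.chunkCount} chunks`];
  if (result.caption) lines.push(`Caption: ${result.caption}`);
  if (result.document) {
    const { title = "untitled", pageCount } = result.document;
    lines.push(`Document: ${title}` + (pageCount !== undefined ? ` (${pageCount} pages)` : ""));
  }
  const duration = result.media?.durationSeconds;
  if (duration !== undefined) lines.push(`Media: ${duration}s`);
  if (result.web) lines.push(`Web: ${result.web.siteName ?? result.web.title ?? "unknown site"}`);
  return lines;
}
```

Modelling each group as optional keeps consumers honest: code has to check that a group exists before reading from it.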


Paginating text chunks

Extracted text comes back in paginated chunks. The app fetches them in windows of 20 and lets the user load more on demand:

ts

const CHUNK_WINDOW = 20;

const res = await getExabase().extract.getChunks({
  jobId,
  start,
  end: start + CHUNK_WINDOW - 1,
});

Each chunk includes a sequence number and, depending on the source, a pageNumber (for documents) or timeStart/timeEnd (for media with transcripts).
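
The window arithmetic and the per-chunk label can be sketched as below. Several things here are assumptions rather than documented behavior: that `start` is zero-based, that the sequence number is exposed as `sequence`, and that `timeStart`/`timeEnd` are in seconds. The helper names are hypothetical.

```typescript
const CHUNK_WINDOW = 20;

// Inclusive bounds for a window: start 0 gives 0..19, start 20 gives 20..39.
function chunkWindow(start: number, size: number = CHUNK_WINDOW): { start: number; end: number } {
  return { start, end: start + size - 1 };
}

// Start of the next window, or null when the current one already reaches the last chunk.
function nextWindowStart(start: number, chunkCount: number, size: number = CHUNK_WINDOW): number | null {
  const next = start + size;
  return next < chunkCount ? next : null;
}

interface Chunk {
  sequence: number; // assumed field name for the sequence number
  text: string;
  pageNumber?: number;
  timeStart?: number; // assumed to be seconds
  timeEnd?: number; // assumed to be seconds
}

function formatTime(seconds: number): string {
  const total = Math.floor(seconds);
  const secs = total % 60;
  return `${Math.floor(total / 60)}:${secs < 10 ? "0" : ""}${secs}`;
}

// Documents get a page label, transcripts a time range, everything else just the sequence.
function chunkLabel(chunk: Chunk): string {
  if (chunk.pageNumber !== undefined) return `#${chunk.sequence}, page ${chunk.pageNumber}`;
  if (chunk.timeStart !== undefined && chunk.timeEnd !== undefined) {
    return `#${chunk.sequence}, ${formatTime(chunk.timeStart)}-${formatTime(chunk.timeEnd)}`;
  }
  return `#${chunk.sequence}`;
}
```

With a chunk count of 45, the windows start at 0, 20, and 40, and `nextWindowStart(40, 45)` returns `null`, which is when a "load more" control would be hidden.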


Downloading attachments

If the extraction produced attachments (embedded images, tables, etc.), the app offers a one-click ZIP download:

ts

const blob = await getExabase().extract.downloadAttachments({ jobId });

This is served through a Next.js route handler that streams the ZIP directly to the browser.
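
The demo's route handler is not reproduced here. As a rough sketch of the response side, the helper below builds headers that make a browser save the ZIP as a file. The `zipDownloadHeaders` name, the filename pattern, and the `Cache-Control` choice are assumptions for illustration.

```typescript
// Builds response headers for a ZIP download of one job's attachments.
function zipDownloadHeaders(jobId: string): Record<string, string> {
  // Keep only filename-safe characters so the header value cannot be broken by odd input.
  const safeId = jobId.replace(/[^A-Za-z0-9_-]/g, "_");
  return {
    "Content-Type": "application/zip",
    "Content-Disposition": `attachment; filename="extract-${safeId}.zip"`,
    // Assumed choice: attachments may be private, so avoid caching them.
    "Cache-Control": "no-store",
  };
}

// Inside a route handler, this would pair with the blob from the call above, roughly:
//   const blob = await getExabase().extract.downloadAttachments({ jobId });
//   return new Response(blob, { headers: zipDownloadHeaders(jobId) });
```

Setting `Content-Disposition: attachment` is what turns the response into a download instead of an in-browser navigation.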


Reprocessing failed jobs

If a job fails (bad URL, transient error), the app shows a Reprocess button that retries extraction on the same source:

ts

await getExabase().extract.reprocess({ jobId });

The job returns to a pending state and the polling loop picks it back up automatically.


Browsing recent jobs

The sidebar lists the 25 most recent extraction jobs using cursor-based pagination:

ts

const res = await getExabase().extract.list({ limit: 25, cursor });

Each entry shows the job name (or the URL, or the ID) and a state badge, and clicking it loads the full result.
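
Cursor-based pagination generalizes to "keep asking until there is no next cursor". The sketch below assumes a response shape of `{ data, nextCursor }`, which is not confirmed by this walkthrough. `collectJobs`, `jobLabel`, and the `JobSummary` fields are likewise hypothetical names.

```typescript
interface JobSummary {
  id: string;
  state: string;
  name?: string;
  url?: string;
}

// Assumed response shape: a page of jobs plus an optional cursor for the next page.
interface ListPage {
  data: JobSummary[];
  nextCursor?: string | null;
}

type ListJobs = (params: { limit: number; cursor?: string }) => Promise<ListPage>;

// Follows cursors until maxJobs are collected or the listing runs out.
async function collectJobs(listJobs: ListJobs, maxJobs: number, pageSize = 25): Promise<JobSummary[]> {
  const jobs: JobSummary[] = [];
  let cursor: string | undefined;
  do {
    const page = await listJobs({ limit: pageSize, cursor });
    jobs.push(...page.data);
    cursor = page.nextCursor ?? undefined;
  } while (cursor !== undefined && jobs.length < maxJobs);
  return jobs.slice(0, maxJobs);
}

// Display label: name first, then URL, then ID. `||` also skips empty strings.
function jobLabel(job: JobSummary): string {
  return job.name || job.url || job.id;
}

// Usage, with getExabase() as above:
//   const jobs = await collectJobs((params) => getExabase().extract.list(params), 100);
```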


Webhooks

The app includes a full webhook integration. You configure a delivery URL through the settings panel:

ts

await getExabase().extractSettings.update({ webhookUrl });


Once configured, Exabase POSTs extraction events to that URL. The app can list delivery attempts for any job, inspect the request payload and response body for each attempt, and manually trigger a re-delivery:

ts

// List deliveries for a job
const logs = await getExabase().extract.listWebhookLogs({ jobId });

// Inspect a specific delivery
const detail = await getExabase().extract.getWebhookLog({ jobId, logId });

// Re-fire
const { traceId } = await getExabase().extract.triggerWebhook({ jobId });

The webhook logs panel auto-refreshes while any delivery is in a pending or retrying state.
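
The auto-refresh rule follows the same pattern as job polling: keep refetching while something is still in flight. A minimal sketch, assuming each log entry carries a string `status` field and that a 2-second interval is acceptable (both are assumptions, as is the helper name):

```typescript
interface WebhookLog {
  id: string;
  status: string; // assumed field name; "pending" and "retrying" come from the text above
}

// Returns a delay in ms while any delivery is still in flight, or false to stop refreshing.
function webhookRefetchInterval(logs: WebhookLog[] | undefined, intervalMs = 2000): number | false {
  const inFlight = (logs ?? []).some(
    (log) => log.status === "pending" || log.status === "retrying",
  );
  return inFlight ? intervalMs : false;
}
```

The return type matches what React Query's `refetchInterval` callback expects, so a function like this could be wired into the logs query the same way `isTerminal` is used for jobs.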


Key Exabase APIs used in this example

API                            Purpose
extract.create                 Submit a URL for extraction
extract.createFromFile         Submit an uploaded file for extraction
extract.get                    Poll job state until completed or failed
extract.getChunks              Paginate through extracted text chunks
extract.list                   List recent extraction jobs
extract.reprocess              Retry a failed extraction
extract.downloadAttachments    Download all attachments as a ZIP
extractSettings.get            Read the current webhook URL
extractSettings.update         Set or clear the webhook URL
extract.listWebhookLogs        List delivery attempts for a job
extract.getWebhookLog          Inspect a single delivery attempt
extract.triggerWebhook         Manually re-fire a webhook for a job


Technologies

  • Exabase: Extraction API for submitting URLs and files, polling job state, reading chunks, downloading attachments, and managing webhook settings

  • Next.js: Front-end and back-end

  • tRPC: Type-safe API layer between client and server

  • shadcn/ui: Accessible UI components

  • Tailwind CSS: Styling

  • Biome: Lint and format


Run it yourself

bash

git clone https://github.com/futurebrowser/exabase-examples.git
cd

Add EXABASE_API_KEY to .env.local, start the app, then open http://localhost:3000 and submit a URL or drop a file.
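
A missing key is an easy mistake at this step, so it can help to fail fast with a clear message. The guard below is a hypothetical helper, not part of the demo. It takes the environment as a plain object so it can be called with `process.env` on the server.

```typescript
// Reads EXABASE_API_KEY from an env-like object and fails loudly when it is absent.
function requireApiKey(env: Record<string, string | undefined>): string {
  const key = (env.EXABASE_API_KEY ?? "").trim();
  if (key === "") {
    throw new Error("EXABASE_API_KEY is not set; add it to .env.local");
  }
  return key;
}

// Usage on the server: const apiKey = requireApiKey(process.env);
```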


FAQ

What can I extract from?

Anything Exabase Extract supports: PDFs, DOCX, images, audio, video, and web URLs. The demo handles all of them through the same two entry points (create for URLs, createFromFile for uploads).


How long does extraction take?

Small documents and simple web pages finish in a few seconds. Larger files, media with transcripts, or pages that need rendering take longer. The polling loop handles the wait automatically.


What are the attachments in the ZIP download?

These are embedded assets that Exabase extracts from the source: images, tables rendered as separate files, and similar artifacts. Not every source produces attachments.


Do I need webhooks?

No. Polling works fine for interactive use. Webhooks are useful when you want to trigger downstream processing (indexing, notifications, pipelines) without keeping a connection open. The demo includes both patterns so you can see how each works.


Could I use this as a starting point for something bigger?

That's the idea. The extraction pipeline here (submit, poll, read chunks, download attachments) is the same regardless of what you build on top of it. Swap the result viewer for a search indexer, a summarization step, a content migration tool, or anything else that needs clean text out of messy sources. The Exabase integration stays the same.