Build a Document Extraction Tool with Exabase
Extract structured data from any URL or file with a single API call.
This tutorial walks through how the Simple Extraction Demo example uses the Exabase Extraction API to turn documents, web pages, images, audio, and video into clean, structured text and metadata, all without writing a single parser.
What it does
You submit a URL or drop a file. The app sends it to the Exabase Extraction API, polls until the job finishes, then displays everything that came back: metadata, auto-generated captions and keywords, document details, paginated text chunks, and downloadable attachments. A sidebar lists recent jobs so you can revisit past extractions, and a webhook panel lets you configure a delivery URL and inspect or re-fire individual deliveries.
This is the simplest possible integration with Exabase Extract. One API call in, structured data out. The example exists to show what the extraction response looks like and how to work with each piece of it.
What is Exabase Extract?
Exabase Extract turns any source into structured data with a single API call. Send a URL, file, image, audio, or video and get clean JSON back with tables, metadata, and text hierarchy intact. No parsers to write, no edge cases to handle, no post-processing pipeline to maintain.
How Exabase fits in
Submitting a source
The app accepts two kinds of input. For URLs:
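A minimal sketch of the URL path, not the demo's actual code. The create method name comes from this tutorial's FAQ; the ExtractionApi interface, the { url } argument shape, the Job fields, and the normalizeUrl and submitUrl helpers are assumptions made for illustration.

```ts
// Hypothetical shapes: the real SDK's types may differ.
interface Job {
  id: string;
  state: string; // e.g. "pending", "completed", "failed", "cleaned"
}

interface ExtractionApi {
  // `create` is named in the FAQ; this argument shape is assumed.
  create(input: { url: string }): Promise<Job>;
}

// Trim user input and reject anything that is not an http(s) URL.
function normalizeUrl(raw: string): string {
  const url = raw.trim();
  if (!/^https?:\/\/\S+$/i.test(url)) {
    throw new Error("Expected an http(s) URL, got: " + raw);
  }
  return url;
}

// Submit the URL. The returned job has been accepted, not finished.
async function submitUrl(api: ExtractionApi, raw: string): Promise<Job> {
  return api.create({ url: normalizeUrl(raw) });
}
```

Validating before the call keeps obviously bad input from turning into a failed job that the user then has to reprocess.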
For uploaded files:
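A comparable sketch for uploads. The createFromFile name again comes from the FAQ; the argument shape (file name, MIME type, raw bytes) and the submitFile helper are assumptions, so treat this as an outline rather than the SDK's real signature.

```ts
// Hypothetical shapes: the real SDK's types may differ.
interface Job {
  id: string;
  state: string;
}

interface UploadedFile {
  name: string;
  mimeType: string;
  data: Uint8Array;
}

interface ExtractionApi {
  // `createFromFile` is named in the FAQ; this argument shape is assumed.
  createFromFile(file: UploadedFile): Promise<Job>;
}

// Reject empty uploads before spending an API call on them.
async function submitFile(api: ExtractionApi, file: UploadedFile): Promise<Job> {
  if (file.data.byteLength === 0) {
    throw new Error("Refusing to submit empty file: " + file.name);
  }
  return api.createFromFile(file);
}
```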
Both return a job object immediately. Extraction runs asynchronously, so the app needs to poll for completion.
Polling until done
The client uses React Query's refetchInterval to check the job state every 1.5 seconds, stopping once it hits a terminal state (completed, failed, or cleaned):
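The stop condition can be isolated in a pure function, which keeps it easy to test. This is a sketch rather than the demo's code: the isTerminal and pollInterval names are hypothetical, and the commented wiring assumes TanStack Query v5, where refetchInterval may be a function that receives the query.

```ts
const POLL_MS = 1500; // 1.5 seconds between polls

// Terminal states listed in this tutorial.
function isTerminal(state: string): boolean {
  return state === "completed" || state === "failed" || state === "cleaned";
}

// Returns the delay before the next poll, or false to stop polling.
function pollInterval(state: string | undefined): number | false {
  if (state === undefined) return POLL_MS; // no response yet, keep polling
  return isTerminal(state) ? false : POLL_MS;
}

// Assumed wiring (TanStack Query v5 passes the query to the callback):
// useQuery({
//   queryKey: ["job", jobId],
//   queryFn: () => fetchJob(jobId), // fetchJob is a placeholder
//   refetchInterval: (query) => pollInterval(query.state.data?.state),
// });
```

Returning false rather than a very large number tells the query library to stop scheduling refetches entirely once the job has settled.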
On the server side, each poll is a single SDK call:
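On the server, the handler can stay thin. The sketch below assumes a hypothetical get(id) method on the SDK client (the real method name is not shown here) and returns only what the polling client needs; the JobStatus shape is likewise made up for illustration.

```ts
// Hypothetical shapes: the real SDK's method name and job fields may differ.
interface Job {
  id: string;
  state: string;
}

interface ExtractionApi {
  get(id: string): Promise<Job>; // assumed name for "fetch one job"
}

interface JobStatus {
  id: string;
  state: string;
  done: boolean;
}

// One poll is one SDK call: no server-side looping or sleeping.
async function getJobStatus(api: ExtractionApi, id: string): Promise<JobStatus> {
  const job = await api.get(id);
  const done =
    job.state === "completed" || job.state === "failed" || job.state === "cleaned";
  return { id: job.id, state: job.state, done };
}
```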
What comes back
Once a job completes, the extraction response includes several layers of data depending on the source type.
Common fields are always present: MIME type, file size, an auto-generated caption, keywords, a thumbnail, and a chunk count.
Document fields appear for PDFs and similar files: page count, title, author, subject, and a rendered PDF link.
Media fields appear for images, audio, and video: dimensions, duration, codec, OCR text, a transcript link, and a screenshot link.
Web fields appear for URLs: page title, site name, description, favicon, an image, and links to the raw HTML and a reader-mode view.
The app renders all of this in a single result card, organized by category. You don't have to use every field; just pick the ones relevant to your use case.
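One way to model these layers is a result type with one required group and three optional ones. Every field name and the nesting itself are assumptions chosen to mirror the prose above; check the actual response before relying on any of them.

```ts
// Hypothetical field names; only the grouping follows the description.
interface CommonFields {
  mimeType: string;
  fileSize: number;
  caption: string;
  keywords: string[];
  thumbnailUrl: string;
  chunkCount: number;
}

interface DocumentFields {
  pageCount: number;
  title?: string;
  author?: string;
  subject?: string;
  pdfUrl?: string;
}

interface MediaFields {
  width?: number;
  height?: number;
  duration?: number;
  codec?: string;
  ocrText?: string;
  transcriptUrl?: string;
  screenshotUrl?: string;
}

interface WebFields {
  title?: string;
  siteName?: string;
  description?: string;
  faviconUrl?: string;
  imageUrl?: string;
  htmlUrl?: string;
  readerUrl?: string;
}

interface ExtractionResult {
  common: CommonFields;
  document?: DocumentFields;
  media?: MediaFields;
  web?: WebFields;
}

// List the categories a result card would render for this result.
function presentCategories(result: ExtractionResult): string[] {
  const categories = ["common"];
  if (result.document) categories.push("document");
  if (result.media) categories.push("media");
  if (result.web) categories.push("web");
  return categories;
}
```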
Paginating text chunks
Extracted text comes back in paginated chunks. The app fetches them in windows of 20 and lets the user load more on demand:
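The windowing arithmetic and a per-chunk label can be sketched as follows. The sequence, pageNumber, timeStart, and timeEnd names appear in this section; offset/limit paging, the text field, seconds as the time unit, and the nextWindow and chunkLabel helpers are assumptions.

```ts
const WINDOW_SIZE = 20;

interface Chunk {
  sequence: number;
  text: string; // assumed field name
  pageNumber?: number; // documents
  timeStart?: number; // media with transcripts (seconds assumed)
  timeEnd?: number;
}

// Compute the next window to request, or null once everything is loaded.
function nextWindow(
  loaded: number,
  total: number,
): { offset: number; limit: number } | null {
  if (loaded >= total) return null;
  return { offset: loaded, limit: Math.min(WINDOW_SIZE, total - loaded) };
}

// Short location label for a chunk, based on whichever fields are present.
function chunkLabel(chunk: Chunk): string {
  if (chunk.pageNumber !== undefined) return "page " + chunk.pageNumber;
  if (chunk.timeStart !== undefined && chunk.timeEnd !== undefined) {
    return chunk.timeStart + "s to " + chunk.timeEnd + "s";
  }
  return "#" + chunk.sequence;
}
```

For a job with 45 chunks, nextWindow yields offsets 0, 20, and 40 (the last window holding 5 chunks) and then null, which is the signal to hide the load-more control.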
Each chunk includes a sequence number and, depending on the source, a pageNumber (for documents) or timeStart/timeEnd (for media with transcripts).
Downloading attachments
If the extraction produced attachments (embedded images, tables, etc.), the app offers a one-click ZIP download:
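A sketch of the download response, under stated assumptions: FetchZip stands in for whatever SDK call returns the archive as a byte stream, and the file name pattern is hypothetical. It uses only the standard Response and ReadableStream types.

```ts
// Build download headers; the file name is sanitized to stay header-safe.
function zipHeaders(jobId: string): Record<string, string> {
  const safeId = jobId.replace(/[^A-Za-z0-9_-]/g, "_");
  return {
    "Content-Type": "application/zip",
    "Content-Disposition": 'attachment; filename="' + safeId + '-attachments.zip"',
  };
}

// Placeholder for the SDK call that yields the ZIP as a byte stream.
type FetchZip = (jobId: string) => Promise<ReadableStream<Uint8Array>>;

// Wrap the stream in a Response without buffering the archive in memory.
async function zipResponse(fetchZip: FetchZip, jobId: string): Promise<Response> {
  const body = await fetchZip(jobId);
  return new Response(body, { headers: zipHeaders(jobId) });
}
```

In a route file, the exported GET function would read the job ID from its route parameters and return the result of zipResponse.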
This is served through a Next.js route handler that streams the ZIP directly to the browser.
Reprocessing failed jobs
If a job fails (bad URL, transient error), the app shows a Reprocess button that retries extraction on the same source:
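In sketch form, with a hypothetical reprocess(id) method standing in for the SDK call. The guard mirrors the UI rule that only failed jobs offer the button; the canReprocess and reprocessJob helpers are illustrative names.

```ts
// Hypothetical shapes: the real SDK's method name and job fields may differ.
interface Job {
  id: string;
  state: string;
}

interface ExtractionApi {
  reprocess(id: string): Promise<Job>; // assumed name and signature
}

// The Reprocess button is only offered for failed jobs.
function canReprocess(job: Job): boolean {
  return job.state === "failed";
}

// Retry a failed job and hand the refreshed job back to the caller.
async function reprocessJob(api: ExtractionApi, job: Job): Promise<Job> {
  if (!canReprocess(job)) {
    throw new Error("Job " + job.id + " is " + job.state + ", not failed");
  }
  return api.reprocess(job.id);
}
```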
The job returns to a pending state and the polling loop picks it back up automatically.
Browsing recent jobs
The sidebar lists the 25 most recent extraction jobs using cursor-based pagination:
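A sketch of the list call and of the label each sidebar entry displays. The list method, its limit and cursor options, the nextCursor field, and the name and url job fields are all assumed; the page size of 25 and the name, URL, ID fallback come from this section.

```ts
const PAGE_SIZE = 25;

// Hypothetical shapes: the real SDK's types may differ.
interface JobSummary {
  id: string;
  state: string;
  name?: string;
  url?: string;
}

interface JobPage {
  items: JobSummary[];
  nextCursor?: string;
}

interface ExtractionApi {
  list(options: { limit: number; cursor?: string }): Promise<JobPage>; // assumed
}

// Sidebar label: job name if set, else source URL, else the job ID.
function jobLabel(job: JobSummary): string {
  return job.name || job.url || job.id;
}

// Fetch one page; pass the previous page's nextCursor to continue.
async function listRecentJobs(api: ExtractionApi, cursor?: string): Promise<JobPage> {
  return api.list({ limit: PAGE_SIZE, cursor });
}
```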
Each entry shows the job name (falling back to the URL, then the ID) alongside a state badge. Clicking an entry loads the full result.
Webhooks
The app includes a full webhook integration. You configure a delivery URL through the settings panel:
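The panel's save action might look like the sketch below. setWebhookUrl is a hypothetical method name, and treating a blank input as "clear" (null) is an assumption based on the API being able to both set and clear the URL.

```ts
interface WebhookSettingsApi {
  setWebhookUrl(url: string | null): Promise<void>; // assumed name and shape
}

// Map the text field to an API value: blank clears, http(s) URLs are kept.
function parseWebhookInput(raw: string): string | null {
  const value = raw.trim();
  if (value === "") return null;
  if (!/^https?:\/\/\S+$/i.test(value)) {
    throw new Error("Webhook URL must start with http:// or https://");
  }
  return value;
}

// Validate, save, and return what was stored so the UI can reflect it.
async function saveWebhookUrl(
  api: WebhookSettingsApi,
  raw: string,
): Promise<string | null> {
  const url = parseWebhookInput(raw);
  await api.setWebhookUrl(url);
  return url;
}
```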
Once configured, Exabase POSTs extraction events to that URL. The app can list delivery attempts for any job, inspect the request payload and response body for each attempt, and manually trigger a re-delivery:
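A sketch of the re-delivery flow. The listDeliveries and redeliver method names, and the DeliveryAttempt fields, are hypothetical stand-ins for the SDK's webhook calls.

```ts
// Hypothetical shapes: the real SDK's types may differ.
interface DeliveryAttempt {
  id: string;
  status: string;
  requestPayload: string;
  responseBody?: string;
}

interface WebhookApi {
  listDeliveries(jobId: string): Promise<DeliveryAttempt[]>; // assumed
  redeliver(jobId: string): Promise<void>; // assumed
}

// Re-fire the webhook for a job, then return the refreshed attempt list
// so the new attempt shows up in the logs panel right away.
async function redeliverAndRefresh(
  api: WebhookApi,
  jobId: string,
): Promise<DeliveryAttempt[]> {
  await api.redeliver(jobId);
  return api.listDeliveries(jobId);
}
```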
The webhook logs panel auto-refreshes while any delivery is in a pending or retrying state.
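That auto-refresh rule reduces to a predicate over the delivery list. In this sketch the Delivery shape and the 2-second interval are assumptions; only the pending and retrying status names come from the text.

```ts
const REFRESH_MS = 2000; // assumed; the text does not state the interval

interface Delivery {
  id: string;
  status: string; // "pending" and "retrying" come from the text
}

// True while at least one delivery has not settled.
function hasUnsettledDelivery(deliveries: Delivery[]): boolean {
  return deliveries.some((d) => d.status === "pending" || d.status === "retrying");
}

// Same contract as a polling interval option: a delay in ms, or false to stop.
function deliveriesRefetchInterval(
  deliveries: Delivery[] | undefined,
): number | false {
  if (!deliveries) return false; // nothing loaded yet, nothing to watch
  return hasUnsettledDelivery(deliveries) ? REFRESH_MS : false;
}
```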
Key Exabase APIs used in this example
| API | Purpose |
|---|---|
| create | Submit a URL for extraction |
| createFromFile | Submit an uploaded file for extraction |
| | Poll job state until completed or failed |
| | Paginate through extracted text chunks |
| | List recent extraction jobs |
| | Retry a failed extraction |
| | Download all attachments as a ZIP |
| | Read the current webhook URL |
| | Set or clear the webhook URL |
| | List delivery attempts for a job |
| | Inspect a single delivery attempt |
| | Manually re-fire a webhook for a job |
Technologies
Exabase: Extraction API for submitting URLs and files, polling job state, reading chunks, downloading attachments, and managing webhook settings
Next.js: Front-end and back-end
tRPC: Type-safe API layer between client and server
shadcn/ui: Accessible UI components
Tailwind CSS: Styling
Biome: Lint and format
Run it yourself
Clone the example, install its dependencies, and add EXABASE_API_KEY to .env.local. Start the development server, then open http://localhost:3000 and submit a URL or drop a file.
FAQ
What can I extract from?
Anything Exabase Extract supports: PDFs, DOCX files, images, audio, video, and web URLs. The demo handles all of them through the same two entry points (create for URLs, createFromFile for uploads).
How long does extraction take?
Small documents and simple web pages finish in a few seconds. Larger files, media with transcripts, or pages that need rendering take longer. The polling loop handles the wait automatically.
What are the attachments in the ZIP download?
These are embedded assets that Exabase extracts from the source: images, tables rendered as separate files, and similar artifacts. Not every source produces attachments.
Do I need webhooks?
No. Polling works fine for interactive use. Webhooks are useful when you want to trigger downstream processing (indexing, notifications, pipelines) without keeping a connection open. The demo includes both patterns so you can see how each works.
Could I use this as a starting point for something bigger?
That's the idea. The extraction pipeline here (submit, poll, read chunks, download attachments) is the same regardless of what you build on top of it. Swap the result viewer for a search indexer, a summarization step, a content migration tool, or anything else that needs clean text out of messy sources. The Exabase integration stays the same.