Build a Document Diff Viewer with Exabase
Compare any two documents (URLs, PDFs, uploads) and see exactly what changed. This tutorial walks through how the Document Diff Viewer example uses the Exabase Extraction API to turn raw files into structured text, then renders a word-level diff in the browser.
What it does
You enter two URLs or drop two files. The app submits both to the Exabase Extraction API concurrently, polls until each job finishes, then shows a side-by-side metadata comparison and a word-level text diff highlighting what changed between the two documents.
The interesting part is that there's no parsing code. No PDF library, no HTML scraper, no DOCX reader. Exabase Extract handles all of that – you send it a source and get clean, structured text back. The app just diffs the output.
What is Exabase Extract?
Exabase Extract turns any source into structured data with a single API call. Send a URL, file, image, audio, or video and get clean JSON back with tables, metadata, and text hierarchy intact. No parsers to write, no edge cases to handle, no post-processing pipeline to maintain.
How Exabase fits in
Submitting documents for extraction
The app supports two input modes: URLs and file uploads. Both go through the same Exabase Extract API and return a job ID.
For URLs:
ts
For uploaded files:
ts
Each call kicks off an asynchronous extraction job. The API doesn't block — it returns immediately with a job object you can poll.
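The submit step can be sketched as follows. This is a minimal illustration, not the example app's actual code: the `ExtractionClient` interface, its `submitUrl` and `submitFile` methods, the `Source` type, and the `"pending"` state name are all hypothetical stand-ins for whatever client wraps the Extraction API. It shows both inputs being routed to the matching call and started concurrently.

```typescript
// Hypothetical shapes: the real Exabase client and job types may differ.
type Source =
  | { kind: "url"; url: string }
  | { kind: "file"; name: string; bytes: Uint8Array };

interface Job {
  id: string;
  state: "pending" | "completed" | "failed"; // "pending" is an assumed name
}

// Stand-in for a client wrapping the Extraction API (assumed method names).
interface ExtractionClient {
  submitUrl(url: string): Promise<Job>;
  submitFile(name: string, bytes: Uint8Array): Promise<Job>;
}

// Route a source to the matching submit call; both resolve to a job object.
function submit(client: ExtractionClient, source: Source): Promise<Job> {
  return source.kind === "url"
    ? client.submitUrl(source.url)
    : client.submitFile(source.name, source.bytes);
}

// Start both extractions at once instead of one after the other.
async function submitPair(
  client: ExtractionClient,
  left: Source,
  right: Source,
): Promise<[Job, Job]> {
  return Promise.all([submit(client, left), submit(client, right)]);
}
```

Because `Promise.all` rejects as soon as either submission fails, a caller that wants to show one side's error while the other keeps going might prefer `Promise.allSettled`.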
Polling job state
Extraction isn't instant (especially for large documents or URLs that need rendering), so the app polls each job independently until it reaches a terminal state:
ts
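A generic polling loop along these lines might look like the sketch below. The `getJob` callback, the `Job` shape, and the `"pending"` state name are assumptions made for illustration; only the two terminal states, completed and failed, come from the description above.

```typescript
type JobState = "pending" | "completed" | "failed"; // "pending" is an assumed name

interface Job {
  id: string;
  state: JobState;
}

// A job is terminal once it has completed or failed.
function isTerminal(job: Job): boolean {
  return job.state === "completed" || job.state === "failed";
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Re-fetch the job until it is terminal, giving up after maxAttempts polls.
// getJob is a hypothetical function that returns the job's current state.
async function pollUntilDone(
  getJob: (id: string) => Promise<Job>,
  id: string,
  intervalMs = 1500,
  maxAttempts = 200,
): Promise<Job> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const job = await getJob(id);
    if (isTerminal(job)) return job;
    await sleep(intervalMs);
  }
  throw new Error(`Job ${id} did not finish after ${maxAttempts} polls`);
}
```

The attempt cap is a design choice of this sketch: without it, a job that never reaches a terminal state would keep the loop running forever.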
On the client side, the polling is handled by React Query's refetchInterval — it checks every 1.5 seconds and stops once the job is completed or failed:
ts
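The stop condition can be kept in a small pure function, which makes it easy to test outside React. In this sketch, `nextPollInterval`, `POLL_MS`, and the `"pending"` state name are illustrative, and the wiring shown in the trailing comment assumes TanStack Query v5, where `refetchInterval` may be a function of the query; older versions pass different arguments.

```typescript
type JobState = "pending" | "completed" | "failed"; // "pending" is an assumed name

const POLL_MS = 1500;

// Returns the delay before the next refetch, or false to stop polling.
function nextPollInterval(state: JobState | undefined): number | false {
  if (state === "completed" || state === "failed") return false;
  return POLL_MS; // still running, or no data loaded yet
}

// Possible wiring with TanStack Query v5 (fetchJob and jobId are hypothetical):
//
//   useQuery({
//     queryKey: ["job", jobId],
//     queryFn: () => fetchJob(jobId),
//     refetchInterval: (query) => nextPollInterval(query.state.data?.state),
//   });
```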
Paginating through text chunks
Once both jobs complete, the app fetches the extracted text. Exabase returns text in paginated chunks (the app uses a window of 20), so it walks through all pages to assemble the full document text:
ts
Each chunk includes a sequence number, optional pageNumber (for documents), and optional timeStart/timeEnd (for media). The diff viewer only needs the text.
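Walking the pages can be sketched like this. The pagination scheme here is an assumption: the sketch uses a hypothetical `fetchPage(offset, limit)` callback and treats a short page as the end of the data, whereas the real API may use cursors or a different parameter name for the window. The `sequence` and `text` field names are also illustrative.

```typescript
interface Chunk {
  sequence: number;    // position of the chunk in the document
  text: string;
  pageNumber?: number; // documents only
  timeStart?: number;  // media only
  timeEnd?: number;    // media only
}

// Hypothetical page fetcher: returns up to `limit` chunks starting at `offset`.
type FetchPage = (offset: number, limit: number) => Promise<Chunk[]>;

// Fetch pages of `limit` chunks until a short (or empty) page signals the end,
// then join the chunk texts in sequence order.
async function readAllText(fetchPage: FetchPage, limit = 20): Promise<string> {
  const chunks: Chunk[] = [];
  for (let offset = 0; ; offset += limit) {
    const page = await fetchPage(offset, limit);
    chunks.push(...page);
    if (page.length < limit) break;
  }
  return chunks
    .sort((a, b) => a.sequence - b.sequence)
    .map((chunk) => chunk.text)
    .join("\n");
}
```

Joining with a newline is a choice made for this sketch; whatever separator is used should be the same for both documents so it doesn't show up as a spurious difference.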
Computing the diff
With both documents' text assembled, the actual diffing is trivial — a single call to the diff library:
ts
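To show what a word-level diff produces, here is a dependency-free sketch based on a longest-common-subsequence table. It is a stand-in for illustration, not the library's algorithm or the app's code; in practice the `diff` package's `diffWords(oldText, newText)` does this in one call. The `Op` type and `diffWordsSketch` name are made up for this example.

```typescript
type Op = { type: "equal" | "added" | "removed"; value: string };

function diffWordsSketch(before: string, after: string): Op[] {
  const a = before.split(/\s+/).filter(Boolean);
  const b = after.split(/\s+/).filter(Boolean);

  // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
  const lcs: number[][] = [];
  for (let i = 0; i <= a.length; i++) {
    lcs.push(new Array<number>(b.length + 1).fill(0));
  }
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] =
        a[i] === b[j]
          ? lcs[i + 1][j + 1] + 1
          : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  // Walk both word lists, keeping shared words and flagging the rest.
  const ops: Op[] = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      ops.push({ type: "equal", value: a[i] });
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      ops.push({ type: "removed", value: a[i++] });
    } else {
      ops.push({ type: "added", value: b[j++] });
    }
  }
  while (i < a.length) ops.push({ type: "removed", value: a[i++] });
  while (j < b.length) ops.push({ type: "added", value: b[j++] });
  return ops;
}
```

For example, comparing "the quick brown fox" with "the slow brown fox jumps" marks "quick" as removed and "slow" and "jumps" as added, with the other words unchanged. The table grows with the product of the two word counts, so a dedicated library is the better choice for long documents.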
The heavy lifting was getting clean, comparable text out of two arbitrary sources. That's the part Exabase handles.
Rich metadata for free
Extraction doesn't just return text. Depending on the source type, you also get structured metadata – page count, author, title, MIME type, keywords, captions, dimensions, duration, and more. The app shows a side-by-side metadata comparison panel above the diff so you can see at a glance how the two documents differ structurally, not just textually.
The metadata is organized into three optional groups depending on the source kind: document (pages, author, title, subject), media (width, height, duration, codec, OCR text), and web (site name, description, favicon, reader view).
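A side-by-side comparison over those optional groups can be sketched as below. The field names inside `Metadata` are illustrative guesses based on the description above, not the actual response schema, and `compareMetadata` is a hypothetical helper rather than the app's own code.

```typescript
// Illustrative shape only: the real metadata fields may be named differently.
interface Metadata {
  document?: { pages?: number; author?: string; title?: string; subject?: string };
  media?: { width?: number; height?: number; duration?: number; codec?: string };
  web?: { siteName?: string; description?: string };
}

interface Row {
  field: string;
  left: string;
  right: string;
  differs: boolean;
}

// Flatten the optional groups into "group.key" -> value pairs.
function flatten(meta: Metadata): Map<string, string> {
  const out = new Map<string, string>();
  const groups = Object.entries(meta) as [string, Record<string, unknown> | undefined][];
  for (const [group, fields] of groups) {
    if (!fields) continue;
    for (const [key, value] of Object.entries(fields)) {
      if (value !== undefined) out.set(`${group}.${key}`, String(value));
    }
  }
  return out;
}

// One row per field present on either side; "-" marks a missing value.
function compareMetadata(left: Metadata, right: Metadata): Row[] {
  const l = flatten(left);
  const r = flatten(right);
  const keys = Array.from(l.keys()).concat(Array.from(r.keys()));
  const fields = Array.from(new Set(keys)).sort();
  return fields.map((field) => {
    const lv = l.get(field) ?? "-";
    const rv = r.get(field) ?? "-";
    return { field, left: lv, right: rv, differs: lv !== rv };
  });
}
```

Because every group and field is optional, comparing a PDF against a web page simply yields rows where one side shows "-".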
Key Exabase APIs used in this example
| API | Purpose |
|---|---|
| | Submit a URL for extraction |
| | Submit an uploaded file for extraction |
| | Poll job state until completed or failed |
| | Paginate through extracted text chunks |
Technologies
Exabase: Extraction API for submitting URLs and files, polling job state, and reading text chunks
Next.js: Front-end and back-end
tRPC: Type-safe API layer between client and server
shadcn/ui: Accessible UI components
Tailwind CSS: Styling
diff: Word-level text diffing
Run it yourself
bash
Add EXABASE_API_KEY to .env.local. Open http://localhost:3000, enter two URLs or drop two files, and click Compare.
FAQ
Why use Exabase Extract instead of parsing documents yourself?
Document parsing is a deceptively deep problem. PDFs alone have dozens of edge cases – scanned pages, embedded fonts, multi-column layouts, tables. Add HTML, DOCX, images, and audio and you're maintaining a parser zoo. Exabase Extract handles all of that behind a single API call, so you can focus on what you're actually building – in this case, a diff viewer.
What file types does this support?
Anything Exabase Extract supports: PDFs, DOCX, images, audio, video, and web URLs. The diff viewer doesn't care about the source format – it only sees the extracted text. You could compare a PDF against a web page and it would work.
How long does extraction take?
It depends on the source. A small PDF might finish in a couple of seconds. A large document or a URL that needs rendering takes longer. The app polls automatically and shows a progress badge for each side, so you're never left guessing.
Can I compare more than two documents?
The example is built for pairwise comparison. But since each Exabase Extract job is independent of the others, you could extend the UI to run multiple comparisons or build a matrix view; the API calls would be the same.
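For a matrix view, one approach is to extract each document once and then diff every unordered pair. The helper below is a hypothetical sketch of the pairing step only; `comparisonPairs` is not part of the example app.

```typescript
// All unordered pairs (i, j) with i < j.
// n documents yield n * (n - 1) / 2 comparisons.
function comparisonPairs<T>(docs: readonly T[]): Array<[T, T]> {
  const pairs: Array<[T, T]> = [];
  for (let i = 0; i < docs.length; i++) {
    for (let j = i + 1; j < docs.length; j++) {
      pairs.push([docs[i], docs[j]]);
    }
  }
  return pairs;
}
```

With three documents this gives three comparisons, and with ten it gives forty-five, so the number of diffs grows much faster than the number of extraction jobs.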
Could I use this pattern for something other than diffing?
The Exabase extraction pipeline is general-purpose. This example uses it for diffing, but the same submit → poll → read-chunks flow works for summarization, search indexing, content migration, compliance checks, or anything else where you need clean text out of messy documents. The extraction side is completely decoupled from what you do with the output.