Build a Document Diff Viewer with Exabase

Compare any two documents (URLs, PDFs, uploads) and see exactly what changed. This tutorial walks through how the Document Diff Viewer example uses the Exabase Extraction API to turn raw files into structured text, then renders a word-level diff in the browser.


What it does

You enter two URLs or drop two files. The app submits both to the Exabase Extraction API concurrently, polls until each job finishes, then shows a side-by-side metadata comparison and a word-level text diff highlighting what changed between the two documents.

The interesting part is that there's no parsing code. No PDF library, no HTML scraper, no DOCX reader. Exabase Extract handles all of that – you send it a source and get clean, structured text back. The app just diffs the output.


What is Exabase Extract?

Exabase Extract turns any source into structured data with a single API call. Send a URL, file, image, audio, or video and get clean JSON back with tables, metadata, and text hierarchy intact. No parsers to write, no edge cases to handle, no post-processing pipeline to maintain.


How Exabase fits in

Submitting documents for extraction

The app supports two input modes: URLs and file uploads. Both go through the same Exabase Extract API, and each submission returns a job identified by an ID.

For URLs:

ts

const job = await getExabase().extract.create({ url, name });


For uploaded files:

ts

const buffer = Buffer.from(await file.arrayBuffer());
const job = await getExabase().extract.createFromFile({ file: buffer, name });

Each call kicks off an asynchronous extraction job. The API doesn't block — it returns immediately with a job object you can poll.
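
Because each call returns as soon as the job is created, the two sides of a comparison can be submitted at the same time. The sketch below shows one way to do that. `submitBoth`, the `Source` and `Job` types, and the injected `submit` callback are hypothetical names for illustration; they stand in for whatever wraps `extract.create` and `extract.createFromFile` in your app, and the real SDK types may differ.

```typescript
// Hypothetical shapes for illustration; not the SDK's own types.
type Source =
  | { kind: "url"; url: string; name: string }
  | { kind: "file"; file: Uint8Array; name: string };

interface Job {
  id: string; // assumed field name
}

type Submit = (source: Source) => Promise<Job>;

type SideResult =
  | { ok: true; jobId: string }
  | { ok: false; error: string };

// Submit both sides at once. Promise.allSettled keeps a failure on one side
// from discarding the job ID returned for the other side.
async function submitBoth(
  left: Source,
  right: Source,
  submit: Submit,
): Promise<[SideResult, SideResult]> {
  const settled = await Promise.allSettled([submit(left), submit(right)]);
  const toResult = (r: PromiseSettledResult<Job>): SideResult =>
    r.status === "fulfilled"
      ? { ok: true, jobId: r.value.id }
      : { ok: false, error: String(r.reason) };
  return [toResult(settled[0]), toResult(settled[1])];
}
```

Returning a result per side lets the UI show an error badge for one document while the other keeps polling.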


Polling job state

Extraction isn't instant (especially for large documents or URLs that need rendering), so the app polls each job independently until it reaches a terminal state:

ts

const job = await getExabase().extract.get({ jobId });


On the client side, the polling is handled by React Query's refetchInterval — it checks every 1.5 seconds and stops once the job is completed or failed:

ts

const jobQuery = trpc.extract.get.useQuery(
  { jobId: jobId ?? "" },
  {
    enabled: jobId !== null,
    refetchInterval: (query) => {
      const state = query.state.data?.data.state ?? null;
      return isTerminal(state) ? false : 1500;
    },
  },
);


Paginating through text chunks

Once both jobs complete, the app fetches the extracted text. Exabase returns text in paginated chunks (the app requests a window of 20 at a time), so the app walks through every page to assemble the full document text:

ts

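const CHUNK_WINDOW = 20; // number of chunks requested per page

// getChunks(jobId, start) is a helper from the example app that is not shown in this excerpt.
async function getAllChunkText(jobId: string): Promise<string> {
  const texts: string[] = [];
  let start = 1; // the first window starts at 1
  while (true) {
    const page = await getChunks(jobId, start);
    for (const chunk of page.items) {
      texts.push(chunk.text);
    }
    // A page shorter than the window means there is nothing left to fetch.
    if (page.items.length < CHUNK_WINDOW) break;
    start += CHUNK_WINDOW;
  }
  return texts.join("\n");
}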

Each chunk includes a sequence number, optional pageNumber (for documents), and optional timeStart/timeEnd (for media). The diff viewer only needs the text.
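
As a rough picture of the chunk shape described above, the sketch below models those fields and joins chunk text in sequence order. The `Chunk` interface and the `joinChunkText` helper are assumptions for illustration: the field names follow the prose here, not a published schema.

```typescript
// Assumed chunk shape based on the description above; not an official type.
interface Chunk {
  sequence: number;
  text: string;
  pageNumber?: number; // present for paged documents
  timeStart?: number; // present for audio and video
  timeEnd?: number;
}

// Order chunks by sequence number and join their text, ignoring other fields.
function joinChunkText(chunks: readonly Chunk[]): string {
  return [...chunks]
    .sort((a, b) => a.sequence - b.sequence)
    .map((chunk) => chunk.text)
    .join("\n");
}

const sample: Chunk[] = [
  { sequence: 2, text: "Second paragraph.", pageNumber: 1 },
  { sequence: 1, text: "First paragraph.", pageNumber: 1 },
];
console.log(joinChunkText(sample));
// First paragraph.
// Second paragraph.
```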


Computing the diff

With both documents' text assembled, the actual diffing is trivial — a single call to the diff library:

ts

import { diffWords } from "diff";

function computeWordDiff(a: string, b: string): WordChange[] {
  return diffWords(a, b);
}

The heavy lifting was getting clean, comparable text out of two arbitrary sources. That's the part Exabase handles.
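
To make "word-level diff" concrete, here is a small self-contained sketch built on a longest-common-subsequence table. It is not the `diff` package's algorithm, and it splits on whitespace only, so its output will differ from `diffWords` in details such as whitespace handling. `simpleWordDiff` and `DiffPart` are hypothetical names.

```typescript
interface DiffPart {
  value: string;
  added?: boolean;
  removed?: boolean;
}

function simpleWordDiff(a: string, b: string): DiffPart[] {
  const x = a.split(/\s+/).filter(Boolean);
  const y = b.split(/\s+/).filter(Boolean);

  // lcs[i][j] = length of the longest common subsequence of x[i..] and y[j..]
  const lcs: number[][] = Array.from({ length: x.length + 1 }, () =>
    new Array<number>(y.length + 1).fill(0),
  );
  for (let i = x.length - 1; i >= 0; i--) {
    for (let j = y.length - 1; j >= 0; j--) {
      lcs[i][j] =
        x[i] === y[j]
          ? lcs[i + 1][j + 1] + 1
          : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  // Walk both word lists, emitting unchanged, removed, and added words.
  const parts: DiffPart[] = [];
  let i = 0;
  let j = 0;
  while (i < x.length && j < y.length) {
    if (x[i] === y[j]) {
      parts.push({ value: x[i] });
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      parts.push({ value: x[i], removed: true });
      i++;
    } else {
      parts.push({ value: y[j], added: true });
      j++;
    }
  }
  while (i < x.length) parts.push({ value: x[i++], removed: true });
  while (j < y.length) parts.push({ value: y[j++], added: true });
  return parts;
}

const rendered = simpleWordDiff("the quick brown fox", "the slow brown fox jumps")
  .map((p) => (p.added ? `+${p.value}` : p.removed ? `-${p.value}` : p.value))
  .join(" ");
console.log(rendered); // the -quick +slow brown fox +jumps
```

A renderer only needs the `added` and `removed` flags to decide how to highlight each word.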


Rich metadata for free

Extraction doesn't just return text. Depending on the source type, you also get structured metadata – page count, author, title, MIME type, keywords, captions, dimensions, duration, and more. The app shows a side-by-side metadata comparison panel above the diff so you can see at a glance how the two documents differ structurally, not just textually.

The metadata is organized into three optional groups depending on the source kind: document (pages, author, title, subject), media (width, height, duration, codec, OCR text), and web (site name, description, favicon, reader view).
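
One way to drive a side-by-side panel from those groups is to flatten each document's metadata into labelled rows and pair them up. The `Metadata` type below is an assumption that mirrors only a few of the fields listed above, and `compareMetadata` is a hypothetical helper; the real response shape may differ.

```typescript
// Assumed shape: three optional groups, each with optional fields.
interface Metadata {
  document?: { pages?: number; author?: string; title?: string };
  media?: { width?: number; height?: number; duration?: number };
  web?: { siteName?: string; description?: string };
}

interface Row {
  field: string; // for example "document.pages"
  left: string;
  right: string;
  differs: boolean;
}

// Flatten to "group.field" -> string value, skipping absent values.
function flatten(meta: Metadata): Map<string, string> {
  const out = new Map<string, string>();
  for (const [group, fields] of Object.entries(meta)) {
    if (!fields) continue;
    for (const [key, value] of Object.entries(fields)) {
      if (value !== undefined && value !== null) {
        out.set(`${group}.${key}`, String(value));
      }
    }
  }
  return out;
}

// Build one row per field that appears on either side.
function compareMetadata(left: Metadata, right: Metadata): Row[] {
  const l = flatten(left);
  const r = flatten(right);
  const allKeys = Array.from(l.keys()).concat(Array.from(r.keys()));
  const fields = Array.from(new Set(allKeys)).sort();
  return fields.map((field) => {
    const lv = l.get(field) ?? "";
    const rv = r.get(field) ?? "";
    return { field, left: lv, right: rv, differs: lv !== rv };
  });
}
```

A field missing on one side shows up as an empty cell, which makes a PDF-versus-web-page comparison readable even though the two sources populate different groups.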


Key Exabase APIs used in this example

API                       Purpose
extract.create            Submit a URL for extraction
extract.createFromFile    Submit an uploaded file for extraction
extract.get               Poll job state until completed or failed
extract.getChunks         Paginate through extracted text chunks


Technologies

  • Exabase: Extraction API for submitting URLs and files, polling job state, and reading text chunks

  • Next.js: Front-end and back-end

  • tRPC: Type-safe API layer between client and server

  • shadcn/ui: Accessible UI components

  • Tailwind CSS: Styling

  • diff: Word-level text diffing


Run it yourself

bash

git clone https://github.com/futurebrowser/exabase-examples.git
cd


Add EXABASE_API_KEY to .env.local and start the app. Then open http://localhost:3000, enter two URLs or drop two files, and click Compare.


FAQ

Why use Exabase Extract instead of parsing documents yourself?

Document parsing is a deceptively deep problem. PDFs alone have dozens of edge cases – scanned pages, embedded fonts, multi-column layouts, tables. Add HTML, DOCX, images, and audio and you're maintaining a parser zoo. Exabase Extract handles all of that behind a single API call, so you can focus on what you're actually building – in this case, a diff viewer.


What file types does this support?

Anything Exabase Extract supports: PDFs, DOCX, images, audio, video, and web URLs. The diff viewer doesn't care about the source format – it only sees the extracted text. You could compare a PDF against a web page and it would work.


How long does extraction take?

It depends on the source. A small PDF might finish in a couple of seconds. A large document or a URL that needs rendering takes longer. The app polls automatically and shows a progress badge for each side, so you're never left guessing.
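
The stopping rule behind that polling fits in a few lines. The sketch below assumes job states arrive as strings and that `completed` and `failed` are the only terminal ones, as described in the polling section. `isTerminal` is called in the example's code but its body is not shown there, so this version is a guess at its behaviour, and `nextPollDelay` is a hypothetical name.

```typescript
const TERMINAL_STATES = ["completed", "failed"];
const POLL_INTERVAL_MS = 1500;

// A missing state (no response yet) is treated as not terminal.
function isTerminal(state: string | null): boolean {
  return state !== null && TERMINAL_STATES.includes(state);
}

// Returns the delay before the next poll, or false to stop polling.
function nextPollDelay(state: string | null): number | false {
  return isTerminal(state) ? false : POLL_INTERVAL_MS;
}

console.log(nextPollDelay(null)); // 1500
console.log(nextPollDelay("completed")); // false
```

Returning `false` rather than a very large number matches how a `refetchInterval` callback signals that polling should stop.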


Can I compare more than two documents?

The example is built for pairwise comparison. But since the Exabase Extract API is stateless per job, you could extend the UI to run multiple comparisons or build a matrix view — the API calls would be the same.
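
For a matrix view, the only new logic is enumerating the pairs to compare. A minimal sketch, assuming each document is identified by the job ID returned at submission; `allPairs` is a hypothetical helper.

```typescript
// Every unordered pair (i < j) from a list:
// n documents give n * (n - 1) / 2 comparisons.
function allPairs<T>(items: readonly T[]): Array<[T, T]> {
  const pairs: Array<[T, T]> = [];
  for (let i = 0; i < items.length; i++) {
    for (let j = i + 1; j < items.length; j++) {
      pairs.push([items[i], items[j]]);
    }
  }
  return pairs;
}

console.log(allPairs(["job-a", "job-b", "job-c"]));
// [ [ 'job-a', 'job-b' ], [ 'job-a', 'job-c' ], [ 'job-b', 'job-c' ] ]
```

Each document is still extracted only once; the pairs reuse the same extracted text.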


Could I use this pattern for something other than diffing?

The Exabase extraction pipeline is general-purpose. This example uses it for diffing, but the same submit → poll → read-chunks flow works for summarization, search indexing, content migration, compliance checks, or anything else where you need clean text out of messy documents. The extraction side is completely decoupled from what you do with the output.
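
The flow can be captured once and reused. The sketch below is written against a small hypothetical `ExtractClient` interface rather than the real SDK, so its method names and return shapes are assumptions. The polling delay and attempt cap are also illustrative defaults.

```typescript
// Hypothetical client interface; adapt it to the real SDK calls.
interface ExtractClient {
  submit(url: string): Promise<{ jobId: string }>;
  getState(jobId: string): Promise<string>;
  getText(jobId: string): Promise<string>;
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Submit, poll until a terminal state, then read the extracted text.
async function extractText(
  client: ExtractClient,
  url: string,
  { intervalMs = 1500, maxAttempts = 200 } = {},
): Promise<string> {
  const { jobId } = await client.submit(url);
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const state = await client.getState(jobId);
    if (state === "completed") return client.getText(jobId);
    if (state === "failed") throw new Error(`Extraction failed for ${url}`);
    await sleep(intervalMs);
  }
  throw new Error(`Timed out waiting for job ${jobId}`);
}
```

Whatever consumes the result, whether a summarizer, an indexer, or a diff, only sees the returned string.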