engineering extraction

Structured JSON, not raw OCR text

Why dOCR returns typed, validated JSON instead of a wall of OCR text — and how vision plus an LLM guided by a document-type schema gets you fields you can trust.

Mara Chen

ML Engineer

4 min read

The first version of dOCR returned exactly what classic OCR gives you: a transcript. Every glyph on the page, top to bottom, as a single string. It demoed beautifully and was useless in production. Customers did not want the text on an invoice; they wanted the invoice number, the total, and the due date, as values their code could branch on. This post explains the three decisions behind dOCR's structured JSON output, and how we produce it.

Where raw OCR falls down

Raw OCR moves the hard problem one inch and calls it solved. You no longer have pixels, but you have a string, and the field you actually need is still buried somewhere inside it. The work that remains is the work that matters:

| Concern | Raw OCR text | Structured extraction |
| --- | --- | --- |
| Output | One blob of text, reading order | Typed fields keyed by name |
| Locating a field | Your regex / heuristics | Done for you by schema |
| Layout changes | Breaks your parser | Absorbed by the model |
| Types | Everything is a string | number, date, boolean, arrays |
| Missing data | Silent: you get an empty match | Explicit null + confidence |

The regex tax is the part teams underestimate. A vendor reissues their invoice template, the “Total” label moves into a sidebar, and the pattern that worked for two years quietly starts matching the subtotal instead. Nothing throws. You find out from a customer.
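To make that failure concrete, here is a minimal sketch of the kind of parser that breaks this way. The pattern, the function name, and both OCR transcripts are made up for illustration; they are not taken from any real customer code.

```javascript
// Hypothetical parser: find the "Total" label, skip any non-digits,
// and take the first amount that follows.
const TOTAL_PATTERN = /Total\D*([\d,]+\.\d{2})/;

function parseTotal(ocrText) {
  const match = ocrText.match(TOTAL_PATTERN);
  return match ? Number(match[1].replace(/,/g, '')) : null;
}

// Made-up transcripts of the same invoice before and after a template redesign.
const oldTemplate = 'Subtotal $1,200.00\nTax $84.50\nTotal $1,284.50';
const newTemplate = 'Total Due\nSubtotal $1,200.00\nTax $84.50\n$1,284.50';

console.log(parseTotal(oldTemplate)); // 1284.5
console.log(parseTotal(newTemplate)); // 1200 (the subtotal), and nothing throws
```

In the second transcript the label and its value are no longer adjacent in reading order, so the pattern skips ahead to the first amount it can find.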

Decision one: extract against a schema, not into a string

A document type in dOCR is a field schema you define — the names, types, and descriptions of what you want back. Extraction is the act of filling that schema, not transcribing the page. (We go deep on authoring these in custom document types.)
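As a rough picture of what "names, types, and descriptions" could look like, here is a sketch of an invoice schema as a plain object. The shape, the property names, and the emptyResult helper are assumptions made for illustration; dOCR's actual document-type format may differ.

```javascript
// Hypothetical schema shape; the real document-type format may differ.
const invoiceDocumentType = {
  name: 'invoice',
  fields: [
    { name: 'invoiceNumber', type: 'string', description: 'Identifier printed on the invoice' },
    { name: 'issueDate', type: 'date', description: 'Date the invoice was issued' },
    { name: 'dueDate', type: 'date', description: 'Date payment is due' },
    { name: 'total', type: 'number', description: 'Grand total including tax' },
    { name: 'currency', type: 'string', description: 'Three-letter currency code' },
  ],
};

// Every declared field present and set to null: the blank form that
// extraction then fills in.
function emptyResult(documentType) {
  return Object.fromEntries(documentType.fields.map((field) => [field.name, null]));
}
```

The point of the sketch is the framing: the schema exists before any document is read, and extraction fills it in rather than inventing its own structure.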

That reframing changes the request. You are not asking “what does this say”; you are asking “give me these specific values from this document”:

curl -X POST https://docr.dev/api/v1/extract \
  -H "Authorization: Bearer $DOCR_API_KEY" \
  -F "file=@invoice.pdf" \
  -F "documentType=invoice"

The same call from JavaScript, with no SDK to install:

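const form = new FormData();
form.append('file', file); // a Blob or File
form.append('documentType', 'invoice');

const res = await fetch('https://docr.dev/api/v1/extract', {
  method: 'POST',
  // process.env is a Node feature, so this runs server-side (Node 18+).
  headers: { Authorization: `Bearer ${process.env.DOCR_API_KEY}` },
  body: form,
});

// fetch only rejects on network failure, so check the HTTP status too.
if (!res.ok) throw new Error(`Extraction failed: HTTP ${res.status}`);

const { data } = await res.json();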

We accept PDF, JPG, PNG, BMP, WebP, and DOCX up to 10 MB and 15 pages, so the schema is the only thing you write, not a per-format parser.
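If you want to fail fast before uploading, a small pre-flight check along these lines may help. It is a sketch under assumptions: the checkUpload name is hypothetical, "10 MB" is read as 10 × 1024 × 1024 bytes, .jpeg is treated as equivalent to .jpg, and the page limit is not checked because counting pages needs a real parser.

```javascript
// Formats and size limit quoted in the post. Reading the limit as binary
// megabytes and accepting ".jpeg" are assumptions.
const ALLOWED_EXTENSIONS = ['pdf', 'jpg', 'jpeg', 'png', 'bmp', 'webp', 'docx'];
const MAX_BYTES = 10 * 1024 * 1024;

// Accepts anything with `name` and `size`, such as a browser File.
function checkUpload(file) {
  const problems = [];
  const extension = file.name.includes('.')
    ? file.name.split('.').pop().toLowerCase()
    : '';
  if (!ALLOWED_EXTENSIONS.includes(extension)) {
    problems.push(`unsupported format: ${extension || 'none'}`);
  }
  if (file.size > MAX_BYTES) {
    problems.push('file is larger than 10 MB');
  }
  return problems; // an empty array means the file passes these two checks
}
```

The server remains the authority on what it accepts; a check like this only saves a round trip for the obvious cases.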

Decision two: vision plus an LLM, guided by the schema

Under the hood, dOCR does not OCR-then-parse. A vision model reads the document as a layout — it sees that a number sitting under a “Total” header in the bottom-right is the total, the same way you do — and an LLM maps what it sees onto your field schema. Position, proximity, and label semantics all become signal instead of noise.

This is why layout drift stops being your problem. When the “Total” label moves, the model still understands the relationship; it was never matching a fixed coordinate or a regex anchor. The schema tells it what to find; the vision-plus-LLM stage figures out where. The output is JSON shaped to your fields:

{
  "data": {
    "invoiceNumber": "INV-2026-0481",
    "issueDate": "2026-05-02",
    "dueDate": "2026-06-01",
    "total": 1284.5,
    "currency": "USD",
    "lineItems": [
      { "description": "Annual license", "quantity": 1, "amount": 1200 },
      { "description": "Onboarding", "quantity": 1, "amount": 84.5 }
    ]
  }
}

Note that total is a number, not "$1,284.50". Stripping the currency symbol and the thousands separator, then parsing the result to a float, is exactly the brittle glue we wanted to delete from your codebase. The same approach scales to long documents; see how we handle multi-page PDFs.
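Because the amounts arrive as numbers, a sanity check that used to need string parsing becomes a few lines of arithmetic. The sketch below compares the line items with the total from the sample response above. The function names are hypothetical, and it assumes tax and discounts appear as line items, which will not hold for every invoice.

```javascript
// Compare in integer cents so float rounding cannot cause a false mismatch.
function toCents(amount) {
  return Math.round(amount * 100);
}

// True when the line item amounts add up to the stated total.
function lineItemsMatchTotal(data) {
  const sum = data.lineItems.reduce((cents, item) => cents + toCents(item.amount), 0);
  return sum === toCents(data.total);
}

// The `data` object from the sample response, trimmed to the relevant fields.
const sampleInvoice = {
  total: 1284.5,
  lineItems: [
    { description: 'Annual license', quantity: 1, amount: 1200 },
    { description: 'Onboarding', quantity: 1, amount: 84.5 },
  ],
};

console.log(lineItemsMatchTotal(sampleInvoice)); // true
```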

Decision three: make the JSON trustworthy

Structured output is only an upgrade if you can trust it. Three guarantees do that work.

Types are enforced. A field you declared as a number comes back as a number or as null — never as a half-parsed string. A date is normalized to ISO YYYY-MM-DD regardless of whether the document wrote it 05/02/26 or 2 May 2026.
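If you prefer to verify that contract at your own boundary rather than take it on faith, a check along these lines is one option. The expectedTypes map and both function names are hypothetical; this is a sketch of a client-side guard, not a description of how dOCR validates internally.

```javascript
// True only for real calendar dates written as YYYY-MM-DD.
function isIsoDate(value) {
  if (typeof value !== 'string' || !/^\d{4}-\d{2}-\d{2}$/.test(value)) return false;
  const parsed = new Date(`${value}T00:00:00Z`);
  // Impossible dates such as 2026-02-30 are either rejected or rolled over
  // by the engine, so require a valid date that round-trips unchanged.
  return !Number.isNaN(parsed.getTime()) && parsed.toISOString().slice(0, 10) === value;
}

// Hypothetical map of field name to declared type.
const expectedTypes = { invoiceNumber: 'string', issueDate: 'date', total: 'number' };

// Returns the names of fields whose values break their declared type.
function typeViolations(data, types) {
  return Object.entries(types)
    .filter(([name, type]) => {
      const value = data[name];
      if (value === null) return false; // null is always allowed: it means "missing"
      if (type === 'date') return !isIsoDate(value);
      if (type === 'number') return typeof value !== 'number' || !Number.isFinite(value);
      return typeof value !== type;
    })
    .map(([name]) => name);
}
```

An empty array from typeViolations means every checked field is either null or matches its declared type.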

Missing is explicit. If a field is not present in the document, you get null, not an empty string or a wrong guess. A receipt with no tax line returns "tax": null, and your code can decide what that means instead of silently treating it as zero.
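A short sketch of why the distinction matters downstream. The taxStatus helper is hypothetical, and tax is assumed to be a number field on a receipt document type.

```javascript
// Distinguish "the receipt had no tax line" from "the receipt printed 0.00".
function taxStatus(data) {
  if (data.tax === null) return 'not stated';
  if (data.tax === 0) return 'stated as zero';
  return 'stated';
}

console.log(taxStatus({ tax: null })); // not stated
console.log(taxStatus({ tax: 0 }));    // stated as zero
```

Writing data.tax ?? 0 would collapse the first two cases into one, which is the silent assumption the explicit null is meant to prevent.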

Confidence is attached. Every field carries a score so you can route the uncertain ones to a human:

const { data, confidence } = await res.json();

if (confidence.total < 0.9) {
  await queueForReview(data); // don't auto-post a shaky number to the ledger
}

In our own benchmarks across roughly 40,000 documents, fields we returned with confidence ≥ 0.9 matched ground truth 99.1% of the time. That threshold is the line between “post it automatically” and “ask a person” — and having it as a number, per field, is what lets teams automate the easy 90% without gambling on the hard 10%.
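Checking one field at a time gets tedious on larger schemas, so here is a sketch of routing a whole document. It assumes confidence is a flat object mapping field names to scores, as the confidence.total access above suggests. The function names are hypothetical, and the 0.9 default simply mirrors the benchmark threshold.

```javascript
// Names of all fields whose score falls below the threshold.
function fieldsNeedingReview(confidence, threshold = 0.9) {
  return Object.keys(confidence).filter((name) => confidence[name] < threshold);
}

// Auto-post only when every field clears the threshold; otherwise hand the
// document to a person along with the list of uncertain fields.
function route(data, confidence, threshold = 0.9) {
  const uncertain = fieldsNeedingReview(confidence, threshold);
  return uncertain.length === 0
    ? { action: 'auto-post', data }
    : { action: 'review', data, uncertain };
}
```

Returning the list of uncertain fields lets a review screen highlight only what needs a second look instead of asking someone to re-key the whole document.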

What we’d tell you to steal

If you are building extraction yourself: stop treating OCR as the deliverable — it is an intermediate representation, and the parsing that follows is where reliability goes to die. Define what you want as a typed schema and extract into it. Let a vision model handle layout so a template change is the model’s problem, not your on-call’s. And make uncertainty a first-class value — a null for missing and a confidence score per field — so downstream code can make decisions instead of assumptions. If you would rather not build any of that, going from the dashboard to the API takes about an afternoon, and the docs have the rest.