Extracting data from multi-page PDFs
A 12-page invoice is not 12 documents — it is one document with fields scattered across pages and line items that overflow. Here is how dOCR reads the whole thing and returns one typed JSON object.
Mara Chen
ML Engineer
The single-page case is easy to reason about: one image goes in, one set of fields comes out. Real documents are rarely that tidy. An invoice header lives on page 1, the line items spill across pages 3 and 4, and the totals and signature block sit on the last page. The naive approach — run each page independently and stitch the results together — produces duplicate headers, orphaned table rows, and totals that don’t reconcile.
dOCR treats a multi-page PDF as one document with one schema, not a folder of pages. Here is how that works, and the limits you should design around.
Why 15 pages and 10MB
Every page is rendered, passed through AI vision, and reasoned over by an LLM against the document type you chose. That work scales with page count and pixel count, and both have a ceiling where latency and accuracy stop being worth it. So the API enforces two hard limits: 15 pages and 10MB per request, across PDF, JPG/JPEG, PNG, BMP, WebP, and DOCX.
The page limit keeps a single extraction inside a predictable latency budget. The size limit mostly bites scanned PDFs — a 300 DPI color scan of a dozen pages can blow past 10MB on its own. The two limits are independent: a 14-page document that happens to weigh 11MB is still rejected, and a 3MB file with 18 pages is too.
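A client-side pre-flight check can catch both limits before a request is spent. The sketch below is illustrative and not part of dOCR: `checkLimits` is a hypothetical helper, the page count is assumed to come from whatever PDF library you already use, and 10MB is assumed to mean 10 × 1024 × 1024 bytes (the API may count decimal megabytes instead).

```javascript
// Limits as stated in this article. MAX_BYTES assumes binary megabytes;
// adjust it if the API turns out to count 10,000,000 bytes.
const MAX_PAGES = 15;
const MAX_BYTES = 10 * 1024 * 1024;

// Hypothetical helper: returns every limit the file would break, so the
// caller can report both problems at once. The two checks are independent.
function checkLimits({ pageCount, sizeBytes }) {
  const problems = [];
  if (pageCount > MAX_PAGES) {
    problems.push(`too many pages: ${pageCount} > ${MAX_PAGES}`);
  }
  if (sizeBytes > MAX_BYTES) {
    problems.push(`file too large: ${sizeBytes} bytes > ${MAX_BYTES}`);
  }
  return { ok: problems.length === 0, problems };
}

const MB = 1024 * 1024;
console.log(checkLimits({ pageCount: 14, sizeBytes: 11 * MB })); // fails on size only
console.log(checkLimits({ pageCount: 18, sizeBytes: 3 * MB })); // fails on pages only
```

Returning a list of problems rather than a single boolean lets you tell the user whether to compress the file, split it, or both.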
One schema, all the pages
When dOCR receives a multi-page PDF, it does not score each page in isolation. The full document is in context together, which is what lets it answer the questions that single-page tools can’t:
- Fields that appear once, anywhere. An `invoiceNumber` printed only in the page-1 header is a top-level field, not something that repeats per page. dOCR finds it wherever it lives and returns it once.
- Tables that span pages. Line items that start on page 3 and continue onto page 4 are read as one continuous array. The “continued…” row and the repeated column headers are recognized as table furniture, not data.
- Fields that depend on each other. A `total` on the last page can be reconciled against the line items it sums, because both are visible to the same pass.
The result is still what dOCR always returns — structured JSON, not raw OCR text. You define the shape; the document spans pages, the schema does not.
Calling it
There is nothing multi-page-specific in the request. You send the file and the document type, same as a single page:
curl -X POST https://docr.dev/api/v1/extract \
  -H "Authorization: Bearer $DOCR_API_KEY" \
  -F "file=@quarterly-invoice.pdf" \
  -F "documentType=invoice"
Or from JavaScript:
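import { readFile } from 'node:fs/promises';
// Node 18+ (ES module): load the PDF from disk. In a browser, pdfBlob would come from a file input.
const pdfBlob = new Blob([await readFile('quarterly-invoice.pdf')], { type: 'application/pdf' });
const form = new FormData();
form.append('file', pdfBlob, 'quarterly-invoice.pdf');
form.append('documentType', 'invoice');
const res = await fetch('https://docr.dev/api/v1/extract', {
  method: 'POST',
  headers: { Authorization: `Bearer ${process.env.DOCR_API_KEY}` },
  body: form,
});
// Stop on a non-2xx status instead of destructuring an error body.
if (!res.ok) throw new Error(`Extraction failed with HTTP ${res.status}`);
const { data } = await res.json();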
The response carries the header fields once and the repeating rows as an array — regardless of how many pages each one was spread across:
{
  "data": {
    "invoiceNumber": "INV-20418",
    "issueDate": "2026-03-31",
    "vendor": "Northwind Paper Co.",
    "currency": "USD",
    "lineItems": [
      { "description": "A4 copy paper, 80gsm", "quantity": 40, "unitPrice": 6.5, "amount": 260.0 },
      { "description": "Toner cartridge, black", "quantity": 6, "unitPrice": 89.0, "amount": 534.0 },
      { "description": "Binder clips, assorted", "quantity": 12, "unitPrice": 3.25, "amount": 39.0 }
    ],
    "subtotal": 833.0,
    "tax": 66.64,
    "total": 899.64
  }
}
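Because the line items and the totals come back in one object, a client-side sanity check is straightforward. The sketch below is an example under stated assumptions, not a description of what dOCR does internally: `reconcile` and `toCents` are hypothetical helpers, the field names are taken from the sample response, and amounts are compared in integer cents to sidestep floating-point rounding.

```javascript
// Convert a decimal amount to integer cents so comparisons are exact.
const toCents = (amount) => Math.round(amount * 100);

// Hypothetical helper: checks that each row, the subtotal, and the total
// agree with one another. Returns a list of human-readable mismatches.
function reconcile(data) {
  const errors = [];
  let sum = 0;
  for (const item of data.lineItems) {
    if (toCents(item.quantity * item.unitPrice) !== toCents(item.amount)) {
      errors.push(`row does not multiply out: ${item.description}`);
    }
    sum += toCents(item.amount);
  }
  if (sum !== toCents(data.subtotal)) {
    errors.push('line items do not sum to subtotal');
  }
  if (toCents(data.subtotal) + toCents(data.tax) !== toCents(data.total)) {
    errors.push('subtotal + tax does not equal total');
  }
  return errors;
}

// The numeric fields from the sample response.
const sample = {
  lineItems: [
    { description: 'A4 copy paper, 80gsm', quantity: 40, unitPrice: 6.5, amount: 260.0 },
    { description: 'Toner cartridge, black', quantity: 6, unitPrice: 89.0, amount: 534.0 },
    { description: 'Binder clips, assorted', quantity: 12, unitPrice: 3.25, amount: 39.0 },
  ],
  subtotal: 833.0,
  tax: 66.64,
  total: 899.64,
};
console.log(reconcile(sample)); // an empty array means everything agrees
```

A non-empty result is a cue to route the document to manual review rather than proof that the extraction is wrong, since the source invoice itself may contain an arithmetic error.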
Scanned vs. digital PDFs
Digital PDFs carry a real text layer, so the vision pass has clean glyphs to work with and accuracy is highest. Scanned PDFs are just images of pages — no text layer — and that is where the 10MB ceiling and image quality matter most.
A few things help with scans:
| Situation | What to do |
|---|---|
| Scan over 10MB | Re-export at 200–300 DPI; higher rarely improves accuracy |
| Document over 15 pages | Split into chunks of 15 pages or fewer and merge the JSON |
| Mixed orientation | Deskew or auto-rotate before upload; sideways pages read poorly |
| Faint or low-contrast scan | Scan in grayscale at higher contrast rather than color |
For born-digital files you generally don’t need to do anything — keep the original PDF rather than printing it back to an image, which throws the text layer away.
Tips for large documents
Split very large docs at logical boundaries. If you have a 40-page statement, don’t cut blindly every 15 pages — split where a section ends so a line-item table isn’t severed mid-row. Then concatenate the lineItems arrays from each response.
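One way to plan such a split is to pack whole sections into chunks greedily. The sketch below assumes you already know where sections begin and end (the boundaries are input you supply, not something dOCR reports), that the sections are contiguous and in document order, and that each response has the `data.lineItems` shape shown earlier. `planChunks` and `mergeLineItems` are hypothetical helpers, and the actual page splitting is left to your PDF tool.

```javascript
// Greedily pack whole sections into chunks of at most maxPages pages.
// `sections` is a list of [firstPage, lastPage] pairs, 1-indexed and inclusive.
function planChunks(sections, maxPages = 15) {
  const chunks = [];
  let current = null; // [firstPage, lastPage] of the chunk being built
  for (const [first, last] of sections) {
    // A single section longer than the limit has to be cut inside itself.
    if (last - first + 1 > maxPages) {
      if (current) {
        chunks.push(current);
        current = null;
      }
      for (let p = first; p <= last; p += maxPages) {
        chunks.push([p, Math.min(p + maxPages - 1, last)]);
      }
      continue;
    }
    if (current && last - current[0] + 1 <= maxPages) {
      current[1] = last; // the section still fits, so extend the chunk
    } else {
      if (current) chunks.push(current);
      current = [first, last]; // start a new chunk at the section boundary
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Concatenate the lineItems arrays from each chunk's response, in order.
function mergeLineItems(responses) {
  return responses.flatMap((response) => response.data.lineItems ?? []);
}

// A 40-page statement with four sections becomes three requests.
console.log(planChunks([[1, 4], [5, 12], [13, 26], [27, 40]]));
// → [ [ 1, 12 ], [ 13, 26 ], [ 27, 40 ] ]
```

Each chunk is extracted on its own, so fields printed once (a header on the first page, totals on the last) will presumably appear only in the response for the chunk that contains them. Decide up front which response you take those from.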
Choose the document type that matches the content, not the file. A PDF that bundles a cover letter, an invoice, and a receipt will extract most cleanly if you target the dominant content with the right schema. If you find yourself wishing one schema covered three layouts, that is usually a sign you want two document types and two calls.
Start in the Dashboard, then move to the API. Upload a representative multi-page sample in the Dashboard, confirm the fields and the line-item array come back the way you expect, and only then wire up the endpoint. The Dashboard and the API use the same schema, so what you confirm in one carries over to the other.
The mental model that keeps things simple: a multi-page PDF is one document, one schema, one JSON object. dOCR’s job is to make the page count invisible — you describe the fields you want, and the number of pages they were printed on stops being your problem.