Document Extraction Quickstart
Learn how to extract structured data from documents using the DocDigitizer API.
Overview
The Document Extraction API transforms PDF or image documents into structured data through:
- OCR - Convert images and scanned documents to text
- Classification - Automatically detect document type and schema (invoice, receipt, etc.)
- Extraction - Pull out specific fields based on the detected schema
Prerequisites
- DocDigitizer account with API key
- PDF or image document to process
Basic Extraction
Step 1: Upload Document
curl --request POST \
  --url https://api.docdigitizer.com/v3/docingester/extract \
  --header 'X-API-Key: your-api-key-here' \
  --header 'accept: application/json' \
  --header 'content-type: multipart/form-data' \
  --form files='@invoice.pdf'
Step 2: Receive Extracted Data
{
  "stateText": "COMPLETED",
  "traceId": "ABC1234",
  "pipeline": "MainPipelineWithOCR",
  "numberPages": 2,
  "output": {
    "extractions": [
      {
        "schemaName": "Invoice",
        "confidence": 0.95,
        "pages": [1, 2],
        "extraction": {
          "invoiceNumber": "INV-2024-001",
          "invoiceDate": "2024-01-15",
          "vendorName": "Acme Corp",
          "vendorNIF": "123456789",
          "totalAmount": 1250.00,
          "currency": "EUR",
          "lineItems": [
            {
              "description": "Product A",
              "quantity": 2,
              "unitPrice": 500.00,
              "total": 1000.00
            }
          ]
        }
      }
    ]
  },
  "DocumentId": "f8b969cd-fbf5-4048-8d66-ea90283d4eb9",
  "Timestamp": "2026-05-04T14:44:43.3512947Z",
  "timers": {
    "DocIngester": {
      "Total": 2345.67
    }
  }
}
Document Types (examples)
Invoice
Common fields extracted:
- invoiceNumber - Invoice reference number
- invoiceDate - Issue date
- dueDate - Payment due date
- vendorName - Seller/vendor name
- vendorNIF / vendorVAT - Tax identification number
- buyerName - Buyer/customer name
- totalAmount - Total including tax
- netAmount - Total before tax
- taxAmount - Tax/VAT amount
- currency - Currency code (EUR, USD, etc.)
- lineItems - Array of items with description, quantity, price
Receipt
Common fields extracted:
- merchantName - Store/merchant name
- date - Transaction date
- time - Transaction time
- total - Total amount
- paymentMethod - Cash, card, etc.
- items - Purchased items
Contract
Common fields extracted:
- parties - Contract parties
- effectiveDate - Start date
- expirationDate - End date
- value - Contract value
- terms - Key terms and conditions
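The field lists above are described as common rather than guaranteed, so a consumer may want to read them defensively. The following is a minimal sketch, assuming the response shape shown in Step 2. The KEY_FIELDS mapping and the summarize_extractions helper are hypothetical names chosen for illustration and are not part of the API; the Receipt values in the sample are made up.

```python
# Hypothetical mapping from schemaName to the fields a caller cares about.
# Field names come from the lists above; the mapping itself is an assumption.
KEY_FIELDS = {
    "Invoice": ["invoiceNumber", "invoiceDate", "vendorName", "totalAmount", "currency"],
    "Receipt": ["merchantName", "date", "total"],
    "Contract": ["parties", "effectiveDate", "expirationDate"],
}


def summarize_extractions(result):
    """Return one summary dict per extraction, using None for absent fields."""
    summaries = []
    for item in result.get("output", {}).get("extractions", []):
        schema = item.get("schemaName")
        fields = item.get("extraction", {})
        wanted = KEY_FIELDS.get(schema, [])  # unknown schemas yield no fields
        summaries.append({
            "schema": schema,
            "pages": item.get("pages", []),
            "fields": {name: fields.get(name) for name in wanted},
        })
    return summaries


# Illustrative response fragment (not real API output).
sample = {
    "output": {
        "extractions": [
            {
                "schemaName": "Invoice",
                "pages": [1, 2],
                "extraction": {
                    "invoiceNumber": "INV-2024-001",
                    "totalAmount": 1250.00,
                    "currency": "EUR",
                },
            },
            {
                "schemaName": "Receipt",
                "pages": [3, 4],
                "extraction": {"merchantName": "Acme Corp", "total": 12.5},
            },
        ]
    }
}

for summary in summarize_extractions(sample):
    print(summary["schema"], summary["pages"], summary["fields"])
```

Using .get throughout means a missing field shows up as None instead of raising KeyError, which keeps one incomplete extraction from stopping the rest of the batch.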
Multi-Document PDFs
A single PDF might contain multiple documents. The API handles this automatically:
{
  "output": {
    "extractions": [
      {
        "schemaName": "Invoice",
        "pages": [1, 2],
        "extraction": { ... }
      },
      {
        "schemaName": "Receipt",
        "pages": [3, 4],
        "extraction": { ... }
      }
    ]
  }
}
Confidence Scores
Each extraction includes a confidence score (0-1):
| Score | Interpretation |
|---|---|
| 0.9+ | High confidence - reliable extraction |
| 0.7-0.9 | Good confidence - review recommended |
| 0.5-0.7 | Moderate confidence - manual review suggested |
| < 0.5 | Low confidence - manual verification required |
for extraction in result["output"]["extractions"]:
    if extraction["confidence"] >= 0.9:
        # Auto-process
        process_extraction(extraction)
    elif extraction["confidence"] >= 0.7:
        # Queue for quick review
        queue_for_review(extraction)
    else:
        # Manual processing
        queue_for_manual(extraction)
Timing Information
The response includes processing time breakdown:
{
  "timers": {
    "DocIngester": {
      "Total": 2345.67,
      "ForwardRequest": 2300.00,
      "DocWorker": {
        "total": 2250.00,
        "OCR": 1200.00,
        "Classification": 450.00,
        "Extraction": 600.00
      }
    }
  }
}
Control timer detail level with the X-DD-LogLevel header:
| Value | Detail Level |
|---|---|
| minimal | Basic timing (default) |
| medium | Component-level timing |
| full | Detailed breakdown |
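Because the depth of the timers object varies with the requested detail level, it can help to flatten it before logging or comparing stages. The following is a minimal sketch, assuming the nested shape shown above. The flatten_timers and slowest_stage helpers are hypothetical names, and the source does not say which header value produces which nesting, so the code makes no assumption about depth.

```python
def flatten_timers(timers, prefix=""):
    """Flatten nested timers into {"DocIngester.DocWorker.OCR": 1200.0, ...}."""
    flat = {}
    for name, value in timers.items():
        key = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            flat.update(flatten_timers(value, key))
        else:
            flat[key] = value
    return flat


def slowest_stage(timers, under):
    """Return the slowest non-total entry below the dotted path `under`, or None."""
    flat = flatten_timers(timers)
    candidates = {
        key: value
        for key, value in flat.items()
        if key.startswith(under + ".") and key.rsplit(".", 1)[-1].lower() != "total"
    }
    return max(candidates, key=candidates.get) if candidates else None


# Timers object copied from the example above.
sample_timers = {
    "DocIngester": {
        "Total": 2345.67,
        "ForwardRequest": 2300.00,
        "DocWorker": {
            "total": 2250.00,
            "OCR": 1200.00,
            "Classification": 450.00,
            "Extraction": 600.00,
        },
    }
}

# Header from the table above; it would be sent alongside X-API-Key.
headers = {"X-DD-LogLevel": "full"}

for key, value in flatten_timers(sample_timers).items():
    print(key, value)
print("slowest:", slowest_stage(sample_timers, "DocIngester.DocWorker"))
```

With the default minimal level only the top-level total may be present, in which case slowest_stage returns None instead of failing.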
Error Handling
Handle extraction errors gracefully:
# `response` is the HTTP response returned by the extract request
result = response.json()
if result["stateText"] == "ERROR":
    trace_id = result["traceId"]
    messages = result["messages"]
    if "File must be a PDF" in messages:
        print("Please provide a valid PDF file")
    elif "Invalid or missing API key" in messages:
        print("Check your API key")
    else:
        print(f"Error: {messages}. Contact support with TraceId: {trace_id}")
else:
    # Process successful extraction (the key is lowercase "output" in the response)
    process_extractions(result["output"]["extractions"])
Best Practices
- Check confidence scores - Don't auto-process low-confidence extractions
- Log TraceIds - Store TraceIds for debugging
- Handle multi-doc PDFs - Process all extractions in the array
- Implement retry logic - Retry on 5xx errors with backoff
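The retry recommendation above can be sketched as exponential backoff around any request function. This is a minimal sketch under stated assumptions: call_with_retry and FakeResponse are hypothetical names, the attempt count and delays are illustrative defaults rather than documented limits, and the only thing required of send is that it returns an object with a status_code attribute.

```python
import random
import time


def call_with_retry(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-5xx status or attempts run out.

    `send` is any zero-argument callable returning an object with a
    status_code attribute, for example a wrapper around the extract POST.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        response = send()
        # 4xx responses are returned immediately: retrying will not fix them.
        if response.status_code < 500 or attempt == max_attempts:
            return response
        # Exponential backoff with up to 10% jitter: about 1s, 2s, 4s, ...
        delay = base_delay * 2 ** (attempt - 1)
        sleep(delay + random.uniform(0, delay * 0.1))


class FakeResponse:
    """Stand-in for an HTTP response, used only to demonstrate the helper."""

    def __init__(self, status_code):
        self.status_code = status_code


# Simulate two server errors followed by a success, recording the delays
# instead of actually sleeping.
outcomes = iter([503, 502, 200])
delays = []
final = call_with_retry(lambda: FakeResponse(next(outcomes)), sleep=delays.append)
print(final.status_code, [round(d, 1) for d in delays])
```

In real use, send would wrap the upload request from Step 1, and the traceId of each failed attempt could be logged before the next try.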
