Document Extraction Quickstart
Learn how to extract structured data from documents using the DocDigitizer Sync API.
Overview
The Document Extraction API transforms PDF documents into structured data in three stages:
- OCR - Convert images and scanned documents to text
- Classification - Automatically detect document type (invoice, receipt, etc.)
- Extraction - Pull out specific fields based on document type
Prerequisites
- DocDigitizer account with API key
- PDF document to process
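Both prerequisites can be checked locally before any request is sent. The following is a minimal sketch under stated assumptions: the environment variable name DOCDIGITIZER_API_KEY and the helpers load_api_key and looks_like_pdf are hypothetical and not part of the API. The PDF check only looks for the standard "%PDF-" signature at the start of the file, so it is a quick sanity test rather than the validation the service performs.

```python
import os


def load_api_key(env_var="DOCDIGITIZER_API_KEY"):
    """Read the API key from the environment (variable name is an assumption)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set {env_var} before calling the API")
    return key


def looks_like_pdf(path):
    """Return True if the file starts with the PDF signature bytes."""
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"
```

Keeping the key in the environment avoids committing it alongside example scripts.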
Basic Extraction
Step 1: Upload Document
curl -X POST https://apix.docdigitizer.com/sync \
-H "X-API-Key: your-api-key" \
-F "files=@document.pdf" \
-F "id=$(uuidgen)" \
-F "contextID=$(uuidgen)"
Step 2: Receive Extracted Data
{
  "StateText": "COMPLETED",
  "TraceId": "ABC1234",
  "Output": {
    "extractions": [
      {
        "documentType": "Invoice",
        "confidence": 0.95,
        "countryCode": "PT",
        "extraction": {
          "invoiceNumber": "INV-2024-001",
          "invoiceDate": "2024-01-15",
          "vendorName": "Acme Corp",
          "vendorNIF": "123456789",
          "totalAmount": 1250.00,
          "currency": "EUR",
          "lineItems": [
            {
              "description": "Product A",
              "quantity": 2,
              "unitPrice": 500.00,
              "total": 1000.00
            }
          ]
        }
      }
    ]
  }
}
Document Types
Invoice
Common fields extracted:
- invoiceNumber - Invoice reference number
- invoiceDate - Issue date
- dueDate - Payment due date
- vendorName - Seller/vendor name
- vendorNIF/vendorVAT - Tax identification number
- buyerName - Buyer/customer name
- totalAmount - Total including tax
- netAmount - Total before tax
- taxAmount - Tax/VAT amount
- currency - Currency code (EUR, USD, etc.)
- lineItems - Array of items with description, quantity, price
Receipt
Common fields extracted:
- merchantName - Store/merchant name
- date - Transaction date
- time - Transaction time
- total - Total amount
- paymentMethod - Cash, card, etc.
- items - Purchased items
Contract
Common fields extracted:
- parties - Contract parties
- effectiveDate - Start date
- expirationDate - End date
- value - Contract value
- terms - Key terms and conditions
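Since every extraction carries a documentType, results can be routed to type-specific handling. The sketch below is illustrative only: the handler functions, the HANDLERS table, and route_extractions are hypothetical names, and the "Receipt" and "Contract" type strings are assumed by analogy with the "Invoice" value in the sample response. Fields are read with dict.get because these are described as common fields, so any of them may be absent.

```python
def summarize_invoice(fields):
    number = fields.get("invoiceNumber", "unknown")
    total = fields.get("totalAmount")
    currency = fields.get("currency", "")
    items = fields.get("lineItems") or []
    return f"Invoice {number}: {total} {currency}, {len(items)} line item(s)"


def summarize_receipt(fields):
    return f"Receipt from {fields.get('merchantName', 'unknown')}: {fields.get('total')}"


def summarize_contract(fields):
    return (
        f"Contract effective {fields.get('effectiveDate')} "
        f"to {fields.get('expirationDate')}"
    )


# Assumed documentType strings mapped to their handlers.
HANDLERS = {
    "Invoice": summarize_invoice,
    "Receipt": summarize_receipt,
    "Contract": summarize_contract,
}


def route_extractions(result):
    """Return one summary line per extraction, flagging unknown types."""
    summaries = []
    for item in result.get("Output", {}).get("extractions", []):
        doc_type = item.get("documentType")
        handler = HANDLERS.get(doc_type)
        if handler is None:
            summaries.append(f"Unhandled type: {doc_type}")
        else:
            summaries.append(handler(item.get("extraction", {})))
    return summaries
```

A lookup table keeps the routing in one place, so supporting another document type means adding one handler and one entry.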
Multi-Document PDFs
A single PDF might contain multiple documents. The API handles this automatically:
{
  "Output": {
    "extractions": [
      {
        "documentType": "Invoice",
        "pageRange": { "start": 1, "end": 2 },
        "extraction": { ... }
      },
      {
        "documentType": "Invoice",
        "pageRange": { "start": 3, "end": 4 },
        "extraction": { ... }
      }
    ]
  }
}
Confidence Scores
Each extraction includes a confidence score (0-1):
| Score | Interpretation |
|---|---|
| 0.9 and above | High confidence - reliable extraction |
| 0.7 to below 0.9 | Good confidence - review recommended |
| 0.5 to below 0.7 | Moderate confidence - manual review suggested |
| Below 0.5 | Low confidence - manual verification required |
for extraction in result["Output"]["extractions"]:
    if extraction["confidence"] >= 0.9:
        # Auto-process
        process_extraction(extraction)
    elif extraction["confidence"] >= 0.7:
        # Queue for quick review
        queue_for_review(extraction)
    else:
        # Manual processing
        queue_for_manual(extraction)
Using Context IDs
The contextID parameter groups related documents:
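import uuid

import requests

# Process multiple documents from the same batch.
# API_KEY and pdf_files are expected to be defined by the caller.
batch_context = str(uuid.uuid4())
for pdf_file in pdf_files:
    # Open each file in a with block so the handle is closed after upload
    with open(pdf_file, "rb") as pdf:
        response = requests.post(
            "https://apix.docdigitizer.com/sync",
            headers={"X-API-Key": API_KEY},
            files={"files": pdf},
            data={
                "id": str(uuid.uuid4()),
                "contextID": batch_context,  # Same for all documents in batch
            },
        )
Specifying a Pipeline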
To select a specific processing pipeline, pass its identifier with the request. As a request header (X-DD-Pipeline):
curl -X POST https://apix.docdigitizer.com/sync \
-H "X-API-Key: your-api-key" \
-H "X-DD-Pipeline: MainPipelineWithOCR" \
-F "files=@document.pdf" \
-F "id=$(uuidgen)" \
-F "contextID=$(uuidgen)"
Or pass pipelineIdentifier in the form data:
data={
    "id": str(uuid.uuid4()),
    "contextID": str(uuid.uuid4()),
    "pipelineIdentifier": "MainPipelineWithOCR"
}
Timing Information
The response includes a breakdown of processing time:
{
  "Timers": {
    "DocIngester": {
      "total": 2345.67,
      "ForwardRequest": 2300.00,
      "DocWorker": {
        "total": 2250.00,
        "OCR": 1200.00,
        "Classification": 450.00,
        "Extraction": 600.00
      }
    }
  }
}
Control timer detail level with the X-DD-LogLevel header:
| Value | Detail Level |
|---|---|
| minimal | Basic timing (default) |
| medium | Component-level timing |
| full | Detailed breakdown |
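Because the Timers object is nested, a flat view can make it easier to log or compare stages. The following is a minimal sketch, assuming the nesting shown in the sample response above. flatten_timers is a hypothetical helper, and the exact shape of Timers at each X-DD-LogLevel value is not specified here, so the function simply recurses through whatever keys are present.

```python
def flatten_timers(timers, prefix=""):
    """Flatten nested timer dicts into {"A.B.C": value} pairs."""
    flat = {}
    for name, value in timers.items():
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            # Nested component: recurse and keep the dotted path.
            flat.update(flatten_timers(value, path))
        else:
            flat[path] = value
    return flat


# Sample copied from the response shown above.
sample = {
    "Timers": {
        "DocIngester": {
            "total": 2345.67,
            "ForwardRequest": 2300.00,
            "DocWorker": {
                "total": 2250.00,
                "OCR": 1200.00,
                "Classification": 450.00,
                "Extraction": 600.00,
            },
        }
    }
}

# Print entries from largest to smallest value.
flat = flatten_timers(sample["Timers"])
for path, value in sorted(flat.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{path}: {value}")
```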
Error Handling
Handle extraction errors gracefully:
result = response.json()
if result["StateText"] == "ERROR":
    trace_id = result["TraceId"]
    messages = result["Messages"]
    if "File must be a PDF" in messages:
        print("Please provide a valid PDF file")
    elif "Invalid or missing API key" in messages:
        print("Check your API key")
    else:
        print(f"Error: {messages}. Contact support with TraceId: {trace_id}")
else:
    # Process successful extraction
    process_extractions(result["Output"]["extractions"])
Best Practices
- Use UUIDs for tracking - Generate unique IDs for each document
- Group with contextID - Use the same contextID for related documents
- Check confidence scores - Don't auto-process low-confidence extractions
- Log TraceIds - Store TraceIds for debugging
- Handle multi-doc PDFs - Process all extractions in the array
- Implement retry logic - Retry on 5xx errors with backoff
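The retry recommendation above can be sketched as a small wrapper. This is one possible approach, not a prescribed client: call_with_retry is a hypothetical helper, the attempt count and delays are arbitrary defaults, and send is assumed to be any zero-argument callable that returns an object with a status_code attribute (such as a requests response). Only HTTP 5xx responses are retried; network exceptions are not handled here.

```python
import random
import time


def call_with_retry(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-5xx response or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        response = send()
        # Anything below 500 goes straight back to the caller, as does the
        # last response once the attempt budget is used up.
        if response.status_code < 500 or attempt == max_attempts:
            return response
        # Exponential backoff with jitter: base_delay * 1, 2, 4, ...
        # plus up to 25% extra so parallel clients do not retry in lockstep.
        delay = base_delay * 2 ** (attempt - 1)
        sleep(delay + random.uniform(0, delay * 0.25))
```

In practice, send would be a function that opens the PDF and posts it to the /sync endpoint. Opening the file inside send matters, because a file handle that has already been read would otherwise be uploaded empty on the second attempt.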
