Document Extraction Quickstart

Learn how to extract structured data from documents using the DocDigitizer Sync API.

Overview

The Document Extraction API turns PDF documents into structured data in three stages:

OCR - Convert images and scanned documents to text
Classification - Automatically detect document type (invoice, receipt, etc.)
Extraction - Pull out specific fields based on document type

Prerequisites

DocDigitizer account with API key
PDF document to process

Basic Extraction

Step 1: Upload Document

curl -X POST https://apix.docdigitizer.com/sync \
  -H "X-API-Key: your-api-key" \
  -F "files=@document.pdf" \
  -F "id=$(uuidgen)" \
  -F "contextID=$(uuidgen)"

Step 2: Receive Extracted Data

{
  "StateText": "COMPLETED",
  "TraceId": "ABC1234",
  "Output": {
    "extractions": [
      {
        "documentType": "Invoice",
        "confidence": 0.95,
        "countryCode": "PT",
        "extraction": {
          "invoiceNumber": "INV-2024-001",
          "invoiceDate": "2024-01-15",
          "vendorName": "Acme Corp",
          "vendorNIF": "123456789",
          "totalAmount": 1250.00,
          "currency": "EUR",
          "lineItems": [
            {
              "description": "Product A",
              "quantity": 2,
              "unitPrice": 500.00,
              "total": 1000.00
            }
          ]
        }
      }
    ]
  }
}
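
As a minimal sketch of consuming this response, the snippet below walks Output.extractions and pulls a few invoice fields into a flat summary. The helper name summarize_extractions is hypothetical and not part of the API; the sample payload is a trimmed copy of the response shown above, and the field names assume the invoice shape from that example.

```python
import json

# Trimmed copy of the example response above.
SAMPLE_RESPONSE = json.loads("""
{
  "StateText": "COMPLETED",
  "TraceId": "ABC1234",
  "Output": {
    "extractions": [
      {
        "documentType": "Invoice",
        "confidence": 0.95,
        "countryCode": "PT",
        "extraction": {
          "invoiceNumber": "INV-2024-001",
          "totalAmount": 1250.00,
          "currency": "EUR",
          "lineItems": [
            {"description": "Product A", "quantity": 2,
             "unitPrice": 500.00, "total": 1000.00}
          ]
        }
      }
    ]
  }
}
""")


def summarize_extractions(result):
    """Return one flat summary dict per extraction (hypothetical helper)."""
    summaries = []
    for item in result.get("Output", {}).get("extractions", []):
        fields = item.get("extraction", {})
        line_items = fields.get("lineItems", [])
        summaries.append({
            "documentType": item.get("documentType"),
            "confidence": item.get("confidence"),
            "invoiceNumber": fields.get("invoiceNumber"),
            "totalAmount": fields.get("totalAmount"),
            "currency": fields.get("currency"),
            # Sum of the per-line totals reported by the API
            "lineItemsTotal": sum(li.get("total", 0) for li in line_items),
        })
    return summaries


summaries = summarize_extractions(SAMPLE_RESPONSE)
print(summaries[0])
```

Using .get() throughout keeps the helper tolerant of fields that are absent for a given document, which seems likely since the field lists in this guide are described as "common" rather than guaranteed.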

Document Types

Invoice

Common fields extracted:

invoiceNumber - Invoice reference number
invoiceDate - Issue date
dueDate - Payment due date
vendorName - Seller/vendor name
vendorNIF / vendorVAT - Tax identification number
buyerName - Buyer/customer name
totalAmount - Total including tax
netAmount - Total before tax
taxAmount - Tax/VAT amount
currency - Currency code (EUR, USD, etc.)
lineItems - Array of items with description, quantity, price

Receipt

Common fields extracted:

merchantName - Store/merchant name
date - Transaction date
time - Transaction time
total - Total amount
paymentMethod - Cash, card, etc.
items - Purchased items

Contract

Common fields extracted:

parties - Contract parties
effectiveDate - Start date
expirationDate - End date
value - Contract value
terms - Key terms and conditions
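
Because each document type carries different fields, client code will often branch on documentType. The sketch below shows one way to do that with a dispatch table. The handler functions are hypothetical, and the type strings "Receipt" and "Contract" are assumed by analogy with the "Invoice" value seen in the example response; check the values your account actually returns.

```python
def handle_invoice(fields):
    # Uses the invoice fields listed above
    return (f"invoice {fields.get('invoiceNumber')} "
            f"for {fields.get('totalAmount')} {fields.get('currency')}")


def handle_receipt(fields):
    # Uses the receipt fields listed above
    return f"receipt from {fields.get('merchantName')} for {fields.get('total')}"


def handle_contract(fields):
    # Uses the contract fields listed above
    return f"contract effective {fields.get('effectiveDate')}"


# Assumed documentType values; only "Invoice" appears in the sample response.
HANDLERS = {
    "Invoice": handle_invoice,
    "Receipt": handle_receipt,
    "Contract": handle_contract,
}


def route_extraction(item):
    """Send one entry of Output.extractions to the matching handler."""
    doc_type = item.get("documentType")
    handler = HANDLERS.get(doc_type)
    if handler is None:
        return f"unhandled document type: {doc_type}"
    return handler(item.get("extraction", {}))


example = {
    "documentType": "Invoice",
    "extraction": {
        "invoiceNumber": "INV-2024-001",
        "totalAmount": 1250.0,
        "currency": "EUR",
    },
}
print(route_extraction(example))
```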

Multi-Document PDFs

A single PDF might contain multiple documents. The API handles this automatically, returning one entry in the extractions array per detected document, each with its own pageRange:

{
  "Output": {
    "extractions": [
      {
        "documentType": "Invoice",
        "pageRange": { "start": 1, "end": 2 },
        "extraction": { ... }
      },
      {
        "documentType": "Invoice",
        "pageRange": { "start": 3, "end": 4 },
        "extraction": { ... }
      }
    ]
  }
}
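
A small sketch of iterating over a multi-document result follows. The helper describe_documents is hypothetical, and it assumes that pageRange is inclusive at both ends, which the example above (pages 1-2, then 3-4) suggests but does not state.

```python
def describe_documents(result):
    """Return (index, documentType, start, end, page_count) per extraction."""
    rows = []
    extractions = result.get("Output", {}).get("extractions", [])
    for index, item in enumerate(extractions, start=1):
        page_range = item.get("pageRange") or {}
        start = page_range.get("start")
        end = page_range.get("end")
        # Assumes an inclusive range; None when the range is missing
        if start is not None and end is not None:
            page_count = end - start + 1
        else:
            page_count = None
        rows.append((index, item.get("documentType"), start, end, page_count))
    return rows


sample = {
    "Output": {
        "extractions": [
            {"documentType": "Invoice", "pageRange": {"start": 1, "end": 2}},
            {"documentType": "Invoice", "pageRange": {"start": 3, "end": 4}},
        ]
    }
}

for row in describe_documents(sample):
    print(row)
```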

Confidence Scores

Each extraction includes a confidence score (0-1):

Score	Interpretation
0.9+	High confidence - reliable extraction
0.7-0.9	Good confidence - review recommended
0.5-0.7	Moderate confidence - manual review suggested
< 0.5	Low confidence - manual verification required

for extraction in result["Output"]["extractions"]:
    if extraction["confidence"] >= 0.9:
        # Auto-process
        process_extraction(extraction)
    elif extraction["confidence"] >= 0.7:
        # Queue for quick review
        queue_for_review(extraction)
    else:
        # Manual processing
        queue_for_manual(extraction)

Using Context IDs

The contextID parameter groups related documents:

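import uuid

import requests

API_KEY = "your-api-key"
pdf_files = ["invoice-1.pdf", "invoice-2.pdf"]  # Example paths

# Process multiple documents from the same batch
batch_context = str(uuid.uuid4())

for pdf_file in pdf_files:
    # Open each file in a context manager so the handle is closed after upload
    with open(pdf_file, "rb") as f:
        response = requests.post(
            "https://apix.docdigitizer.com/sync",
            headers={"X-API-Key": API_KEY},
            files={"files": f},
            data={
                "id": str(uuid.uuid4()),  # Unique per document
                "contextID": batch_context,  # Same for all documents in batch
            },
        )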

Specifying a Pipeline

Select a specific processing pipeline with the X-DD-Pipeline header:

curl -X POST https://apix.docdigitizer.com/sync \
  -H "X-API-Key: your-api-key" \
  -H "X-DD-Pipeline: MainPipelineWithOCR" \
  -F "files=@document.pdf" \
  -F "id=$(uuidgen)" \
  -F "contextID=$(uuidgen)"

Or pass pipelineIdentifier in the form data:

data={
    "id": str(uuid.uuid4()),
    "contextID": str(uuid.uuid4()),
    "pipelineIdentifier": "MainPipelineWithOCR"
}

Timing Information

The response includes a breakdown of processing time:

{
  "Timers": {
    "DocIngester": {
      "total": 2345.67,
      "ForwardRequest": 2300.00,
      "DocWorker": {
        "total": 2250.00,
        "OCR": 1200.00,
        "Classification": 450.00,
        "Extraction": 600.00
      }
    }
  }
}

Control the timer detail level with the X-DD-LogLevel header:

Value	Detail Level
`minimal`	Basic timing (default)
`medium`	Component-level timing
`full`	Detailed breakdown
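
To see where time goes, the timers can be turned into per-stage percentages. The sketch below is an illustration under assumptions: stage_shares is a hypothetical helper, and it expects the nested Timers shape from the example above, which may differ depending on the X-DD-LogLevel value.

```python
def stage_shares(timers):
    """Return each DocWorker stage as a percentage of the DocWorker total."""
    worker = timers.get("DocIngester", {}).get("DocWorker", {})
    total = worker.get("total")
    if not total:
        # No DocWorker timing available at this detail level
        return {}
    return {
        name: round(100 * value / total, 1)
        for name, value in worker.items()
        if name != "total"
    }


sample_timers = {
    "DocIngester": {
        "total": 2345.67,
        "ForwardRequest": 2300.00,
        "DocWorker": {
            "total": 2250.00,
            "OCR": 1200.00,
            "Classification": 450.00,
            "Extraction": 600.00,
        },
    }
}

shares = stage_shares(sample_timers)
print(shares)
```

For the example values, OCR accounts for a little over half of the DocWorker time.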

Error Handling

Handle extraction errors gracefully:

result = response.json()

if result["StateText"] == "ERROR":
    trace_id = result["TraceId"]
    messages = result["Messages"]

    if "File must be a PDF" in messages:
        print("Please provide a valid PDF file")
    elif "Invalid or missing API key" in messages:
        print("Check your API key")
    else:
        print(f"Error: {messages}. Contact support with TraceId: {trace_id}")
else:
    # Process successful extraction
    process_extractions(result["Output"]["extractions"])

Best Practices

Use UUIDs for tracking - Generate unique IDs for each document
Group with contextID - Use the same contextID for related documents
Check confidence scores - Don't auto-process low-confidence extractions
Log TraceIds - Store TraceIds for debugging
Handle multi-doc PDFs - Process all extractions in the array
Implement retry logic - Retry on 5xx errors with backoff
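
The retry recommendation above can be sketched as follows. This is one possible approach, not an official client: post_with_retry is a hypothetical helper, the attempt count and delays are arbitrary defaults, and send stands for any zero-argument callable that performs the request and returns an object with a status_code attribute (for example, a function wrapping requests.post). A stand-in response class is used here so the example runs without network access.

```python
import time


def post_with_retry(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-5xx response or attempts run out.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between attempts.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    response = None
    for attempt in range(max_attempts):
        response = send()
        if response.status_code < 500:
            return response  # Success or a client error; do not retry
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # Exponential backoff
    return response  # Last 5xx response after exhausting all attempts


class FakeResponse:
    """Stand-in for an HTTP response, used only for this demonstration."""

    def __init__(self, status_code):
        self.status_code = status_code


# Simulate two server errors followed by a success, recording the delays
statuses = iter([503, 502, 200])
delays = []
final = post_with_retry(lambda: FakeResponse(next(statuses)), sleep=delays.append)
print(final.status_code, delays)
```

When wrapping a real upload, reopen the file (or rewind it) inside send so that each attempt sends the full document.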
