Document Extraction Quickstart

Learn how to extract structured data from documents using the DocDigitizer Sync API.

Overview

The Document Extraction API transforms PDF documents into structured data in three stages:

  1. OCR - Convert images and scanned documents to text
  2. Classification - Automatically detect document type (invoice, receipt, etc.)
  3. Extraction - Pull out specific fields based on document type

Prerequisites

  • DocDigitizer account with API key
  • PDF document to process

Basic Extraction

Step 1: Upload Document

curl -X POST https://apix.docdigitizer.com/sync \
  -H "X-API-Key: your-api-key" \
  -F "files=@document.pdf" \
  -F "id=$(uuidgen)" \
  -F "contextID=$(uuidgen)"

Step 2: Receive Extracted Data

{
  "StateText": "COMPLETED",
  "TraceId": "ABC1234",
  "Output": {
    "extractions": [
      {
        "documentType": "Invoice",
        "confidence": 0.95,
        "countryCode": "PT",
        "extraction": {
          "invoiceNumber": "INV-2024-001",
          "invoiceDate": "2024-01-15",
          "vendorName": "Acme Corp",
          "vendorNIF": "123456789",
          "totalAmount": 1250.00,
          "currency": "EUR",
          "lineItems": [
            {
              "description": "Product A",
              "quantity": 2,
              "unitPrice": 500.00,
              "total": 1000.00
            }
          ]
        }
      }
    ]
  }
}
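
Once the response is parsed, the fields can be read with ordinary dictionary access. The following is a minimal sketch, not part of the official client: the summarize helper is hypothetical, and the sample data is a trimmed copy of the response above.

```python
# Trimmed copy of the example response shown above.
sample = {
    "StateText": "COMPLETED",
    "TraceId": "ABC1234",
    "Output": {
        "extractions": [
            {
                "documentType": "Invoice",
                "confidence": 0.95,
                "extraction": {
                    "invoiceNumber": "INV-2024-001",
                    "totalAmount": 1250.00,
                    "currency": "EUR",
                },
            }
        ]
    },
}


def summarize(result):
    """Return one summary dict per extraction in a completed response.

    Hypothetical helper: returns an empty list unless StateText is COMPLETED.
    """
    if result.get("StateText") != "COMPLETED":
        return []
    summaries = []
    for item in result.get("Output", {}).get("extractions", []):
        fields = item.get("extraction", {})
        summaries.append({
            "type": item.get("documentType"),
            "confidence": item.get("confidence"),
            "number": fields.get("invoiceNumber"),
            "total": fields.get("totalAmount"),
            "currency": fields.get("currency"),
        })
    return summaries


for row in summarize(sample):
    print(row)
```

Using .get() throughout keeps the helper from raising when a field is missing, which matters because the set of extracted fields depends on the document type.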

Document Types

Invoice

Common fields extracted:

  • invoiceNumber - Invoice reference number
  • invoiceDate - Issue date
  • dueDate - Payment due date
  • vendorName - Seller/vendor name
  • vendorNIF / vendorVAT - Tax identification number
  • buyerName - Buyer/customer name
  • totalAmount - Total including tax
  • netAmount - Total before tax
  • taxAmount - Tax/VAT amount
  • currency - Currency code (EUR, USD, etc.)
  • lineItems - Array of items with description, quantity, price
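
Because totalAmount is described as the total including tax and netAmount as the total before tax, a simple cross-check on an extracted invoice is possible. This is a sketch under the assumption that netAmount + taxAmount should equal totalAmount within a small rounding tolerance; the amounts_consistent helper is hypothetical and not part of the API.

```python
from decimal import Decimal


def amounts_consistent(fields, tolerance=Decimal("0.01")):
    """Check that netAmount + taxAmount matches totalAmount.

    Returns True or False when all three amounts are present,
    and None when any of them is missing from the extraction.
    """
    keys = ("netAmount", "taxAmount", "totalAmount")
    if any(fields.get(key) is None for key in keys):
        return None
    # Convert through str() so float inputs become exact decimal values.
    net, tax, total = (Decimal(str(fields[key])) for key in keys)
    return abs(net + tax - total) <= tolerance


print(amounts_consistent(
    {"netAmount": 1000.00, "taxAmount": 250.00, "totalAmount": 1250.00}
))
```

An invoice that fails this check is a reasonable candidate for manual review even when its confidence score is high.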

Receipt

Common fields extracted:

  • merchantName - Store/merchant name
  • date - Transaction date
  • time - Transaction time
  • total - Total amount
  • paymentMethod - Cash, card, etc.
  • items - Purchased items

Contract

Common fields extracted:

  • parties - Contract parties
  • effectiveDate - Start date
  • expirationDate - End date
  • value - Contract value
  • terms - Key terms and conditions

Multi-Document PDFs

A single PDF might contain multiple documents. The API handles this automatically:

{
  "Output": {
    "extractions": [
      {
        "documentType": "Invoice",
        "pageRange": { "start": 1, "end": 2 },
        "extraction": { ... }
      },
      {
        "documentType": "Invoice",
        "pageRange": { "start": 3, "end": 4 },
        "extraction": { ... }
      }
    ]
  }
}
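
To work with a multi-document response, iterate over every entry in the extractions array. The sketch below assumes that pageRange uses inclusive page numbers, as the example above suggests, and that it may be absent from some responses; the page_ranges helper is hypothetical.

```python
def page_ranges(result):
    """List (documentType, start, end, page_count) for each extraction."""
    rows = []
    for item in result.get("Output", {}).get("extractions", []):
        pages = item.get("pageRange")
        if not pages:
            # Assumption: pageRange can be missing; skip such entries.
            continue
        start, end = pages["start"], pages["end"]
        # Inclusive range, so pages 1 to 2 count as two pages.
        rows.append((item.get("documentType"), start, end, end - start + 1))
    return rows


result = {
    "Output": {
        "extractions": [
            {"documentType": "Invoice", "pageRange": {"start": 1, "end": 2}},
            {"documentType": "Invoice", "pageRange": {"start": 3, "end": 4}},
        ]
    }
}

for row in page_ranges(result):
    print(row)
```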

Confidence Scores

Each extraction includes a confidence score (0-1):

Score      Interpretation
0.9+       High confidence - reliable extraction
0.7-0.9    Good confidence - review recommended
0.5-0.7    Moderate confidence - manual review suggested
< 0.5      Low confidence - manual verification required

for extraction in result["Output"]["extractions"]:
    if extraction["confidence"] >= 0.9:
        # Auto-process
        process_extraction(extraction)
    elif extraction["confidence"] >= 0.7:
        # Queue for quick review
        queue_for_review(extraction)
    else:
        # Manual processing
        queue_for_manual(extraction)

Using Context IDs

The contextID parameter groups related documents:

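import uuid

import requests

API_KEY = "your-api-key"
pdf_files = ["invoice-1.pdf", "invoice-2.pdf"]  # Placeholder paths

# Process multiple documents from the same batch
batch_context = str(uuid.uuid4())

for pdf_file in pdf_files:
    # Open each file in a context manager so the handle is closed after upload
    with open(pdf_file, "rb") as pdf:
        response = requests.post(
            "https://apix.docdigitizer.com/sync",
            headers={"X-API-Key": API_KEY},
            files={"files": pdf},
            data={
                "id": str(uuid.uuid4()),  # Unique per document
                "contextID": batch_context,  # Same for all documents in batch
            },
        )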

Specifying a Pipeline

Select a specific processing pipeline with either the X-DD-Pipeline header or the pipelineIdentifier form field. With the header:

curl -X POST https://apix.docdigitizer.com/sync \
  -H "X-API-Key: your-api-key" \
  -H "X-DD-Pipeline: MainPipelineWithOCR" \
  -F "files=@document.pdf" \
  -F "id=$(uuidgen)" \
  -F "contextID=$(uuidgen)"

Or in the form data:

data={
    "id": str(uuid.uuid4()),
    "contextID": str(uuid.uuid4()),
    "pipelineIdentifier": "MainPipelineWithOCR"
}

Timing Information

The response includes a breakdown of processing time:

{
  "Timers": {
    "DocIngester": {
      "total": 2345.67,
      "ForwardRequest": 2300.00,
      "DocWorker": {
        "total": 2250.00,
        "OCR": 1200.00,
        "Classification": 450.00,
        "Extraction": 600.00
      }
    }
  }
}

Control timer detail level with the X-DD-LogLevel header:

Value      Detail Level
minimal    Basic timing (default)
medium     Component-level timing
full       Detailed breakdown
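
The Timers object can be used to see which stage dominates processing time. The following sketch assumes the nesting shown in the example above (DocIngester, then DocWorker, then one entry per stage) and that all values share the same unit; the stage_shares helper is hypothetical.

```python
def stage_shares(timers):
    """Return each DocWorker stage as a percentage of the DocWorker total."""
    worker = timers["DocIngester"]["DocWorker"]
    total = worker["total"]
    if not total:
        return {}  # Avoid dividing by zero when no time was recorded.
    return {
        name: round(100 * value / total, 1)
        for name, value in worker.items()
        if name != "total"
    }


timers = {
    "DocIngester": {
        "total": 2345.67,
        "ForwardRequest": 2300.00,
        "DocWorker": {
            "total": 2250.00,
            "OCR": 1200.00,
            "Classification": 450.00,
            "Extraction": 600.00,
        },
    }
}

print(stage_shares(timers))
```

With a real response, pass result["Timers"] to the helper. The keys present may differ with the X-DD-LogLevel value, so treat the nesting as an assumption to verify.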

Error Handling

Handle extraction errors gracefully:

result = response.json()

if result["StateText"] == "ERROR":
    trace_id = result["TraceId"]
    messages = result["Messages"]

    if "File must be a PDF" in messages:
        print("Please provide a valid PDF file")
    elif "Invalid or missing API key" in messages:
        print("Check your API key")
    else:
        print(f"Error: {messages}. Contact support with TraceId: {trace_id}")
else:
    # Process successful extraction
    process_extractions(result["Output"]["extractions"])
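
The checks above behave differently depending on whether Messages is a single string (substring match) or a list of strings (exact match against each element). The shape of Messages is not specified here, so the following sketch normalizes both cases before matching. The message_text and describe_error helpers are hypothetical.

```python
def message_text(result):
    """Join Messages into one string whether it is a string, a list, or missing."""
    messages = result.get("Messages") or []
    if isinstance(messages, str):
        return messages
    return "; ".join(str(message) for message in messages)


def describe_error(result):
    """Map an error response to a short, user-facing description."""
    text = message_text(result)
    if "File must be a PDF" in text:
        return "Please provide a valid PDF file"
    if "Invalid or missing API key" in text:
        return "Check your API key"
    trace_id = result.get("TraceId")
    return f"Error: {text}. Contact support with TraceId: {trace_id}"


print(describe_error({
    "StateText": "ERROR",
    "TraceId": "ABC1234",
    "Messages": ["File must be a PDF"],
}))
```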

Best Practices

  1. Use UUIDs for tracking - Generate unique IDs for each document
  2. Group with contextID - Use the same contextID for related documents
  3. Check confidence scores - Don't auto-process low-confidence extractions
  4. Log TraceIds - Store TraceIds for debugging
  5. Handle multi-doc PDFs - Process all extractions in the array
  6. Implement retry logic - Retry on 5xx errors with backoff
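
The retry advice in item 6 can be sketched as follows. This is one possible approach using exponential backoff with jitter, not a prescribed implementation: the call_with_retry helper, its defaults, and the send callable are all hypothetical. The send argument is any zero-argument callable that returns an object with a status_code attribute, such as a requests.Response.

```python
import random
import time


def call_with_retry(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-5xx status or attempts run out.

    Returns the last response received, whether or not it succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        response = send()
        if response.status_code < 500 or attempt == max_attempts:
            return response
        # Exponential backoff: base_delay, then 2x, 4x, ... plus random jitter.
        delay = base_delay * 2 ** (attempt - 1)
        sleep(delay + random.uniform(0, delay / 2))


class FakeResponse:
    """Stand-in for an HTTP response, used only to demonstrate the helper."""

    def __init__(self, status_code):
        self.status_code = status_code


responses = iter([FakeResponse(503), FakeResponse(502), FakeResponse(200)])
recorded_delays = []
final = call_with_retry(lambda: next(responses), sleep=recorded_delays.append)
print(final.status_code, len(recorded_delays))
```

In real use, send would wrap the upload request and reopen the PDF on each attempt, since a file handle that has already been read cannot be sent again without rewinding. Responses with 4xx status codes are returned immediately because retrying them without changing the request is unlikely to help.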
