Document Extraction Quickstart

Learn how to extract structured data from documents using the DocDigitizer API.

Overview

The Document Extraction API transforms PDF or image documents into structured data through:

  1. OCR - Convert images and scanned documents to text
  2. Classification - Automatically detect document type and schema (invoice, receipt, etc.)
  3. Extraction - Pull out specific fields based on the detected schema

Prerequisites

  • DocDigitizer account with API key
  • PDF or image document to process

Basic Extraction

Step 1: Upload Document

curl --request POST \
     --url https://api.docdigitizer.com/v3/docingester/extract \
     --header 'X-API-Key: your-api-key-here' \
     --header 'accept: application/json' \
     --header 'content-type: multipart/form-data' \
     --form files='@invoice.pdf'
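
The same upload can be sketched in Python using only the standard library. The endpoint, header names, and the files form field are taken from the curl command above; the function names and the application/pdf part type are assumptions made for illustration, not part of the documented API.

```python
import json
import os
import urllib.request
import uuid

EXTRACT_URL = "https://api.docdigitizer.com/v3/docingester/extract"


def build_extract_request(api_key, filename, content, content_type="application/pdf"):
    """Build a multipart/form-data POST carrying one file in the 'files' field."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="files"; filename="{filename}"\r\n'.encode(),
        # Assumption: the part is labelled as a PDF; adjust for image uploads.
        f"Content-Type: {content_type}\r\n\r\n".encode(),
        content,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {
        "X-API-Key": api_key,
        "Accept": "application/json",
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return urllib.request.Request(EXTRACT_URL, data=body, headers=headers, method="POST")


def extract_document(api_key, path):
    """Upload the file at `path` and return the decoded JSON response."""
    with open(path, "rb") as f:
        request = build_extract_request(api_key, os.path.basename(path), f.read())
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling extract_document("your-api-key-here", "invoice.pdf") would send the request; build_extract_request on its own performs no network access, which makes it easy to inspect the headers and body first.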

Step 2: Receive Extracted Data

{
  "stateText": "COMPLETED",
  "traceId": "ABC1234",
  "pipeline": "MainPipelineWithOCR",
  "numberPages": 2,
  "output": {
    "extractions": [
      {
        "schemaName": "Invoice",
        "confidence": 0.95,
        "pages": [
          1,
          2
        ],
        "extraction": {
          "invoiceNumber": "INV-2024-001",
          "invoiceDate": "2024-01-15",
          "vendorName": "Acme Corp",
          "vendorNIF": "123456789",
          "totalAmount": 1250.00,
          "currency": "EUR",
          "lineItems": [
            {
              "description": "Product A",
              "quantity": 2,
              "unitPrice": 500.00,
              "total": 1000.00
            }
          ]
        }
      }
    ]
  },
  "DocumentId": "f8b969cd-fbf5-4048-8d66-ea90283d4eb9",
  "Timestamp": "2026-05-04T14:44:43.3512947Z",
  "timers": {
    "DocIngester": {
      "Total": 2345.67
    }
  }
}
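
As a rough illustration of reading this response, the sketch below walks the extractions array and prints each field. The sample dictionary is a trimmed copy of the response above; list_extractions is a hypothetical helper name.

```python
# Trimmed copy of the sample response shown above.
SAMPLE = {
    "stateText": "COMPLETED",
    "traceId": "ABC1234",
    "output": {
        "extractions": [
            {
                "schemaName": "Invoice",
                "confidence": 0.95,
                "pages": [1, 2],
                "extraction": {
                    "invoiceNumber": "INV-2024-001",
                    "totalAmount": 1250.00,
                    "currency": "EUR",
                },
            }
        ]
    },
}


def list_extractions(result):
    """Return (schemaName, confidence, pages, fields) for each extraction."""
    return [
        (item["schemaName"], item["confidence"], item["pages"], item["extraction"])
        for item in result["output"]["extractions"]
    ]


for schema, confidence, pages, fields in list_extractions(SAMPLE):
    print(f"{schema} on pages {pages} (confidence {confidence:.2f})")
    for name, value in fields.items():
        print(f"  {name}: {value}")
```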

Document Types (examples)

Invoice

Common fields extracted:

  • invoiceNumber - Invoice reference number
  • invoiceDate - Issue date
  • dueDate - Payment due date
  • vendorName - Seller/vendor name
  • vendorNIF / vendorVAT - Tax identification number
  • buyerName - Buyer/customer name
  • totalAmount - Total including tax
  • netAmount - Total before tax
  • taxAmount - Tax/VAT amount
  • currency - Currency code (EUR, USD, etc.)
  • lineItems - Array of items with description, quantity, price

Receipt

Common fields extracted:

  • merchantName - Store/merchant name
  • date - Transaction date
  • time - Transaction time
  • total - Total amount
  • paymentMethod - Cash, card, etc.
  • items - Purchased items

Contract

Common fields extracted:

  • parties - Contract parties
  • effectiveDate - Start date
  • expirationDate - End date
  • value - Contract value
  • terms - Key terms and conditions

Multi-Document PDFs

A single PDF may contain several documents. The API detects each one and returns it as a separate entry in extractions, with its own page range:

{
  "output": {
    "extractions": [
      {
        "schemaName": "Invoice",
        "pages": [1, 2],
        "extraction": { ... }
      },
      {
        "schemaName": "Receipt",
        "pages": [3, 4],
        "extraction": { ... }
      }
    ]
  }
}
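
One way to consume such a response is to group the page ranges by detected schema, so each document in the PDF can be handled separately. This is a minimal sketch; pages_by_schema is a hypothetical helper, and the sample mirrors the abbreviated response above with the extraction bodies left empty.

```python
from collections import defaultdict


def pages_by_schema(result):
    """Map each schemaName to the list of page ranges detected for it."""
    grouped = defaultdict(list)
    for item in result["output"]["extractions"]:
        grouped[item["schemaName"]].append(item["pages"])
    return dict(grouped)


multi_doc = {
    "output": {
        "extractions": [
            {"schemaName": "Invoice", "pages": [1, 2], "extraction": {}},
            {"schemaName": "Receipt", "pages": [3, 4], "extraction": {}},
        ]
    }
}

print(pages_by_schema(multi_doc))
```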

Confidence Scores

Each extraction includes a confidence score (0-1):

Score       Interpretation
0.9+        High confidence - reliable extraction
0.7-0.9     Good confidence - review recommended
0.5-0.7     Moderate confidence - manual review suggested
< 0.5       Low confidence - manual verification required

for extraction in result["output"]["extractions"]:
    if extraction["confidence"] >= 0.9:
        # Auto-process
        process_extraction(extraction)
    elif extraction["confidence"] >= 0.7:
        # Queue for quick review
        queue_for_review(extraction)
    else:
        # Manual processing
        queue_for_manual(extraction)

Timing Information

The response includes a breakdown of processing time:

{
  "timers": {
    "DocIngester": {
      "Total": 2345.67,
      "ForwardRequest": 2300.00,
      "DocWorker": {
        "total": 2250.00,
        "OCR": 1200.00,
        "Classification": 450.00,
        "Extraction": 600.00
      }
    }
  }
}

Control timer detail level with the X-DD-LogLevel header:

Value       Detail Level
minimal     Basic timing (default)
medium      Component-level timing
full        Detailed breakdown
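
When the detailed breakdown is present, the timers can show where processing time goes. The sketch below computes each stage's share of the DocWorker total. It assumes the response carries the nested DocWorker object with the key names shown in the sample above; stage_shares is a hypothetical helper.

```python
def stage_shares(timers):
    """Return each DocWorker stage's fraction of the DocWorker total."""
    worker = timers["DocIngester"]["DocWorker"]
    total = worker["total"]
    return {
        stage: value / total
        for stage, value in worker.items()
        if stage.lower() != "total"  # skip the aggregate entry itself
    }


sample_timers = {
    "DocIngester": {
        "Total": 2345.67,
        "ForwardRequest": 2300.00,
        "DocWorker": {
            "total": 2250.00,
            "OCR": 1200.00,
            "Classification": 450.00,
            "Extraction": 600.00,
        },
    }
}

for stage, share in stage_shares(sample_timers).items():
    print(f"{stage}: {share:.1%}")
```

If the breakdown is missing at a lower detail level, the lookup of DocWorker raises KeyError, so callers that do not request the detailed level should guard for that.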

Error Handling

Handle extraction errors gracefully:

result = response.json()

if result["stateText"] == "ERROR":
    trace_id = result["traceId"]
    messages = result["messages"]

    if "File must be a PDF" in messages:
        print("Please provide a valid PDF file")
    elif "Invalid or missing API key" in messages:
        print("Check your API key")
    else:
        print(f"Error: {messages}. Contact support with TraceId: {trace_id}")
else:
    # Process successful extraction
    process_extractions(result["output"]["extractions"])

Best Practices

  1. Check confidence scores - Don't auto-process low-confidence extractions
  2. Log trace IDs - Store the traceId from each response for debugging
  3. Handle multi-doc PDFs - Process all extractions in the array
  4. Implement retry logic - Retry on 5xx errors with backoff
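
The retry advice in item 4 can be sketched as exponential backoff around whatever function sends the request. Everything here is illustrative: call_with_retry and its parameters are hypothetical, and send stands for any callable of yours that returns a (status_code, body) pair.

```python
import time


def call_with_retry(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-5xx status or attempts run out.

    `send` is a caller-supplied function returning (status_code, body).
    The delay doubles after each failed attempt: base_delay, 2x, 4x, ...
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        # Return on success, on client errors, or when out of attempts.
        if status < 500 or attempt == max_attempts - 1:
            return status, body
        sleep(base_delay * 2 ** attempt)
```

Only 5xx responses are retried; 4xx responses such as an invalid API key are returned immediately, since repeating the same request would not change the outcome.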

Next Steps