Python SDK


Installation

pip install docdigitizer

Requirements: Python 3.10+. The only runtime dependency is requests.

Quick Start

from docdigitizer import DocDigitizer

client = DocDigitizer(api_key="your-api-key")
result = client.process_document("exam.pdf")

print(result.output.doc_type)   # "MedicalOrder"
print(result.output.country)    # "PT"

for field in result.output.fields:
    print(f"{field.name}: {field.value}")

What do I get back?

process_document() returns a ProcessingResult. Here's a real example:

result.output.doc_type     → "MedicalOrder"
result.output.confidence   → 0
result.output.country      → "PT"
result.output.fields       → [
    ExtractionField(name="patientName", value="JOÃO JOSÉ DA COSTA FERNANDES"),
    ExtractionField(name="patientId", value="1527620"),
    ExtractionField(name="examDescription", value="ECO OSTEOARTICULAR - JOELHO"),
    ExtractionField(name="requestingDoctor", value="Fernando Marques Moura"),
    ExtractionField(name="clinicalInformation", value="dores no joelho, ..."),
    ...
]
result.trace_id            → "4ESZ13H"
result.document_id         → None
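
Because each field exposes a name and a value, the field list can be flattened into a plain dictionary for lookups by name. The following is a minimal sketch: fields_to_dict is a hypothetical helper, not part of the SDK, and the SimpleNamespace objects only stand in for ExtractionField instances so the snippet runs without an API call. If a field name repeats, the last value wins in this sketch.

```python
from types import SimpleNamespace


def fields_to_dict(fields):
    """Map field names to values (hypothetical helper, not an SDK function)."""
    return {field.name: field.value for field in fields}


# Stand-ins for ExtractionField objects, reusing values from the example above.
sample_fields = [
    SimpleNamespace(name="patientName", value="JOÃO JOSÉ DA COSTA FERNANDES"),
    SimpleNamespace(name="patientId", value="1527620"),
]

by_name = fields_to_dict(sample_fields)
print(by_name["patientId"])  # 1527620
```

With a real result, the call would presumably be fields_to_dict(result.output.fields).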

ProcessingResult

Field        Type              Description
output       ExtractionOutput  Extracted data (see below)
trace_id     str | None        Trace ID for debugging
document_id  str | None        Document identifier
context_id   str | None        Context identifier
timers       dict              Processing time breakdown
headers      dict              Raw response headers
raw          dict              Full raw JSON response

ExtractionOutput

Field       Type                   Description
doc_type    str | None             Detected document type (e.g. "MedicalOrder", "INV")
confidence  float | None           Classification confidence (0.0 – 1.0)
country     str | None             Detected country code (e.g. "PT")
fields      list[ExtractionField]  Extracted key-value fields
raw         dict                   Raw output data

ExtractionField

Field         Type          Description
name          str           Field name (e.g. "patientName", "total")
value         Any           Extracted value
confidence    float | None  Extraction confidence (0.0 – 1.0), if available
bounding_box  dict | None   Bounding box coordinates on the page, if available
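
Since confidence may be None, code that filters on it needs to handle the missing case explicitly. The sketch below shows one way to do that. split_by_confidence and the 0.8 threshold are assumptions made for illustration, not SDK features, and the sample confidences are invented so the snippet runs on its own. Fields without a confidence are routed to review rather than silently accepted.

```python
from types import SimpleNamespace


def split_by_confidence(fields, threshold=0.8):
    """Return (accepted, needs_review); fields with no confidence go to review."""
    accepted, needs_review = [], []
    for field in fields:
        if field.confidence is not None and field.confidence >= threshold:
            accepted.append(field)
        else:
            needs_review.append(field)
    return accepted, needs_review


# Stand-ins for ExtractionField objects; values and confidences are made up.
sample_fields = [
    SimpleNamespace(name="patientName", value="example", confidence=0.97),
    SimpleNamespace(name="patientId", value="example", confidence=None),
    SimpleNamespace(name="total", value="example", confidence=0.41),
]

accepted, needs_review = split_by_confidence(sample_fields)
print([f.name for f in accepted])      # ['patientName']
print([f.name for f in needs_review])  # ['patientId', 'total']
```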

Ways to send the file

# 1. Path string
result = client.process_document("path/to/document.pdf")

# 2. pathlib.Path
from pathlib import Path
result = client.process_document(Path("documents") / "exam.pdf")

# 3. File-like object (binary mode)
with open("document.pdf", "rb") as f:
    result = client.process_document(f)
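
Since pathlib.Path objects are accepted, processing a whole folder can be sketched as a loop over Path.glob. process_folder is a hypothetical helper, not part of the SDK; it assumes only that the client has the process_document method shown above. Note that in this sketch a single failing document aborts the loop.

```python
from pathlib import Path


def process_folder(client, folder, pattern="*.pdf"):
    """Process every file in `folder` matching `pattern`.

    Returns a dict mapping each file name to whatever
    client.process_document() returned for it.
    """
    results = {}
    # sorted() gives a stable, alphabetical processing order.
    for path in sorted(Path(folder).glob(pattern)):
        results[path.name] = client.process_document(path)
    return results
```

Usage might look like results = process_folder(client, "documents"), followed by a loop over results.items().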

Optional parameters

result = client.process_document(
    "document.pdf",
    document_id="550e8400-e29b-41d4-a716-446655440000",
    context_id="7c9e6679-7425-40de-944b-e07fc1f90ae7",
    doc_type="MedicalOrder",
    country="PT",
    extra_params={"customField": "value"},
)

Parameter     Type  Why use it?
document_id   str   Attach your own ID to the document for tracking
context_id    str   Group related documents together (e.g. same workflow)
doc_type      str   Hint the document type (e.g. "INV", "MedicalOrder") for better extraction
country       str   Hint the country (e.g. "PT") for better extraction
extra_params  dict  Send additional parameters specific to your use case
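
The IDs in the example above are formatted as UUIDs. Whether the API requires that format is not stated here, so treat the following as a sketch under that assumption: it generates one shared context_id and a fresh document_id per document using the standard uuid module. tracking_ids is a hypothetical helper, not part of the SDK.

```python
import uuid


def tracking_ids(count, context_id=None):
    """Return (context_id, [document_id, ...]) as UUID strings."""
    # Reuse the caller's context_id if given, otherwise create a new one.
    context_id = context_id or str(uuid.uuid4())
    document_ids = [str(uuid.uuid4()) for _ in range(count)]
    return context_id, document_ids


context_id, document_ids = tracking_ids(count=2)
for document_id in document_ids:
    print(f"context_id={context_id} document_id={document_id}")
```

Each pair could then be passed as the document_id and context_id arguments of process_document().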

Errors

The SDK raises typed exceptions. In practice, three cases matter:

from docdigitizer import DocDigitizer, AuthenticationError, APIError

client = DocDigitizer(api_key="your-api-key")

try:
    result = client.process_document("document.pdf")
except AuthenticationError:
    # Invalid or expired API key — don't retry, fix credentials
    print("Invalid credentials")
except APIError as e:
    if e.status_code in (503, 504):
        # Service temporarily unavailable — worth retrying with backoff
        print(f"Service unavailable ({e.status_code}), retrying...")
    else:
        # Other API errors (400, 404, 500, ...) — typically not worth retrying
        print(f"API error: {e}")

Exception                  When                                  Retry?
AuthenticationError        Invalid/expired API key (401)         No (fix credentials)
APIError with 503/504      Service temporarily unavailable       Yes, with backoff
APIError with other codes  Bad request, not found, server error  Generally no

All exceptions extend DocDigitizerError. For more granular exceptions (e.g. BadRequestError, NotFoundError, ServerError), see docdigitizer.exceptions.
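
The "retry with backoff" advice for 503/504 can be sketched as a small generic wrapper. This is an illustration, not an SDK feature: call_with_backoff and its parameters are hypothetical names, and the caller decides which exceptions count as retryable. The delay doubles after each failed attempt, with a small random jitter added.

```python
import random
import time


def call_with_backoff(func, is_retryable, max_attempts=4, base_delay=1.0,
                      sleep=time.sleep):
    """Call func(); on a retryable exception, wait and try again.

    Delays grow as base_delay, 2x, 4x, ... plus up to 10% random jitter.
    Non-retryable errors, and the final failed attempt, are re-raised.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == max_attempts or not is_retryable(exc):
                raise
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 10))
```

With the SDK, this might be called as call_with_backoff(lambda: client.process_document("document.pdf"), lambda e: isinstance(e, APIError) and e.status_code in (503, 504)). AuthenticationError and other status codes then propagate immediately, matching the table above.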

Advanced options

Sub-clients

DocDigitizer exposes sub-clients for functionality beyond document processing:

client = DocDigitizer(api_key="your-api-key")

# Registry — browse document types, countries, schemas
doc_types = client.registry.list_doc_types()
countries = client.registry.list_countries()
match = client.registry.find_best_schema(doc_type="INV", country="PT")

# Admin — CRUD on registry resources (requires elevated permissions)
client.admin.create_doc_type(code="INV", name="Invoice")

# Health checks
health = client.sync.health()

Authentication

# API Key (sent as X-API-Key header)
client = DocDigitizer(api_key="your-api-key")

# Bearer Token (sent as Authorization: Bearer header)
client = DocDigitizer(bearer_token="your-token")

# Custom auth strategy
# (my_custom_auth is assumed to be an instance of your own AuthStrategy implementation)
from docdigitizer.auth import AuthStrategy
client = DocDigitizer(auth=my_custom_auth)
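
Rather than hard-coding the key, it can be read from the environment. The sketch below is one way to do that. load_api_key and the variable name DOCDIGITIZER_API_KEY are assumptions made for illustration; nothing here indicates that the SDK reads this variable on its own.

```python
import os


def load_api_key(env=os.environ, var_name="DOCDIGITIZER_API_KEY"):
    """Return the API key from the environment, or raise if it is missing."""
    api_key = env.get(var_name, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return api_key


# Demo with an explicit mapping so the snippet runs without real credentials.
print(load_api_key(env={"DOCDIGITIZER_API_KEY": "your-api-key"}))  # your-api-key
```

The result would then be passed to the constructor, for example DocDigitizer(api_key=load_api_key()).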

Custom URLs and timeouts

client = DocDigitizer(
    api_key="your-api-key",
    sync_base_url="https://custom.example.com/sync",
    registry_base_url="https://custom.example.com/registry",
    sync_timeout=180.0,       # default: 120s
    registry_timeout=60.0,    # default: 30s
)

Context manager

with DocDigitizer(api_key="your-api-key") as client:
    result = client.process_document("document.pdf")
# HTTP sessions closed automatically