Getting started

DocDigitizer API
This API is intended for use by developers and external systems
It makes available the necessary integration mechanisms that allow for the integration of DocDigitizer services.

First steps:

  • Get an API KEY from DocDigitizer (www.docdigitizer.com) or use the default one already configured on this page
  • Use your API key to authenticate on this documentation page (Authorize button)
  • View and Try the different endpoints below

Things you should know

HTTP-Based, RESTful, with SSL
We attempt to follow the principles of Representational State Transfer (REST), This means DocDigitizerAPI does not store 'state' nor 'sessions'. Most of the endpoints use JSON data format for responses and, for most of the cases, requests.

We leverage the verbosity of the HTTP protocol. Methods that retrieve data require a GET request, methods that send it might require a PUT or a POST.

All communication with the API should be made over SSL (this is extremelly important)

Your data schema might not be my data schema
Just picture this: you read all of our docs, go through all the examples. Try it out, like what you see, but we're not dealing with the specific scenario for your business need. Are we not a good fit? Of course we are! :)

We focused our documentation and examples on an invoices and accounting scenario, since we feel that can be an area easy enough to grasp for both tech and non-techs alike. But at the moment we deal with clients in many sectors with very different and specific data needs. Even though we don't support the creation of data schemas through our API (yet) that is something we can do to target our app's needs. Make sure to drop us a line if you think you'd need some help with setting up a different data schema.

Authentication
DocDigitizer API uses API Key authentication for all the endpoints that need to be authenticated.
This is a specific schema you'll need to send along in the header for all the requests sent by your
app (our api doesn't store 'state' nor 'sessions' in our side).

To authenticate your requests, you'll just need to set the "Authentication" request header. Something like so:

We currently have a file size limit of 25Mb for the upload and support the following media types:

  • application/pdf
  • image/jpeg
  • image/png
  • image/tiff

In case you don't build your request with the media type information, we'll try to infer it from the file's name extension.

Our Resources

These are the most relevant resources in DocDigitizer Api.

Documents
It all starts on a document. These are image or pdf files you'll want to send to the api to be annotated. You can later get a list of documents you annotated, get their most recent annotations or their original assets. Documents have a class (which give it then a target data schema for the data extractions to fill in).

Tasks
When you send your documents to the DocDigitizer API, we'll then queue it to the DocDigitizer Engine service. You can get some information about this task and poll the api for the tasks status before checking for the documents result.

Extractions
Extractions are the results from the DocDigitizer Engine for a document. These are machine generated and follow whatever data schema is set for the document's class.

Reviews
If you so choose, you can have your documents being reviewed by our team of Data Curators. The Reviews endpoints present the merged data of extractions and reviews and also follow whatever data schema is set for a document's class.

Smart Linking

You might have noticed by now that some of our responses send along a "links" node that might occur more than once in the
response body.

Each of the links inside the "links" node provides a way of referencing other related resources within our api.

For instance, when you retrieve the annotations for a specific document.

The links node tells you:

  • annotations, the endpoint for the annotations of that same document
  • assets, the endpoint to get the original assets for the document
  • collection, the endpoint to get the list of other documents associated to your organization
  • reviews, the endpoint to get the reviews for your that same document
  • self, the endpoint for the document itself

Response pagination

The endpoints that return lists are usually paginated. This impacts both the way you need to send your requests but also the data and metadata we send you on the responses.

KEY

REQ/RESP

WHERE

X-More-Results-Available

RESPONSE

HEADER

X-Cursor-ID

RESPONSE

HEADER

X-Cursor-Was-Broken

RESPONSE

HEADER

maxRetrieve

REQUEST

QUERY

cursorId

REQUEST

QUERY

X-More-Results-Available
We'll send this on the response headers. It lets you know wether or not there are more results to fetch. This applies to both:

  • there are so few results, that pagination is not even needed
  • (OR) you reached the last page of results

X-Cursor-ID
We'll send this on the response headers. This represents the position from where to start for the next page. In order to get it, you'll need to do the same request again (same endpoint, same parameters) but change the cursorId query parameter. On this second response, you will get a new X-Cursor-ID

X-Cursor-Was-Broken
We'll send this on the response headers. You should worry with X-Cursor-ID other than making sure to send it back to us in order to get a subsequent page of data. If for some reason the value you send to us is invalid or corrupt, we'll set X-Cursor-Was-Broken to true. If all is fine, false

maxRetrieve
You should send this as a parameter in the request query string. We'll apply pagination with a default page size (we specify each endpoint's default value), but you can change the page size by setting an integer value to maxRetrieve.

cursorId
You should send this as a parameter in the request query string. If you just got a response, and the X-More-Results-Available header is true, you should do another request exactly the same as the previous one but set the cursorId query parameter to whichever value you just got in X-Cursor-ID whenever you are ready to fetch the next page.