Document Text Extraction

This article describes the legacy invoice data extraction with the invoice coding API. For new projects, we suggest that you use Kaunt Document AI for invoice data extraction.

Our Document Extraction solution currently uses the prebuilt-invoice model provided by Azure Cognitive Services, API version 3.1, to extract data from invoices in formats such as PDF, PNG, JPEG, and TIFF. This API provides machine learning models that extract structured data from unstructured documents. Kaunt maps the extracted data to the Kaunt data model and creates a new invoice from it. In this way, Kaunt can support a wide variety of invoice formats and layouts and process invoices for customers that receive a large number of unstructured invoice formats. The following illustration shows a high-level overview of the flow when using the document text extraction service by Kaunt.

High-level illustration of the flow when using the document text extraction service by Kaunt.

While we strive to provide accurate and reliable data extraction services, Kaunt does not take any responsibility for any inaccuracies or errors in the extracted data. The accuracy of the extracted data is dependent on the quality and format of the input document, as well as the performance of the underlying machine learning models. It is the responsibility of the user to verify the accuracy of the extracted data and make any necessary corrections.

Target Audience

The Document Extraction endpoint is primarily intended for users who handle invoices in diverse, unstructured formats and layouts. The endpoint extracts data from such invoices and maps it to the Kaunt data model, which lets users process invoices that the Kaunt API does not support natively.

However, if your invoices already come in well-defined formats and layouts, you may not need the Document Extraction endpoint. In that case, we recommend using the Kaunt API directly with the supported invoice formats and layouts. For a complete list of the supported invoice formats and layouts, see our Invoice Formats documentation.

Mapping Between Form Recognizer and Kaunt Data Model

Azure Form Recognizer           Kaunt
CustomerName                    Buyer.Name
CustomerId                      N/A
PurchaseOrder                   OrderNumber
InvoiceId                       VendorInvoiceNumber
InvoiceDate                     InvoiceDate
DueDate                         DueDate
VendorName                      Vendor.Name
VendorAddress                   Vendor.Address
VendorAddressRecipient          Vendor.Contact.Name
CustomerAddress                 Buyer.Address
CustomerAddressRecipient        Buyer.Contact.Name
BillingAddress                  N/A
BillingAddressRecipient         N/A
ShippingAddress                 DeliveryAddress
ShippingAddressRecipient        DeliveryContact.Name
SubTotal                        AmountExVAT
TotalDiscount                   Not Mapped
TotalTax                        vatAmount
InvoiceTotal                    AmountInclVAT
AmountDue                       AmountInclVAT (If InvoiceTotal not present)
PreviousUnpaidBalance           Not Mapped
RemittanceAddress               Not Mapped
RemittanceAddressRecipient      Not Mapped
ServiceAddress                  Not Mapped
ServiceAddressRecipient         Not Mapped
ServiceStartDate                Not Mapped
ServiceEndDate                  Not Mapped
VendorTaxId                     Vendor.TaxIdentificationNumber
CustomerTaxId                   Buyer.TaxIdentificationNumber
PaymentTerm                     PaymentInformation.PaymentTermsId
PaymentDetails                  Not Mapped
PaymentDetails.*                Not Mapped
PaymentDetails.*.IBAN           Not Mapped
PaymentDetails.*.SWIFT          Not Mapped
TaxDetails                      Not Mapped
TaxDetails.*                    Not Mapped
TaxDetails.*.Amount             Not Mapped
TaxDetails.*.Rate               Not Mapped
Items.*.Amount                  LineAmountInclVAT
Items.*.Date                    Not Mapped
Items.*.Description             Description
Items.*.Quantity                Quantity
Items.*.ProductCode             Name
Items.*.Tax                     LineVatAmount
Items.*.TaxRate                 Not Mapped
Items.*.Unit                    UnitOfMeasure
Items.*.UnitPrice               Not Mapped
Items.*.Amount - Items.*.Tax    LineAmountExVAT

For a description of the fields extracted by Azure Form Recognizer, see the official Azure Form Recognizer Invoice Model Docs.
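
To make the header-level mapping concrete, the following is a minimal sketch in Python, not Kaunt's actual implementation. It assumes the extracted fields have already been flattened into a plain dictionary of field name to value (the Azure SDK returns richer field objects), and that a Kaunt invoice can be represented as nested dictionaries. The names HEADER_FIELD_MAP, set_path, and map_header are hypothetical and only cover a subset of the mapping.

```python
from typing import Any, Dict

# Subset of the header-level mapping from the table:
# Form Recognizer field name -> dotted path in the Kaunt data model.
HEADER_FIELD_MAP: Dict[str, str] = {
    "CustomerName": "Buyer.Name",
    "PurchaseOrder": "OrderNumber",
    "InvoiceId": "VendorInvoiceNumber",
    "InvoiceDate": "InvoiceDate",
    "DueDate": "DueDate",
    "VendorName": "Vendor.Name",
    "VendorTaxId": "Vendor.TaxIdentificationNumber",
    "SubTotal": "AmountExVAT",
    "TotalTax": "vatAmount",
    "InvoiceTotal": "AmountInclVAT",
}


def set_path(target: Dict[str, Any], dotted_path: str, value: Any) -> None:
    """Store value in nested dicts following a dotted path such as 'Buyer.Name'."""
    *parents, leaf = dotted_path.split(".")
    node = target
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


def map_header(extracted: Dict[str, Any]) -> Dict[str, Any]:
    """Map flattened Form Recognizer header fields to a Kaunt-style invoice dict."""
    invoice: Dict[str, Any] = {}
    for source_field, kaunt_path in HEADER_FIELD_MAP.items():
        if source_field in extracted:
            set_path(invoice, kaunt_path, extracted[source_field])
    # AmountDue is used for AmountInclVAT only when InvoiceTotal is not present.
    if "InvoiceTotal" not in extracted and "AmountDue" in extracted:
        invoice["AmountInclVAT"] = extracted["AmountDue"]
    return invoice
```

Fields that the table marks as N/A or Not Mapped, such as CustomerId, are simply ignored because they have no entry in the dictionary.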

Invoices without Line Items

Some invoices do not contain line items, although Kaunt requires at least one. If Azure Form Recognizer cannot identify any line items on an invoice, Kaunt automatically creates a single invoice line from the header amounts. This invoice line has its lineNumber set to GeneratedHeaderLine.
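
The line-level mapping and the fallback for invoices without line items can be sketched as follows. This is an illustration under stated assumptions, not Kaunt's implementation: the function name map_lines is hypothetical, the input is assumed to be a list of flattened item dictionaries with Decimal amounts, and the choice of which header amount goes into which line field of the generated line is a guess based on the field names.

```python
from decimal import Decimal
from typing import Any, Dict, List, Optional


def map_lines(items: List[Dict[str, Any]], header: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Map extracted line items to Kaunt-style invoice lines.

    If no items were extracted, return one generated line built from
    the header amounts (assumed field pairing, see the lead-in).
    """
    if not items:
        return [{
            "lineNumber": "GeneratedHeaderLine",
            "LineAmountExVAT": header.get("AmountExVAT"),
            "LineVatAmount": header.get("vatAmount"),
            "LineAmountInclVAT": header.get("AmountInclVAT"),
        }]

    lines = []
    for item in items:
        amount: Optional[Decimal] = item.get("Amount")
        tax: Optional[Decimal] = item.get("Tax")
        # LineAmountExVAT is Items.*.Amount - Items.*.Tax, when both are known.
        ex_vat = amount - tax if amount is not None and tax is not None else None
        lines.append({
            "Description": item.get("Description"),
            "Quantity": item.get("Quantity"),
            "Name": item.get("ProductCode"),
            "UnitOfMeasure": item.get("Unit"),
            "LineAmountInclVAT": amount,
            "LineVatAmount": tax,
            "LineAmountExVAT": ex_vat,
        })
    return lines
```

Decimal is used instead of float so that subtracting the tax from the line amount does not introduce rounding artifacts in monetary values.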

How do I get access to the Document Text Extraction endpoint?

By default, access to the document text extraction endpoint is disabled for all users. To request access, reach out to your designated contact at Kaunt or send an email to support@kaunt.com to be signed up for our partner portal.