OCR-based data extraction from price quotations using Abbyy and UiPath

Case study

Large organizations have the unenviable task of processing a large number of quotations every day, in both digital and physical form. The purchase department has to compare the prices quoted by suppliers and check additional information such as technical specifications, delivery deadlines, and payment methods. The information therefore has to be extracted accurately, with little or no room for human error.

Our client is a behemoth in the distribution of electronic components and computer products, and also provides related value-added services. They are present across the globe and bring in revenue of more than $20 billion a year. They receive a large number of price quotations from their suppliers.

Requirements

The price quote reviewers must be able to scrutinize information from multiple quotations quickly and efficiently to arrive at the best possible business decisions. Identifying the relevant data in these quotes manually was time-consuming and error-prone. The client wanted to automate data extraction from the price quotations and upload the extracted data to their in-house workbench application.

Challenges

  • The existing manual data extraction process was extremely time-consuming, monotonous, and rule-driven.
  • Suppliers followed no standardized quotation format, which led to inconsistent data capture.

Solution

Our client receives price quotes in multiple formats from their suppliers. The same information might appear under different field names in different quotation forms. For example, SKU (stock keeping unit) in one form may be labelled Product Unit in another, even though both fields contain the same information. Automating the data extraction process would improve the overall efficiency of the process and drastically reduce the data extraction cycle time.
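The field-name problem described above can be illustrated with a small lookup table. The following is a minimal sketch, not the client's implementation: only the SKU / Product Unit pair comes from this case study, and the other labels, the canonical field names, and the function `normalize_record` are hypothetical.

```python
import re

# Hypothetical synonym table: canonical field name -> labels seen on supplier forms.
# Only the SKU / "Product Unit" pair is taken from the case study; the rest are illustrative.
FIELD_SYNONYMS = {
    "sku": ["SKU", "Stock Keeping Unit", "Product Unit"],
    "unit_price": ["Unit Price", "Price per Unit"],
    "delivery_date": ["Delivery Date", "Delivery Deadline"],
}


def _key(label: str) -> str:
    """Lowercase a label and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", label.lower())


# Reverse lookup: normalized label -> canonical field name.
LABEL_TO_FIELD = {
    _key(label): field
    for field, labels in FIELD_SYNONYMS.items()
    for label in labels
}


def normalize_record(raw: dict[str, str]) -> dict:
    """Rename supplier-specific labels to canonical names.

    Labels that are not in the table are kept under "unmapped" so a
    reviewer can inspect them instead of losing the data silently.
    """
    record: dict = {}
    unmapped: dict[str, str] = {}
    for label, value in raw.items():
        field = LABEL_TO_FIELD.get(_key(label))
        if field is None:
            unmapped[label] = value
        else:
            record[field] = value.strip()
    if unmapped:
        record["unmapped"] = unmapped
    return record
```

With this table, a form that uses "Product Unit" and a form that uses "SKU" both yield a record with the key `sku`, so downstream comparison logic only has to deal with one name per field.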

Imaginea proposed the implementation of an OCR (Optical Character Recognition) solution to automate the process of identifying relevant information from the price quotes and updating it to their workbench. The data was first extracted from various quotations through the use of an OCR engine. The data was then compiled into a CSV file and uploaded to the in-house application through an OCR engine connector.
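The step that compiles extracted data into a CSV file could look roughly like the sketch below. It uses only Python's standard `csv` module; the column names and the function `records_to_csv` are assumptions made for illustration, and the upload through the OCR engine connector is not shown because its interface is not described here.

```python
import csv
import io

# Hypothetical column layout; the real workbench schema is not described in the case study.
COLUMNS = ["supplier", "sku", "unit_price", "delivery_date"]


def records_to_csv(records: list[dict]) -> str:
    """Serialize extracted quotation records to CSV text.

    Missing fields are written as empty cells and fields outside
    COLUMNS are ignored, so every row has the same shape.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=COLUMNS,
        restval="",             # value used for missing fields
        extrasaction="ignore",  # drop keys that are not in COLUMNS
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [{"supplier": "Acme", "sku": "AB-123", "unit_price": "4.50"}]
    print(records_to_csv(sample), end="")
```

Writing a fixed header and padding missing values keeps the file predictable for whatever consumes it, which matters when the source quotations do not share a format.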

Tech stack

How our solution helped

Using an OCR engine to extract relevant information from price quotes reduced the data extraction cycle time by about 80% overall.

Overall approach

The first step was to establish the process. We identified OCR as the best approach because the price quotes arrived in PDF, scanned, and printed formats. The next step was to establish a process goal: improve process efficiency, reduce costs, and improve accuracy. After determining the OCR fitment, we analyzed the content and listed the following text and input attributes that needed to be considered for data extraction:

  • Text density: The text density on the printed quotations was high.
  • Text structure: The text in the PDF files was structured. The structure of running text could also differ from that of numerical values, such as SKU, product ID, and so on.
  • Fonts: Printed fonts were easier to capture than handwritten characters because they are more regular.
  • Character type: Only English was used, with alphabetic and alphanumeric characters.
  • Images: Scanned copies of the quotations were easier to read than photographs of them.
  • Location: The solution needed to handle cropped or centred text and text at arbitrary locations in the images.

The opportunities for OCR implementation were identified and prioritized on the following criteria and managed in phases:

The following diagram provides an overview of the OCR solution:

The following diagram provides an overview of the overall application architecture:

Results

  • Reduced data extraction cycle time per quotation from 24 minutes to 5 minutes
  • Substantial reduction in human error
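As a quick check, the per-quotation figures above line up with the roughly 80% reduction reported earlier. The snippet below is only the arithmetic; the helper name `percent_reduction` is ours.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


# Cycle time per quotation, in minutes, as reported in the results.
reduction = percent_reduction(24, 5)
print(f"{reduction:.1f}% shorter cycle time")  # prints "79.2% shorter cycle time"
```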
