Accelerating data extraction: CRF with spatial features for Named Entity Recognition
Organizations today are overloaded with documents in different formats from multiple parties. Given the sheer number of documents and the complexity of processing the data in them, organizations need a streamlined approach to extract data and gain value from it. Natural Language Processing (NLP) techniques make the extraction of relevant information from unstructured documents considerably more efficient.
Named Entity Recognition (NER) is a specific task of NLP, which aims at extracting entities like person, location and date. There are a number of applications of NER, which can help in solving problems related to domains like medical, education, legal, and so on. Some of the most common use cases include classifying digital contents, efficient search algorithms, content recommendation, and customer support.
Applying ML techniques to NER
Approaches to entity extraction span a broad range — from linguistic rules-based to machine learning-based; there are some hybrid approaches as well. The dictionary-based method is a common NER technique as it plays a key role in understanding the text. Most dictionary-based NER systems focus on:
- Integrating and normalizing different biomedical databases to improve the quality of the dictionary to be used
- Improving matching strategies that are more suitable for biomedical terminologies
- Making filtering rules for post-processing to refine the matching results or to adjust the boundary of entities
Many information extraction systems have a dictionary matching module to perform preliminary detection of named entities. Many have tried image segmentation-based approaches as well, like chargrid (Katti et al., 2018) or BERTgrid (Denk et al., 2019).
However, machine learning techniques generally perform better on NER tasks. The automated learning process can induce patterns for recognizing entities as well as rules for pre- and post-processing. Broadly speaking, there are two categories of machine learning-based methods: one treats NER as a classification task, while the other treats it as a sequence labeling task. Sequence labeling can be approached with a Hidden Markov Model (HMM), a Conditional Random Field (CRF), or a combination of different models. Since it was proposed by Lafferty et al., CRF has been applied to many sequence labeling tasks.
Extracting invoice entities using spatial features
Invoice NER is harder than ordinary NER because invoices do not consist of coherent lines of text but mostly of tables and small text blocks. As a result, NLP algorithms designed for running text often perform poorly on them.
In this blog, we explain how a linear-chain CRF with spatial features can be used to extract named entities from invoices. A linear-chain CRF models the dependencies between the tag of a word and the tags of the previous and next words. In invoices, however, the tag of a word also depends on the words above and below it. A 2D CRF such as Grid CRF might capture these dependencies.
Before proceeding, let's explore what a CRF is, how it works, and why spatial features matter when extracting named entities from invoices.
What is CRF?
A CRF is a class of discriminative models used for prediction tasks. The idea is similar to logistic regression, where we try to model p(y|x), but with a trick. Rather than using one weight vector per class, a single set of weights is shared across all the classes. The idea is to define a set of feature functions that are non-zero only for a single class.
How does it work?
Let G be a factor graph over X and Y. Then (X,Y) is a conditional random field if, for any value x of X, the distribution p(y|x) factorizes according to G.
To build the conditional random field, we next assign each feature function fj a weight wj, which the algorithm is going to learn. The model equation is given below:
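$$p_\theta(y \mid x) = \frac{\exp\left(\sum_j w_j F_j(x, y)\right)}{\sum_{y'} \exp\left(\sum_j w_j F_j(x, y')\right)}, \quad \text{where } F_j(x, y) = \sum_{i=1}^{L} f_j(y_{i-1}, y_i, x, i)$$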
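To make the equation concrete, here is a minimal sketch that evaluates it by brute force on a three-word sentence. The label set, the three feature functions and the weight values are all hypothetical and chosen only for illustration. Real CRF libraries learn the weights and compute the normalizer with dynamic programming rather than by enumerating every label sequence.

```python
import itertools
import math

LABELS = ["O", "DATE"]  # hypothetical label set


# Hypothetical feature functions f_j(y_prev, y, x, i).
def f_digit_is_date(y_prev, y, x, i):
    return 1.0 if y == "DATE" and any(c.isdigit() for c in x[i]) else 0.0


def f_date_follows_date(y_prev, y, x, i):
    return 1.0 if y_prev == "DATE" and y == "DATE" else 0.0


def f_alpha_is_other(y_prev, y, x, i):
    return 1.0 if y == "O" and x[i].isalpha() else 0.0


FEATURES = [f_digit_is_date, f_date_follows_date, f_alpha_is_other]
WEIGHTS = [2.0, 0.5, 1.5]  # hypothetical values; normally learned


def score(x, y):
    """Compute sum_j w_j * F_j(x, y), with F_j summed over all positions."""
    total = 0.0
    for i in range(len(x)):
        y_prev = y[i - 1] if i > 0 else "<START>"
        for w, f in zip(WEIGHTS, FEATURES):
            total += w * f(y_prev, y[i], x, i)
    return total


def probability(x, y):
    """p(y | x): exponentiated score divided by the sum over all sequences."""
    z = sum(
        math.exp(score(x, y_alt))
        for y_alt in itertools.product(LABELS, repeat=len(x))
    )
    return math.exp(score(x, y)) / z


def best_labels(x):
    """Most probable label sequence, found by exhaustive search."""
    return max(
        itertools.product(LABELS, repeat=len(x)),
        key=lambda y: score(x, y),
    )


sentence = ["Due", "12", "Jan"]
print(best_labels(sentence), probability(sentence, best_labels(sentence)))
```

Because the denominator sums over every possible label sequence, the probabilities of all sequences for a given sentence add up to one.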
Spatial features and their importance
To find the most important information in an invoice, one needs to look for an associated hint. It might be the words to its left or right, or the words above and below it. To capture the words above and below, we introduced spatial features. They let us exploit the dependency of a class on its surrounding words.
Let us consider a situation of extracting important information from documents such as invoices and receipts, as shown in image below, by using layout, structure and content of the text.
Most of the important information in invoices will be in the form of key value pairs and tables. The keys can be on the left or on top of the values like in tables. The tag of a word might depend on the features of the words above and below it. The linear features won’t be able to utilise these relevant features. Most of the techniques focused on entity extraction using CRF try to leverage the linear features. But in the case of invoices, these methods miss some of the extremely important features.
So, we introduced spatial features to make use of this information. For example, the most important features for the client name (Vishwas Anand) and seller name (Lunchbox) are ‘Delivery to’ and ‘Ordered from’ respectively, as shown in the image above.
We tagged around 1400 documents, of which 1000 were used for training and 400 for validation.
For tagging, we used a labeling tool that lets us assign a class to a rectangular region and produces an XML file for each document.
Then, we used Tesseract to get words and their bounding boxes in .tsv format.
Using the XML files and their corresponding TSV files, we assigned a class to each word, i.e., all words inside a particular rectangular area were assigned to the same class.
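The word-level labeling step can be sketched as follows. This is an illustration under assumptions, not our exact pipeline: the tagged rectangles are assumed to be already parsed from the XML into dictionaries with hypothetical keys (label, xmin, ymin, xmax, ymax), a word counts as inside a rectangle when its center falls within it, and words outside every rectangle get a default "O" label. The column names in the sample are those of Tesseract's TSV output; the label name client_name is made up.

```python
import csv
import io


def read_words(tsv_text):
    """Parse Tesseract TSV text into a list of word dictionaries."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        if not text:
            continue  # skip page, block and line rows that carry no text
        words.append({
            "text": text,
            "left": int(row["left"]),
            "top": int(row["top"]),
            "width": int(row["width"]),
            "height": int(row["height"]),
            "conf": float(row["conf"]),
        })
    return words


def assign_classes(words, regions, default="O"):
    """Give each word the label of the first region containing its center."""
    for word in words:
        cx = word["left"] + word["width"] / 2
        cy = word["top"] + word["height"] / 2
        word["label"] = default
        for region in regions:
            inside_x = region["xmin"] <= cx <= region["xmax"]
            inside_y = region["ymin"] <= cy <= region["ymax"]
            if inside_x and inside_y:
                word["label"] = region["label"]
                break
    return words


HEADER = ["level", "page_num", "block_num", "par_num", "line_num",
          "word_num", "left", "top", "width", "height", "conf", "text"]
ROWS = [
    ["5", "1", "1", "1", "1", "1", "40", "100", "80", "20", "96", "Delivery"],
    ["5", "1", "1", "1", "1", "2", "130", "100", "30", "20", "95", "to"],
    ["5", "1", "1", "1", "2", "1", "40", "130", "90", "20", "93", "Vishwas"],
    ["5", "1", "1", "1", "2", "2", "140", "130", "70", "20", "91", "Anand"],
]
SAMPLE_TSV = "\n".join("\t".join(cols) for cols in [HEADER] + ROWS)

# Hypothetical tagged rectangle, as it might look after parsing the XML.
REGIONS = [
    {"label": "client_name", "xmin": 30, "ymin": 125, "xmax": 220, "ymax": 155},
]

for w in assign_classes(read_words(SAMPLE_TSV), REGIONS):
    print(w["text"], w["label"])
```

Testing the word center rather than the full bounding box is a design choice that tolerates rectangles drawn slightly too tight around the text.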
The images used in the blog are dummy images and are similar to the dataset used for training the model. We tagged 15 classes in the documents, and the datasets include around 39 different templates.
For the implementation, we used a linear-chain CRF with spatial features. Here are the steps we followed to solve this problem:
- First, we cleaned the data frames by removing rows where:
  - Tesseract confidence is less than 30%
  - The height of a word is greater than four times the average word height or less than half the average word height
  - The width of a word is greater than one-third of the image width
  - The text is an empty string or a null value
- Initially, even after sorting the words by top and left, the words were out of sequence. That could have been because characters such as ‘i’ and ‘f’ have different heights or because of distortions in the image. So, we assigned line numbers, which involves two steps:
  - Finding the connection between words based on the lowest distance and lowest angle between them; for each word, we store the next connected word, the closest horizontal distance and the line_number as horizontally_closest_features.
  - Sorting the words by top and left, and giving the same line number to all connected words by iterating over the dataframe
- We then extracted spaCy features such as labels and POS tags, and checked whether each word is present in the vocabulary, using spaCy's en_core_web_lg model.
- Then we extracted spatial features. We took the center of each word as the origin and drew a bounding box around the word, with the width of the invoice image as its width and a height of almost three times the word height both above and below the origin. We then divided the bounding box into six sectors, each with an angle of 60°, as shown in the image below. Thereafter, for each sector, we stored the word nearest to the current word along with its features, i.e., its distance and angle from the center, its POS tag and its label.
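The cleaning rules and the nearest-word search around each word can be sketched as below. This is a simplified illustration: it uses plain dictionaries instead of data frames so that it runs without extra libraries, and the helper names are hypothetical. Two details are assumptions on our part here, since the text above does not fix them: the six 60° regions (called sectors in the code) are numbered counterclockwise starting from the direction pointing right, and the average height is computed after empty rows are dropped.

```python
import math


def clean_words(words, image_width, min_conf=30):
    """Drop OCR rows that match the cleaning rules listed above."""
    words = [w for w in words if w["text"] and w["text"].strip()]
    if not words:
        return []
    avg_height = sum(w["height"] for w in words) / len(words)
    return [
        w for w in words
        if w["conf"] >= min_conf
        and avg_height / 2 <= w["height"] <= 4 * avg_height
        and w["width"] <= image_width / 3
    ]


def center(w):
    return (w["left"] + w["width"] / 2, w["top"] + w["height"] / 2)


def sector_neighbors(word, words, n_sectors=6, band_factor=3.0):
    """Nearest word per angular sector within a horizontal band."""
    cx, cy = center(word)
    band = band_factor * word["height"]  # allowed offset above and below
    sector_width = 360 / n_sectors
    nearest = {}
    for other in words:
        if other is word:
            continue
        ox, oy = center(other)
        if abs(oy - cy) > band:
            continue  # outside the bounding box around the current word
        dx, dy = ox - cx, cy - oy  # flip y so positive dy means "above"
        angle = math.degrees(math.atan2(dy, dx)) % 360
        sector = int(angle // sector_width) % n_sectors
        distance = math.hypot(dx, dy)
        if sector not in nearest or distance < nearest[sector]["distance"]:
            nearest[sector] = {
                "text": other["text"],
                "distance": distance,
                "angle": angle,
            }
    return nearest


def make_word(text, left, top, width, height=20, conf=95):
    return {"text": text, "left": left, "top": top,
            "width": width, "height": height, "conf": conf}


WORDS = [
    make_word("Delivery", 40, 100, 80),
    make_word("to", 130, 100, 30),
    make_word("Vishwas", 40, 130, 90),
    make_word("Anand", 140, 130, 70),
    make_word("|", 300, 100, 5, conf=12),         # low-confidence noise
    make_word("", 10, 10, 10),                    # empty string
    make_word("----------------", 100, 200, 300)  # wider than a third of the page
]

cleaned = clean_words(WORDS, image_width=600)
print(sector_neighbors(cleaned[2], cleaned))
```

In the full pipeline, the POS tag and label of each neighbor would be stored next to its distance and angle.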
We then created features for each word, since crfsuite takes a dictionary of features as input for each word. The features used are listed below:
- Simple features: These include basic features such as whether the word contains a digit, whether it starts with a capital or a small letter, whether it contains a special character, and the pattern of its first four characters.
- RegEx features: These include features such as whether the word is part of a date, price, phone number, email and so on.
- Linear features: These include the features of the previous three and next three words of the current word. In this problem, we only took words that had the same line number as the current word.
- Spatial features: These features include surrounding words and their features for each word.
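A per-word feature dictionary along these lines might look like the sketch below. The regular expressions, feature names and key prefixes are hypothetical and far simpler than what a production system would need. The input is assumed to be the list of words sharing one line number, plus an optional mapping from sector index to neighboring word for the spatial part.

```python
import re

# Hypothetical, deliberately simple patterns for illustration only.
DATE_RE = re.compile(r"^\d{1,2}[/-]\d{1,2}[/-]\d{2,4}$")
PRICE_RE = re.compile(r"^[$€£₹]?\d[\d,]*\.\d{2}$")
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")


def shape(text, n=4):
    """Pattern of the first n characters: d=digit, X=upper, x=lower."""
    out = []
    for ch in text[:n]:
        if ch.isdigit():
            out.append("d")
        elif ch.isupper():
            out.append("X")
        elif ch.islower():
            out.append("x")
        else:
            out.append(ch)
    return "".join(out)


def token_features(text, prefix=""):
    """Simple and RegEx features for a single word."""
    return {
        prefix + "has_digit": any(c.isdigit() for c in text),
        prefix + "starts_upper": text[:1].isupper(),
        prefix + "has_special": any(not c.isalnum() for c in text),
        prefix + "shape4": shape(text),
        prefix + "is_date": bool(DATE_RE.match(text)),
        prefix + "is_price": bool(PRICE_RE.match(text)),
        prefix + "is_email": bool(EMAIL_RE.match(text)),
    }


def word2features(line, i, neighbors=None, window=3):
    """Feature dictionary for line[i], a word in a list of same-line words."""
    features = {"bias": 1.0}
    features.update(token_features(line[i]))
    # Linear features: up to `window` words before and after on the same line.
    for offset in range(-window, window + 1):
        j = i + offset
        if offset == 0 or not 0 <= j < len(line):
            continue
        features.update(token_features(line[j], prefix=f"{offset:+d}:"))
    # Spatial features: nearest word in each sector around the current word.
    for sector, text in (neighbors or {}).items():
        features.update(token_features(text, prefix=f"sector{sector}:"))
    return features


line = ["Invoice", "Date:", "12/01/2020"]
print(word2features(line, 2, neighbors={1: "Total"}))
```

Prefixing each key with the relative position keeps the features of the current word, its linear neighbors and its spatial neighbors separate, so the model can learn a different weight for each.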
We then trained the model with the sklearn-crfsuite library, using the Limited-memory BFGS (L-BFGS) optimization algorithm, and used scikit-learn's RandomizedSearchCV to find the best c1 (the coefficient for L1 regularization) and c2 (the coefficient for L2 regularization). The F1 score without spatial features was 84.6%; with spatial features, it increased to 90.3%.
Finally, the flat classification report for the model with spatial features looked like this:
Appendix: Other alternatives
There are other approaches that might be worth trying:
- Instead of manually creating features, we can use an LSTM with positional embeddings and put a CRF layer on top of it.
- We can implement segmentation-based approaches like Chargrid or BERTgrid.