- Lack of predefined classification formats for the documents
- Lack of consistent evaluation metrics to ensure high levels of accuracy in document classification
- Lack of hierarchical groups for industry verticals to classify documents at a granular level
Given the complexity and scale of the project, Imaginea chose to classify the documents using unsupervised ML clustering.
Multiple evaluation metrics helped determine the best LDA model among the trained candidates, as well as identify the optimum groups/topics within a given model run.
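As an illustration of metric-driven model selection, the sketch below scores candidate topic models with UMass topic coherence and keeps the highest-scoring one. The case study does not name the metrics that were actually used, so UMass coherence is an assumption here, and the toy corpus, the candidate names, and the helper functions (`umass_coherence`, `mean_coherence`, `pick_best_model`) are all hypothetical.

```python
import math

def umass_coherence(top_words, docs):
    """UMass coherence of one topic: sum over ordered word pairs of
    log((D(w_m, w_l) + 1) / D(w_l)), where D counts documents."""
    doc_sets = [set(d) for d in docs]
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            w_m, w_l = top_words[m], top_words[l]
            d_l = sum(1 for s in doc_sets if w_l in s)
            if d_l == 0:
                continue  # word never occurs in the corpus; skip the pair
            d_ml = sum(1 for s in doc_sets if w_m in s and w_l in s)
            score += math.log((d_ml + 1) / d_l)
    return score

def mean_coherence(topics, docs):
    """Average coherence over all topics of one model."""
    return sum(umass_coherence(t, docs) for t in topics) / len(topics)

def pick_best_model(candidates, docs):
    """candidates maps a model name to its topics (lists of top words)."""
    return max(candidates, key=lambda name: mean_coherence(candidates[name], docs))

# Hypothetical tokenized documents and two hypothetical trained models.
docs = [
    ["battery", "cell", "electrode"],
    ["battery", "electrode", "charge"],
    ["antenna", "signal", "receiver"],
    ["antenna", "receiver", "signal"],
]
candidates = {
    "coherent": [["battery", "electrode"], ["antenna", "signal"]],
    "mixed": [["battery", "antenna"], ["electrode", "signal"]],
}
best = pick_best_model(candidates, docs)
print(best)
```

In practice several metrics would likely be combined, since a single coherence score can favor overly narrow topics; the same `max` over candidates pattern applies to any scoring function.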
The solution was scoped to cover the complete range of document groups, from minor topics to those of the highest importance. It also formed a hierarchy of related documents within a specific industry vertical.
How our solution helped
Built a scalable pipeline with the potential to process 100 million records
Unsupervised ML clustering was used to classify the patent documents, taking into account the number of documents and the industry verticals. For this purpose, Latent Dirichlet Allocation (LDA), a statistical topic model used in NLP, was applied to identify the right topic mix for the documents.
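To make the idea of a per-document "topic mix" concrete, here is a minimal, standard-library sketch of LDA fitted with collapsed Gibbs sampling. It is not the production pipeline described here, which ran on big data tooling; the function name `lda_gibbs`, the hyperparameter values, and the toy documents are assumptions chosen for illustration.

```python
import random

def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns each document's topic mixture."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    # Count tables: document-topic, topic-word, and per-topic totals.
    n_dk = [[0] * num_topics for _ in docs]
    n_kw = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    n_k = [0] * num_topics
    # Assign every token to a random topic to start.
    z = []
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(num_topics)
            z[d].append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    # Smoothed estimate of the topic mixture for each document.
    return [
        [(n_dk[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]

# Hypothetical tokenized patent abstracts.
toy_docs = [
    ["battery", "electrode", "cell", "battery"],
    ["electrode", "charge", "battery"],
    ["antenna", "signal", "receiver"],
    ["signal", "antenna", "receiver", "signal"],
]
mixtures = lda_gibbs(toy_docs, num_topics=2)
for mix in mixtures:
    print([round(p, 3) for p in mix])
```

Each returned row is a probability distribution over topics for one document. Assigning a document to its highest-weight topic gives a hard cluster label, while the full mixture preserves documents that span several topics.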
Big data tools were used for document preprocessing, clustering, post-processing, and model evaluation. The image below illustrates the complete document clustering process:
- Substantial improvement in the ETL (extract, transform, and load) process
- Higher data quality due to the availability of multiple evaluation metrics
- On-demand predictive model as a service with minimal maintenance and cost footprint
- Contributed an LDA module to the MLeap open source project