ML/NLP-based patent portfolio categorization

Case study


The constant upward trend in innovation across diverse industries has led to organic growth in both patent generation and infringement. High-profile patent litigation, infringement cases, and trials result in substantial legal and settlement costs. Patent litigation has tripled over the past thirty years, and estimated costs have crossed $300 billion. This stems from a lack of access to reliable data on litigation and patent transactions.

To manage risk in the ever-changing patent landscape, decision makers need data and intelligence, and they turn to service providers to search, analyze, and gain insights.


Our client is a world leader in patent risk management services, specializing in defensive buying, patent acquisition syndication, patent intelligence, and patent consulting. They had over 20 million documents to categorize, which made manual categorization infeasible. They wanted a solution to:

  • Dynamically identify and group similar patents
  • Provide insights into patent portfolios


Challenges

  • Lack of predefined classification formats for the documents
  • Lack of consistent evaluation metrics to ensure high levels of accuracy in document classification
  • Lack of hierarchical groups for industry verticals to classify documents at a granular level


Due to the complexity and scale of the project, Imaginea decided to execute document classification through unsupervised ML clustering.

Multiple evaluation metrics were used to select the best LDA model from among the trained models, and to identify the optimal groups/topics within a given model run.
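The case study does not name the metrics that were used. As one possible illustration, the sketch below scores topics with UMass coherence, a commonly used topic-quality measure, and uses it both to rank candidate model runs and to rank topics within a single run. The function names, the toy documents, and the two hypothetical runs ("run_a", "run_b") are assumptions made for this example, not details of the actual project.

```python
import math
from itertools import combinations


def umass_coherence(top_words, docs):
    """UMass coherence of one topic: sum of log((D(wi, wj) + 1) / D(wi))
    over pairs of top words, where wi is ranked above wj in the topic."""
    doc_sets = [set(doc) for doc in docs]
    score = 0.0
    for higher, lower in combinations(top_words, 2):
        df_higher = sum(1 for s in doc_sets if higher in s)
        df_both = sum(1 for s in doc_sets if higher in s and lower in s)
        if df_higher == 0:
            continue  # the word never occurs in the corpus; skip this pair
        score += math.log((df_both + 1) / df_higher)
    return score


def rank_topics(topics, docs):
    """Sort the topics of one model run from most to least coherent."""
    return sorted(topics, key=lambda t: umass_coherence(t, docs), reverse=True)


def rank_models(models, docs):
    """Rank model runs by the mean coherence of their topics (best first)."""
    mean_score = {
        name: sum(umass_coherence(t, docs) for t in topics) / len(topics)
        for name, topics in models.items()
    }
    ranking = sorted(mean_score, key=mean_score.get, reverse=True)
    return ranking, mean_score


# Hypothetical tokenized documents and top words from two hypothetical runs.
docs = [
    ["battery", "cell", "anode"],
    ["battery", "anode", "charge"],
    ["antenna", "signal", "radio"],
    ["antenna", "radio", "frequency"],
]
models = {
    "run_a": [["battery", "anode"], ["antenna", "radio"]],
    "run_b": [["battery", "radio"], ["antenna", "anode"]],
}

ranking, mean_score = rank_models(models, docs)
print(ranking)  # run_a first: its topic words actually co-occur in documents
```

In practice, several metrics would typically be compared side by side, since any single score can favor a particular number of topics.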

The scope of the solution was designed to include the complete range of document groups, right from minor topics to the topics of highest importance. Also, a hierarchy of related documents was formed within a specific industry vertical.


How our solution helped

Built a scalable pipeline with the potential to process 100 million records

Overall approach

Unsupervised ML clustering was implemented to classify the patent documents based on the number of documents and industry verticals. For this purpose, Latent Dirichlet Allocation (LDA), a statistical topic model used in NLP, was applied to identify the right topic mix for each document.
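To make the idea of a per-document "topic mix" concrete, here is a minimal pure-Python sketch of LDA fitted with collapsed Gibbs sampling on a toy corpus, followed by grouping documents under their dominant topic. It is an illustration only: the function names, hyperparameters (alpha, beta, iterations), and sample documents are hypothetical, and a pipeline handling millions of patents would rely on a distributed implementation rather than code like this.

```python
import random


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; returns one topic mixture per document."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]                # words in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # times word w is assigned to topic k
    topic_total = [0] * num_topics                              # total words assigned to topic k
    assignments = []

    def update(d, wid, k, delta):
        doc_topic[d][k] += delta
        topic_word[k][wid] += delta
        topic_total[k] += delta

    # Start from a random topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        row = []
        for word in doc:
            k = int(rng.random() * num_topics)
            row.append(k)
            update(d, word_id[word], k, +1)
        assignments.append(row)

    # Resample each word's topic conditioned on all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                wid = word_id[word]
                update(d, wid, assignments[d][i], -1)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                r = rng.random() * sum(weights)
                k = 0
                while k < num_topics - 1 and r >= weights[k]:
                    r -= weights[k]
                    k += 1
                assignments[d][i] = k
                update(d, wid, k, +1)

    # Smoothed topic proportions per document (each row sums to 1).
    return [
        [(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]


def group_by_dominant_topic(theta):
    """Map each topic index to the documents whose largest share belongs to it."""
    groups = {}
    for d, mix in enumerate(theta):
        dominant = max(range(len(mix)), key=mix.__getitem__)
        groups.setdefault(dominant, []).append(d)
    return groups


# Hypothetical, already tokenized patent snippets.
docs = [
    ["battery", "cell", "anode", "battery"],
    ["anode", "charge", "battery", "cell"],
    ["antenna", "signal", "radio", "antenna"],
    ["radio", "frequency", "antenna", "signal"],
]
theta = fit_lda(docs, num_topics=2)
groups = group_by_dominant_topic(theta)
for topic, members in sorted(groups.items()):
    print(f"topic {topic}: documents {members}")
```

Because LDA gives every document a mixture rather than a single label, a document can also be attached to several groups by thresholding its topic shares instead of taking only the largest one.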

Big data tools were used for document pre-processing, clustering, post-processing, and model evaluation. The image below illustrates the complete document clustering process:


  • Substantial improvement in the ETL (extract, transform, and load) process
  • Higher data quality due to the availability of multiple evaluation metrics
  • On-demand predictive model as a service with minimal maintenance and cost footprint
  • Contributed an LDA module to the open-source MLeap project
