Contextual code search across more than 1 billion lines of code in GitHub projects

About the product
KodeBeagle is a free, open source code search engine that finds and analyzes code references when you need them. It indexes over 1 billion lines of open source code from approximately 550,000 GitHub project repositories.

Tech stack
Elasticsearch
Apache Spark

The problem
Today, most code search engines rely on full-text search, so their results are simply textual occurrences of the search keywords. They have no way of knowing whether a searched keyword is a type, a field or a method. Engineers struggle with code search tools every day. So we asked ourselves: what if code search was non-textual? Our goal was to build a contextual code search tool that scans GitHub projects known for best practices and quality, and shows you the correct usage of classes.

The solution
Most code search tools are textual. KodeBeagle, by contrast, understands the abstract syntax tree (AST) of the code and scans thousands of GitHub projects to find the correct usage of classes. For instance, if you want to know how to use FileChannel and ByteBuffer, KodeBeagle retrieves code snippets from GitHub projects that use them and scores the results by GitHub watches, relevance, quality and other parameters.
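The difference between textual and AST-aware search can be illustrated with a small sketch. This is not KodeBeagle's implementation: it uses Python's standard ast module on a Python snippet as a stand-in, and the names SOURCE, textual_hits and ast_hits are hypothetical, chosen only for this illustration. It shows a plain text match reporting a keyword inside a comment and a string, while the AST-based match reports only real program elements along with their kind.

```python
import ast

# Hypothetical sample source: "BytesIO" appears in a comment, in a string
# literal, and once as an actual call.
SOURCE = '''
import io

# BytesIO is mentioned in this comment only
def read_header(data):
    label = "BytesIO in a string"
    stream = io.BytesIO(data)
    return stream.read(4)
'''


def textual_hits(source, keyword):
    """Return line numbers where the keyword appears as plain text."""
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if keyword in line
    ]


def ast_hits(source, keyword):
    """Return sorted (line, kind) pairs where the keyword is a call or a definition."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            # A call target is either `module.name` (Attribute) or `name` (Name).
            if isinstance(func, ast.Attribute):
                name = func.attr
            else:
                name = getattr(func, "id", None)
            if name == keyword:
                hits.append((node.lineno, "call"))
        elif isinstance(node, (ast.ClassDef, ast.FunctionDef)):
            if node.name == keyword:
                hits.append((node.lineno, "definition"))
    return sorted(hits)


if __name__ == "__main__":
    print("textual:", textual_hits(SOURCE, "BytesIO"))
    print("ast:    ", ast_hits(SOURCE, "BytesIO"))
```

In this sketch the textual search matches three lines, including the comment and the string, while the AST search matches only the line where BytesIO is actually called. A production engine would index many more node kinds (fields, imports, type references) and would do so for the language being searched.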

It is a context-sensitive search that gives you the right information at the right time. You can search by terms or by code snippets. Unlike text-based search platforms, KodeBeagle takes the context from the code snippet you provide. The matching code is then checked for API-specific idiom suggestions without losing that context.

The benefits
KodeBeagle provides contextual and intelligent code suggestions inside your IDE. Given a code snippet, it shows the most idiomatic usages. It leverages abundantly available “standard” code corpora to learn interesting and useful patterns. It allows code search through natural language queries, and it summarizes new projects and files to aid quick learning.
In a test run on 408 repos, the analysis found that 8 repos shared a common topic. We can also use mapreduce as a query term for these repos. Because latent semantic analysis (LSA) does not need a predefined categorization, the analysis did not need any prior knowledge of terms like mapreduce. Advanced NLP research and big data engineering techniques help increase the speed of search.
