Building Data Pipeline 101

Haridas N

IDC predicts that the collective sum of the world’s data will grow from 33 zettabytes to 175 zettabytes by 2025, a compound annual growth rate of 61 percent. Buried deep under this ocean of data is the “captive intelligence” that companies can use to improve and expand their business. Machine learning (ML) helps businesses manage, analyze, and use data more effectively than ever before. It identifies patterns in data through supervised and unsupervised learning algorithms and turns them into actionable insights. Recommendation engines on travel, shopping, and entertainment websites are typical examples; they use consumer data to make personalized offerings.

Data pipeline – An overview

A carefully managed data pipeline can provide companies with seamless access to reliable and well-structured datasets. A data pipeline is a generalized means of moving data from a source system A to a destination system B. These systems can be developed in small pieces and integrated with data, logic, and algorithms to perform complex transformations.

To develop a scalable data pipeline, we need to select the technology and the processes associated with it carefully. The tech stack and framework used in the POC (proof of concept) stage of a data pipeline usually fall short in a production environment because of heavier workloads and other issues. The choice of architecture plays a crucial role in managing a massive volume of data and decoupling the different components. Therefore, to move a prototype to a production environment, a robust and scalable data pipeline has to be built.

Data pipelines can be broadly classified into two classes:

1. Batch processing runs scheduled jobs periodically to generate dashboards or other specific insights

2. Stream processing handles events in real time as they arrive and detects conditions within a short time window, such as anomaly or fraud detection

For both batch and stream processing, a clear understanding of the data pipeline stages listed below is essential to build a scalable pipeline:

1. Data collection and preprocessing

2. Feature design and extraction

3. Feature and model storage

4. Training the model

5. Serving the trained model

Data collection and cleansing

The success of a model relies on the type of data it is exposed to, so collecting and cleaning data plays a significant role in the data pipeline. The type of data source determines the collection mode: data from sources like databases or crawlers is collected in batches, while data from sources like IoT events is collected as a stream.

  • Stream pipelines behave like micro-batches, with the interval between events measured in seconds, milliseconds, or less
  • Batch pipelines expect data to arrive periodically, such as every hour, day, or week. The primary thing to ensure in a batch pipeline is that each batch finishes processing before the next one arrives (see the sketch after this list)
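
To make the contrast concrete, here is a minimal PySpark sketch; the Kafka broker address, topic name, and output paths are assumptions rather than details from the article. It expresses the same ingestion once as a micro-batch stream with a short trigger interval and once as a run-once, batch-style job.

```python
# Hypothetical sketch: the broker address, topic name, and paths are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("trigger-demo").getOrCreate()

# Read incoming events from a (hypothetical) Kafka topic.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()
)

# Stream pipeline: micro-batches triggered every few seconds.
stream_query = (
    events.writeStream.format("parquet")
    .option("path", "/data/events/stream")
    .option("checkpointLocation", "/data/checkpoints/stream")
    .trigger(processingTime="10 seconds")  # micro-batch interval
    .start()
)

# Batch-style run: process everything available once, then stop, so the job
# finishes before the next scheduled batch.
batch_query = (
    events.writeStream.format("parquet")
    .option("path", "/data/events/batch")
    .option("checkpointLocation", "/data/checkpoints/batch")
    .trigger(once=True)
    .start()
)
```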

Data cleansing and structuring are an integral part of data collection in both ETL (Extract, Transform and Load) and ML pipelines. The data from the source may not be in the form we want, or may carry a lot of unwanted detail. Sometimes we may have to enrich the data with additional information based on the data source. Aggressive cleaning is not required at this stage; instead, obvious issues such as Unicode problems are fixed and the data is structured into a JSON, Parquet, or Avro schema. The focus is on ensuring that the data is usable by downstream pipelines and data analytics. A few tools generally used in this stage are Apache Spark, Apache NiFi, Apache Kafka, and Apache Airflow.
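
As a minimal sketch of this stage, the PySpark snippet below normalizes Unicode in a text field, drops obviously broken records, and writes the result as Parquet; the input path, column names, and output location are illustrative assumptions.

```python
# Cleansing sketch; paths and column names are illustrative assumptions.
import unicodedata

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("cleanse").getOrCreate()

# Raw, semi-structured input collected from the source system.
raw = spark.read.json("/data/raw/events/")

# Normalize Unicode (full-width characters, combining marks, stray whitespace).
@F.udf(returnType=StringType())
def normalize_text(value):
    if value is None:
        return None
    return unicodedata.normalize("NFKC", value).strip()

cleansed = (
    raw.withColumn("description", normalize_text(F.col("description")))
    .dropna(subset=["event_id", "timestamp"])  # drop obviously broken records
    .dropDuplicates(["event_id"])
)

# Structure the data into Parquet so downstream pipelines can consume it.
cleansed.write.mode("overwrite").parquet("/data/cleansed/events/")
```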

Data that comes out of this stage, along with intermediate results from other stages, goes to the storage layer.

Storage layer

The cleansed data, along with the generated features, needs to be stored so that it can be shared with downstream pipelines. In pipelines, storage is a loosely coupled unit where the data is saved in standard formats, not necessarily in the form required by the downstream systems. The data in the storage layer is still raw and can serve multiple use cases. The storage systems may expose interfaces or APIs through which downstream pipelines can access the relevant data. Based on the storage layer, we can enable the downstream pipeline to work in batch or stream mode. All the intermediate results from different stages are also stored here.

Big data storage such as the Hadoop Distributed File System (HDFS), Amazon S3, and Google Cloud Storage (GCS), or scalable NoSQL stores like Cassandra, can be used as the storage layer. Most big data tools are compatible with these storage services or with in-house Hadoop clusters. A downstream pipeline can process the data in batch or stream mode and store the results in HDFS or other big data storage, based on the volume and freshness requirements.
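
A downstream job typically just points at this shared storage layer. The sketch below reads the cleansed Parquet data from S3 through the s3a connector and writes an aggregated intermediate result back; the bucket name and key prefixes are hypothetical.

```python
# Storage-layer access sketch; the bucket and prefixes are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("downstream-batch").getOrCreate()

# Read the cleansed data written by the collection stage.
events = spark.read.parquet("s3a://example-data-lake/cleansed/events/")

# A simple downstream transformation: daily event counts per source.
daily_counts = (
    events.withColumn("day", F.to_date("timestamp"))
    .groupBy("day", "source")
    .count()
)

# Store the intermediate result back in the shared storage layer.
daily_counts.write.mode("overwrite").parquet(
    "s3a://example-data-lake/derived/daily_counts/"
)
```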

Feature extraction for different models

According to Oracle, feature extraction is an attribute reduction process that results in a much smaller and richer set of attributes. Depending on the requirements, identifying and extracting an informative and compact feature set for an ML model may involve structured data like numbers and dates, or unstructured data like categorical features and raw text. If the data volume is large, feature extraction can be handled separately, and the generated features can be stored in the storage layer. The stored features are in a format ready for direct consumption by the ML training process in the next phase.
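
As a small illustration of this step, the snippet below derives a handful of numeric and categorical features from the cleansed events and stores them for training; the column names, the chosen features, and the label column are assumptions made for the example.

```python
# Feature-extraction sketch; columns, features, and label are illustrative.
import numpy as np
import pandas as pd

# Cleansed data produced by the earlier stage (hypothetical path).
df = pd.read_parquet("/data/cleansed/events/")

features = pd.DataFrame(index=df.index)

# Structured features: numbers and dates.
features["hour_of_day"] = pd.to_datetime(df["timestamp"]).dt.hour
features["amount_log"] = np.log1p(df["amount"].clip(lower=0))

# Categorical feature: one-hot encode the event source.
features = features.join(pd.get_dummies(df["source"], prefix="source"))

# Keep the target alongside the features so training can consume one file.
features["label"] = df["label"]

features.to_parquet("/data/features/events_features.parquet")
```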

Feature extraction can be done for a wide range of applications, such as a simple ETL process, a model prediction pipeline, or retraining the model on new data to improve its accuracy. The processing can be done online (stream processing) or offline (batch processing). Processing data by the hour, by the day, or at other fixed intervals can be considered batch processing, whereas data processed within a fraction of a minute or second can be considered stream processing. The processing type varies based on the data availability and the use case at hand.

For Stream processing:
To decide the dynamic prices of an online taxi booking service, real-time local demand from a given area is required. The requests from that area need to be processed in real time (usually under a minute), so online or stream processing is necessary to build this pipeline.
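
A sketch of such a demand signal is shown below: it counts booking requests per area over a sliding one-minute window with Spark Structured Streaming. The Kafka topic, message schema, window length, and console sink are assumptions for the example; in practice the output would feed the pricing service.

```python
# Stream-processing sketch for a real-time demand signal; the topic name,
# schema, and window size are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("demand-stream").getOrCreate()

schema = StructType([
    StructField("area_id", StringType()),
    StructField("request_time", TimestampType()),
])

# Parse booking requests arriving on a (hypothetical) Kafka topic.
requests = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "ride_requests")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
    .select("r.*")
)

# Demand per area over the last minute, refreshed every 10 seconds.
demand = (
    requests.withWatermark("request_time", "2 minutes")
    .groupBy(F.window("request_time", "1 minute", "10 seconds"), "area_id")
    .count()
)

query = (
    demand.writeStream.outputMode("update")
    .format("console")  # stand-in for the downstream pricing service
    .trigger(processingTime="10 seconds")
    .start()
)
```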

For Batch processing:
Most ML models follow batch processing. Data science teams work independently on data exported through various means to make it usable; later, that work can be moved into production with the help of automated release management tools. For example, it is common in the industry for clients to hand over entire datasets, based on which ML models are developed. Python, Java applications, and Spark are some of the tools used for feature extraction.
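
In practice, such batch feature-extraction and retraining jobs are usually wired together by an orchestrator. The sketch below shows a daily Apache Airflow DAG; the DAG id, schedule, and the job scripts it invokes are assumptions made for the example.

```python
# Orchestration sketch with Apache Airflow; the DAG id, schedule, and the
# scripts being invoked are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_feature_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    cleanse = BashOperator(
        task_id="cleanse_raw_data",
        bash_command="spark-submit /jobs/cleanse_events.py",
    )
    extract_features = BashOperator(
        task_id="extract_features",
        bash_command="spark-submit /jobs/extract_features.py",
    )
    train_model = BashOperator(
        task_id="train_model",
        bash_command="python /jobs/train_model.py",
    )

    # Each step of the batch must finish before the next one starts.
    cleanse >> extract_features >> train_model
```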

Model design and training

Model design is the phase where we find out which type of model best fits the problem at hand; the decision is based on the data, its type and availability, and the problem type. At this stage, data is decoupled from the standard pipeline and processed offline for both stream and batch sources, so that it can be consumed at will. Soon after the model design, the model is trained using the preprocessed data or the prepared features.

Once the data is available via the standard pipelines, the data science teams start training the model. After tuning the model for maximum performance, it can be moved into the release pipeline by following the standard release management and ops processes. Some of the tools used in this stage are Keras, TensorFlow, and Spark.
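
Continuing the running example, the sketch below trains a small Keras classifier on the stored features and exports it for serving; the feature file, column names, and network size are assumptions rather than a prescribed architecture.

```python
# Training sketch; the feature file, columns, and network size are illustrative.
import pandas as pd
import tensorflow as tf

# Load the features prepared by the feature-extraction stage.
features = pd.read_parquet("/data/features/events_features.parquet")
labels = features.pop("label").astype("float32").values
inputs = features.astype("float32").values

# A small fully connected binary classifier.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

model.fit(inputs, labels, epochs=5, validation_split=0.2)

# Export a SavedModel directory that TF Serving can load
# (with Keras 3, use model.export() instead of model.save()).
model.save("/models/demand_model/1")
```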

Serving the trained model

This is the phase where the trained models are deployed into live systems or used on real datasets. It follows the standard software deployment and release management processes for releasing model versions.

Multiple platforms are available to ensure smooth deployment and lifecycle management of ML models; platforms like TensorFlow Serving (TF Serving) and MLflow can be used to simplify ML model deployment and lifecycle management.
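
As a small client-side sketch, a model exported for TF Serving can be queried through its REST predict API; the host, port, model name, and feature vector below are assumptions for illustration.

```python
# Sketch of querying a model hosted by TF Serving over its REST API.
# The host, port, model name, and feature vector are hypothetical.
import json

import requests

SERVING_URL = "http://localhost:8501/v1/models/demand_model:predict"

# One instance per prediction; the shape must match what the model was trained on.
payload = {"instances": [[17.0, 0.4, 1.0, 0.0, 0.0]]}

response = requests.post(SERVING_URL, data=json.dumps(payload), timeout=5)
response.raise_for_status()

predictions = response.json()["predictions"]
print(predictions)
```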

To ensure the quality of the model, multiple validations are done with the help of A/B testing or beta testing. Tools like TF Serving and Spark help serve the trained model.

What next?

A solid data pipeline holds the promise of unlocking the dark data hidden in silos. A flexible, efficient, and economical pipeline with a minimal maintenance and cost footprint allows you to build innovative solutions.