Winning the battle with data analytics using Spring Cloud Data Flow
Being data-driven is one of the most essential prerequisites for any organization to achieve the desired digital transformation. Data also makes a real impact on the bottom line, which makes it even more compelling for organizations to treat their data as an asset.
Today, organizations compete on analytics not because they can, but because they must, driven by changing customer needs. This has pushed businesses to find answers to key questions like:
- What time of the day do most of the sales happen?
- How long per day do users watch the ads?
- Which type of device do users use the most for accessing the application?
- Which customers are likely to make a website or in-app purchase?
- If a user signals buying intention, what is the preferred day or mode of payment?
- Who is likely to commit fraud?
Data analytics, done right, has the potential to answer these vitally important questions. With this agenda, many organizations aspire to be data-driven and to compete on data analytics.
The bar was set high a decade ago, but the transformation has been slower than expected. There is ample evidence of modern enterprises failing to become data-driven. One such study is NewVantage Partners’ 2019 Big Data and AI Executive Survey, led by Randy Bean.
The survey comprised several C-level technology and business executives representing large corporations such as American Express, Ford Motor, General Electric, General Motors, and Johnson & Johnson.
Despite tremendous investments in big data and AI initiatives, more than 50% of the surveyed organizations are not yet reaping the full benefits of data-driven business models.
Why are enterprises failing to be data-driven?
Multiple factors have an impact on digital transformation efforts. In the context of data, the three V’s – Volume, Variety and Velocity – are major influencers that make or break a data-driven transformation.
Organizations regularly upgrade storage and processing power to support expected data volumes. Still, handling the ever-growing volume of data produced and consumed within and outside the organization is becoming difficult.
The data produced by various IoT sensors, devices and applications is of mixed type – structured, semi-structured or unstructured. Many traditional, proprietary data platforms and tools find it difficult to support some of these data types efficiently.
All these sensors and applications generate data at tremendous speed. Extracting, Transforming and Loading (ETL) data in real time, and delivering analytics at the same speed, is a herculean task for many ETL tools.
Aspiring to be data-driven
One of our customers, a leader in through-channel marketing automation serving the small and medium-sized business (SMB) segment, wanted to become real-time data-driven and compete on data analytics. They did not want to be left behind using traditional ETL tools that analyze only historical or near-real-time data.
The client’s existing product had well-defined integrations that leveraged Pentaho (Kettle) ETL features. With these features, the business managed to thrive for almost a decade. But in a business environment where requirements change dynamically, the old product fell short in these aspects:
- Analytics on Real-Time Data: The end users of the business wanted to make decisions based on real-time data rather than near-real-time/ old data.
- Not Business Developer Friendly: Business developers heavily depended on the ETL developers for every customization in the data pipeline.
- Lack of Automation: The number of iterations they had to go through in delivering data pipelines to production was huge, as the process involved end-to-end manual steps.
- Multiple SaaS Channels: Pentaho supported only Google and Salesforce as SaaS sources, but the business wanted most of the social platforms as data sources. Pentaho’s latest version 8.0 supports real-time streaming data, but it is not designed to receive streams from all social platforms off the shelf; instead, ETL developers must configure several steps here and there.
- Support for Multiple Authentication Protocols: Each API demands a different authentication mechanism. Though most platforms support either OAuth or OAuth 2, Basic Authentication was also required.
- Data Fields Selection & Transformation: The value of any streamed response depended entirely on how it was stored for further processing. Beyond storing the data or applying a transformation, nothing more could be expected from the ETL integration tool. The challenges to address included easy selection of fields from the response, and minimal transformation when changing formats for locale-sensitive fields like date/time and currency.
- Business Developer Friendly: Business developers should be able to create a pipeline for any of the social platforms without much effort. A simple drag and drop of components should be enough to create any data pipeline. Such a tailor-made data pipeline creator is not supported out of the box by any data integration tool on the market.
- Time to Market: If a business developer wants to change a step in the pipeline – either by adding a filter or by removing a verification step – they should be able to deploy the change into the respective environment for testing and deliver it to production in no time.
We conducted day-long workshops and design thinking sessions to understand the pain points of all the stakeholders – Product Managers, Business Developers and Data Engineers.
The functional and non-functional requirements were carefully collected, and we came up with a recommended solution: creating data pipelines dynamically by leveraging Spring Cloud Data Flow. It met the expectations!
The functional view
- Data Sources: The incoming real-time streams from various social platforms.
- API Connector: Connectors supporting various API styles (SOAP, REST and Graph) and authentication protocols (OAuth, OpenID, Basic).
- Metadata Picker: Lists all the attributes from the respective stream that can be chosen for further processing.
- Data Formatter: Customises the attributes by formatting them and applying transformation logic.
- DB Tables: The sink tables (GCP Bigtable) against which the data analytics queries are fired.
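To make the Data Formatter step concrete, here is a minimal plain-Java sketch of a formatting transformation expressed as a `java.util.function.Function` – the same functional shape Spring Cloud Stream processors use. The field name (`created_at`) and the date patterns are illustrative assumptions, not the product’s actual schema:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class DataFormatter {

    // Returns a transformation that reformats one date field of a record
    // from a source pattern into a locale-specific target pattern.
    static Function<Map<String, String>, Map<String, String>> dateFormatter(
            String field, String sourcePattern, String targetPattern) {
        DateTimeFormatter in = DateTimeFormatter.ofPattern(sourcePattern);
        DateTimeFormatter out = DateTimeFormatter.ofPattern(targetPattern);
        return record -> {
            Map<String, String> formatted = new HashMap<>(record);
            formatted.computeIfPresent(field,
                    (k, v) -> LocalDate.parse(v, in).format(out));
            return formatted;
        };
    }

    public static void main(String[] args) {
        Map<String, String> record = Map.of("created_at", "2019-11-05", "amount", "42.50");
        Map<String, String> result =
                dateFormatter("created_at", "yyyy-MM-dd", "dd/MM/yyyy").apply(record);
        System.out.println(result.get("created_at")); // 05/11/2019
    }
}
```

In the real pipeline, such a function would run inside a processor application between the source and the sink; composing several of them covers the date/time and currency cases mentioned above.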
Authentication & Authorisation
The allowed users come from GSuite (the enterprise’s Identity Provider). Access to the application is controlled with Cloud IAP, while user authorization is granted from GSuite.
GCR with App or Task Images
A container repository holds pre-built images of applications/tasks developed as Source, Processor and Sink types. These images are templates used for creating the components in any stream.
The images accept configurable properties such as endpoint URI, access token, consumer key and consumer secret, and pass them to the underlying applications/tasks, which consume, process or emit data according to their type.
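As an illustrative sketch of how one of these images might receive its configurable properties: in the real applications these would arrive as Spring Boot properties injected by the deployer, but here a plain map stands in for the container environment, and the variable names are assumptions:

```java
import java.util.Map;
import java.util.Objects;

public class ConnectorProperties {
    final String endpointUri;
    final String accessToken;
    final String consumerKey;
    final String consumerSecret;

    ConnectorProperties(String endpointUri, String accessToken,
                        String consumerKey, String consumerSecret) {
        // The endpoint is mandatory; credentials depend on the auth protocol in use.
        this.endpointUri = Objects.requireNonNull(endpointUri, "endpointUri is mandatory");
        this.accessToken = accessToken;
        this.consumerKey = consumerKey;
        this.consumerSecret = consumerSecret;
    }

    // Builds the properties from deployer-injected variables (names are assumptions).
    static ConnectorProperties from(Map<String, String> env) {
        return new ConnectorProperties(
                env.get("ENDPOINT_URI"),
                env.get("ACCESS_TOKEN"),
                env.get("CONSUMER_KEY"),
                env.get("CONSUMER_SECRET"));
    }

    public static void main(String[] args) {
        ConnectorProperties props = ConnectorProperties.from(Map.of(
                "ENDPOINT_URI", "https://example.test/stream",
                "ACCESS_TOKEN", "token-123"));
        System.out.println(props.endpointUri); // https://example.test/stream
    }
}
```

Because every image follows this contract, the same template can be reused for any stream by supplying a different set of properties at deployment time.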
Data Pipeline Creator UI
This is meant to:
- Configure the data sources with respective authentication configurations
- Render the metadata for selection/ filtering
- Specify the data formatting per data source
Data Pipeline Creator Service
A containerized microservice exposing REST APIs that are used by the Data Pipeline Creator UI.
All the configurations specific to a data pipeline, from source to destination, are preserved in the service’s metadata database.
This service abstracts the Spring Cloud Data Flow services for the given needs, as described below:
Spring Cloud Data Flow Server
This server is responsible for creating a data pipeline per data source configuration and deploying it to the Skipper Server as a data stream.
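Under the hood, such a pipeline is expressed in Data Flow’s stream definition DSL (`source | processor | sink`, with `--prop=value` options per app). A minimal sketch of how the creator service might compose a definition before registering it with the server – the app names (`facebook-source`, `data-formatter`, `bigtable-sink`) are hypothetical:

```java
import java.util.Map;
import java.util.stream.Collectors;

public class StreamDefinitionBuilder {

    // Renders app options in the --key=value form the stream DSL expects.
    static String options(Map<String, String> props) {
        return props.entrySet().stream()
                .map(e -> "--" + e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(" "));
    }

    // Composes a source | processor | sink definition for one data source.
    static String definition(String source, Map<String, String> sourceProps,
                             String processor, String sink) {
        String opts = options(sourceProps);
        String src = opts.isEmpty() ? source : source + " " + opts;
        return String.join(" | ", src, processor, sink);
    }

    public static void main(String[] args) {
        // The service would submit this string as the "definition" parameter
        // of the Data Flow server's REST API (POST /streams/definitions).
        String dsl = definition("facebook-source",
                Map.of("accessToken", "${FB_TOKEN}"),
                "data-formatter", "bigtable-sink");
        System.out.println(dsl);
        // facebook-source --accessToken=${FB_TOKEN} | data-formatter | bigtable-sink
    }
}
```

Generating this DSL from the UI’s saved configuration is what lets business developers create pipelines without touching the Data Flow server directly.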
It is configured with Kafka streams to support real-time data processing.
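The Kafka binding itself is ordinary Spring Cloud Stream configuration in each stream application; a minimal sketch of what it might look like, where the topic name and broker address are assumptions:

```yaml
spring:
  cloud:
    stream:
      bindings:
        output:
          destination: social-events   # Kafka topic backing this stream hop (assumed name)
      kafka:
        binder:
          brokers: kafka:9092          # Kafka broker address (assumed)
```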
The recommended solution has been implemented; the enterprise achieved end-to-end automation of continuously deploying real-time data services and pipelines into production, driven by business developers with little help from ETL developers. Mission accomplished!
Of course, being data driven is a lot more than competing in data analytics – the journey for this engagement involved many steps including re-architecting their data before they became data-driven.
There are many reasons why an enterprise fails to become data-driven. Irrespective of the excuses and failures, the amount of data continues to rise exponentially. According to the independent research firm IDC, connected IoT devices are expected to generate 79.4 zettabytes of data in 2025, just five years away. So it is time to be aware, nurture data and become a data-driven enterprise.
Originally published in Hackernoon as Compete on Data Analytics using Spring Cloud Data Flow.