Exploratory Data Analysis: A primer on how to make data-driven decisions
More than ever, the ability to access enterprise data and insights is crucial to business success. Businesses are constantly on the hunt for valuable insights that keep them ahead of the competition. Around 90% of business leaders believe that data and analytics are key to their organization’s digital initiatives. Beyond the tools and techniques used for data analysis, it is essential to take the right approach to a dataset.
Exploratory Data Analysis (EDA) is an important aspect of data science that provides a bird’s eye view of the data. Using visual methods, EDA performs the initial analysis and investigation of data to identify patterns, understand the distribution of the dataset and summarize its key characteristics. It is a philosophy of approaching a dataset without preconceived assumptions, letting data scientists and engineers use visual tools to learn from the data. Ideally, it is the first step before applying any statistical tools or techniques to the dataset.
Implementing EDA – Value beyond statistics
EDA is an iterative process: generate questions about the data, search for answers by visualizing the data, and then use what you learn to refine the questions and gain further insights. In this blog, you will learn how to arrive at the right questions to understand data, and why it matters to use the right statistical parameters when creating charts. For clarity, we use a Vietnamese dataset as the running example in this article.
In the late 80s, Vietnam’s economy was on a downward trend and the government began a renovation policy to pull the country out of famine. Using the data prepared by the VLSS (Vietnam Living Standards Survey), they applied EDA to gain insights from the dataset by asking the right questions. As an initial step, they wanted to understand the age distribution of the entire Vietnamese population.
Question 1: What is the age distribution for the Vietnamese population?
EDA through Histogram
Distribution of 28,633 ages from the VLSS data
The age distribution is plotted using a histogram. The x-axis shows the age range and the y-axis shows the frequency (the number of persons). A histogram like this can be plotted with Altair, a Python library built on top of Vega and Vega-Lite.
Calculating the bin size in the Histogram
Choosing the right number of bins brings out information hidden in the data. In this case, using 55 bins shows a distinct division between the young, the middle-aged, and the old. The highest mode is at around 20 years, followed by a second-highest mode at around 40 years. This information is not visible if the number of bins is set to 10.
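The effect of the bin count can be reproduced on a small scale. The sketch below uses only the Python standard library and synthetic ages (not the VLSS data); the helper names `histogram_counts` and `count_peaks` and the bin counts 40 and 4 are illustrative assumptions, chosen so that the coarse binning visibly merges two modes.

```python
def histogram_counts(values, n_bins, lo, hi):
    """Count values in n_bins equal-width bins covering [lo, hi]."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp the index so that v == hi lands in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


def count_peaks(counts):
    """Count bins that are strictly taller than both neighbours."""
    padded = [0] + counts + [0]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i - 1] < padded[i] > padded[i + 1]
    )


# Synthetic ages (NOT the VLSS data): 5 persons at every age from 0 to 79,
# plus extra persons clustered around age 20 and around age 40.
extra = {19: 20, 20: 40, 21: 20, 39: 15, 40: 30, 41: 15}
ages = [age for age in range(80) for _ in range(5 + extra.get(age, 0))]

fine = histogram_counts(ages, n_bins=40, lo=0, hi=80)
coarse = histogram_counts(ages, n_bins=4, lo=0, hi=80)

print("peaks with 40 bins:", count_peaks(fine))
print("peaks with 4 bins:", count_peaks(coarse))
```

With the fine binning the two clusters appear as separate peaks, while the coarse binning folds them into a single broad hump, which is the same loss of detail described for the 10-bin histogram.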
Distribution of 28,633 ages from the VLSS data with 55, 40, 25, and 10 bins
The number of bins brings out additional information from the dataset. Data exploration doesn’t end with plotting; it also involves assigning the right values to the chart’s attributes. When the number of bins is not provided, EDA packages apply a default, which may not display all the information in the dataset. If you know the data well, you can decide the right number of bins to bring out the insights. If you do not know the data, calculate it using one of the statistical methods such as Freedman & Diaconis, Scott, Sturges, or Wand.
Implementation of a statistical method: Freedman & Diaconis
The Freedman & Diaconis rule uses the IQR (the interquartile range of the given distribution) and the CBRT (the cube root of the total count of the given sample) to calculate the bin width: bin_width = 2 * IQR / CBRT(n). To get the number of bins, divide column_range by bin_width. For example, if bin_width = 2 and the ages range from 0 to 100, the histogram has 100 / 2 = 50 bins.
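As a minimal sketch of this calculation, assuming the usual form of the Freedman & Diaconis rule (bin width = 2 * IQR / cube root of n), the following standard-library Python computes both the bin width and the resulting number of bins. The function names are illustrative, and the quartiles come from `statistics.quantiles` with its default method, so other quartile conventions may give slightly different widths.

```python
import math
import statistics


def bins_from_width(column_range, bin_width):
    """Number of bins needed to cover column_range with bins of bin_width."""
    return math.ceil(column_range / bin_width)


def freedman_diaconis(values):
    """Return (bin_width, n_bins) using the Freedman & Diaconis rule."""
    n = len(values)
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles of the sample
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; the rule gives no usable bin width")
    bin_width = 2 * iqr / n ** (1 / 3)  # 2 * IQR / CBRT(n)
    column_range = max(values) - min(values)
    return bin_width, bins_from_width(column_range, bin_width)


# The worked example from the text: width 2 over ages 0 to 100.
print(bins_from_width(100, 2))

# A toy sample (one person at each age from 0 to 100), not the VLSS data.
width, n_bins = freedman_diaconis(list(range(101)))
print(round(width, 2), n_bins)
```

Rounding the bin count up with `math.ceil` is a design choice here, so that the last bin always covers the maximum value.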
EDA through Density Plots – Kernel Density Estimator
Using density plots instead of a histogram removes the complexity of coming up with the right number of bins. A density plot is a variation of the histogram that applies kernel smoothing to the plotted values. The peaks of a density plot show where values are concentrated over the interval. The advantage of a density plot is that it shows the shape of the distribution, which a histogram reveals only when the number of bins is chosen well. Density plots can also be used in anomaly detection and novelty detection: a value that falls in a sparsely populated region is a candidate anomaly.
The Kernel Density of a variable can be calculated using the below expression.
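Assuming the standard kernel density estimator over n observations x_1, ..., x_n (the symbols n and x_i are assumptions; they are not named in the text), the expression can be written as:

```latex
p(x) = \frac{1}{n h} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)
```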
where p(x) is the estimated density at x (the sum of the individual kernel bumps), K is the kernel function, x is the point where the density is estimated and h is the bandwidth. K and h are chosen by the analyst. If they are not known in advance, a statistical rule of thumb helps to determine a starting bandwidth.
The h value obtained this way is not necessarily the optimal bandwidth, but it can be used as a starting value and then decreased until the plot becomes rough. With a bandwidth of 0.5 the curve is rough; with a bandwidth of 2 it becomes smoother.
At a bandwidth of 2, the curve is smooth enough to make the partition of ages into young, middle-aged, and old easy to see, whereas at a bandwidth of 10 the curve is so smooth that it is hard to read this structure from it.
Using the graphical methods above, we have analyzed the age distribution of the Vietnamese population. To gain further insights, we move on to the next question.
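The bandwidth trade-off can be sketched with a few lines of standard-library Python. This is an illustration under stated assumptions, not the article's original code: it assumes a Gaussian kernel, uses Silverman's rule of thumb as one common way to get a starting bandwidth (the article does not say which rule it used), and runs on a tiny synthetic sample with three age clusters rather than the VLSS data.

```python
import math
import statistics


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)


def kde(x, sample, h):
    """Kernel density estimate at x with bandwidth h."""
    return sum(gaussian_kernel((x - xi) / h) for xi in sample) / (len(sample) * h)


def rule_of_thumb_bandwidth(sample):
    """Silverman's rule of thumb: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    q1, _, q3 = statistics.quantiles(sample, n=4)
    spread = min(statistics.stdev(sample), (q3 - q1) / 1.34)
    return 0.9 * spread * len(sample) ** (-0.2)


def count_modes(sample, h, grid):
    """Count strict local maxima of the density curve on the grid."""
    dens = [kde(x, sample, h) for x in grid]
    return sum(
        1 for i in range(1, len(dens) - 1) if dens[i - 1] < dens[i] > dens[i + 1]
    )


# Synthetic ages in three clusters (NOT the VLSS data).
ages = [16, 18, 20, 20, 22, 24, 36, 38, 40, 40, 42, 44, 61, 63, 65, 65, 67, 69]
grid = [i * 0.5 for i in range(171)]  # 0.0, 0.5, ..., 85.0

print("starting bandwidth:", rule_of_thumb_bandwidth(ages))
print("modes at h=0.5:", count_modes(ages, 0.5, grid))
print("modes at h=2.0:", count_modes(ages, 2.0, grid))
```

At the small bandwidth almost every observation produces its own bump, while at a bandwidth of 2 the curve settles into one peak per cluster, mirroring the rough-versus-smooth behaviour described above.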
Question 2: Are there differences in the annual household per capita expenditures between the rural and urban populations in Vietnam?
Expenditures per capita
Since density plots can show more than one variable in the same space, it is easy to compare rural and urban data. The steep peak of the rural curve at low expenditure shows that rural citizens are poorer than urban citizens. As the graph shows, the rural curve is also shifted less to the right than the urban curve. Because rural citizens are concentrated in such a narrow band, a small change in expenditure moves far more rural citizens than urban citizens across the poverty line.
A natural next step is to calculate the mean difference in expenditure per capita between rural and urban citizens.
Unfortunately, the graph also shows that both groups are heavily skewed, so the mean difference is misleading: in a skewed distribution the mean does not fall at the center of the distribution. In this case, reduce the effect of the skew and estimate the central tendency using a Winsorized mean or a trimmed mean.
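The difference between a plain mean and its robust alternatives can be seen in a short sketch. The expenditure figures below are hypothetical, made up for illustration only, and the 10% cut on each tail is an arbitrary assumption; the function names are likewise illustrative.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping the given proportion of values from each tail."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    data = sorted(values)
    k = int(len(data) * proportion)  # number of values cut from each tail
    kept = data[k:len(data) - k]
    return sum(kept) / len(kept)


def winsorized_mean(values, proportion):
    """Mean after replacing each tail with the nearest remaining value."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    data = sorted(values)
    k = int(len(data) * proportion)
    if k > 0:
        data[:k] = [data[k]] * k         # pull the low tail up
        data[-k:] = [data[-k - 1]] * k   # pull the high tail down
    return sum(data) / len(data)


# Hypothetical per capita expenditures with one large outlier in each group.
rural = [80, 90, 100, 100, 110, 110, 120, 130, 150, 900]
urban = [150, 180, 200, 220, 240, 260, 280, 320, 400, 2500]

plain_diff = sum(urban) / len(urban) - sum(rural) / len(rural)
trimmed_diff = trimmed_mean(urban, 0.1) - trimmed_mean(rural, 0.1)
winsorized_diff = winsorized_mean(urban, 0.1) - winsorized_mean(rural, 0.1)

print("plain mean difference:", plain_diff)
print("trimmed mean difference:", trimmed_diff)
print("Winsorized mean difference:", winsorized_diff)
```

In this toy data a single extreme household in each group inflates the plain mean difference, while the trimmed and Winsorized versions stay close to where most households sit.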
To further streamline the decision-making process, it is essential to identify the regional annual expenditures of Vietnamese citizens.
Question 3: Are there differences in the annual household per capita expenditures between the seven Vietnamese regions?
Side-by-side boxplots of per capita expenditures by region
When there are more groups to compare in a distribution, use a box plot. If you use a density plot to show the expenditure per capita, you either draw seven separate plots, one per region, or merge the seven curves into one graph. Either way, it becomes hard to extract information.
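A box plot draws each group from a handful of summary statistics, which is why it scales to many groups. The sketch below computes that five-number summary per group with the Python standard library; the region labels and values are hypothetical placeholders, and `method="inclusive"` is just one of several quartile conventions that plotting libraries use.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a box plot draws."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


def summarize_by_group(records):
    """Group (label, value) pairs by label and summarize each group."""
    groups = {}
    for label, value in records:
        groups.setdefault(label, []).append(value)
    return {label: five_number_summary(vals) for label, vals in groups.items()}


# Hypothetical (region, per capita expenditure) records, not the VLSS data.
records = [
    ("Region A", 100), ("Region A", 120), ("Region A", 140),
    ("Region A", 160), ("Region A", 180),
    ("Region B", 200), ("Region B", 250), ("Region B", 300),
    ("Region B", 350), ("Region B", 900),
]

summary = summarize_by_group(records)
for region, stats in summary.items():
    print(region, stats)
```

Each region reduces to five numbers, so seven regions fit side by side on one axis without the overlapping curves that a merged density plot would produce.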
The emphasis in the example above is not on whether the dataset has all the required samples ready for a machine learning process. The key to gaining valuable insights is asking the right questions. Though some questions can be answered directly by applying a statistical formula, visual interpretation is essential for understanding the distribution of a variable or finding the interdependencies between two variables in the set. The methods mentioned in this article apply to any dataset. With EDA, tune the visualization tools and parameters to surface the correct insights, including the hidden ones, and make the right data-driven decisions.