When starting a blank data science project, the most obvious first step is to gather the correct dataset. This often turns out to be a very time-consuming task because it involves several stages. First of all, the researcher has to translate a very abstract question into something that can be measured. For example, a question from management like “How can we increase profits?” needs to be translated into “What is the optimal pricing strategy?”. This task can be broken down into three subtasks:
1. Defining the target variable y, the quantity we want to predict.
2. Defining the feature matrix X, the measurable inputs that might explain the target.
3. Defining the relationship y = f(X) that the model will eventually approximate.
After structuring the business problem, the researcher usually has to retrieve the data from a database management system. This step is more technically heavy than the previous one because it is not about abstract definitions and mathematics, but about building a pipeline that shapes the raw data into a form useful for analysis.
So, can we jump straight into modelling? No, unfortunately not. Before we dive into that part, we ought to spend some time understanding our cleaned dataset, and this is arguably the most important task.
These are, among other things, the reasons we should focus on the EDA (Exploratory Data Analysis) process. In this article, I am going to provide a template of what such an analysis could include, using a toy dataset. This is one of many ways to conduct EDA; you can find plenty of articles online with far greater detail, but here I aim to highlight the big ideas.
In a later article, I will show how these big ideas can be used to validate our model of choice; once that article is published, I will link it here.
The dataset used in this article is the wine dataset. It is an especially clean and well-known real-life dataset, well suited for prototyping. More specifically, it is a 4898 (rows) × 12 (columns) matrix, with no missing values and no duplicated rows. Each row represents one wine, and the columns are the features of that wine, such as its alcohol content, its acidity, and its pH level (which measures how acidic or basic a solution is). The variable type is either 1 (white), 2 (rose), or 3 (red), and the classes are almost perfectly balanced, which makes the preprocessing stage of the problem easier.
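As a minimal sketch of these sanity checks with pandas, assuming the data lives in a local CSV file called wine.csv (the file name is an assumption; the column names follow the article's dotted naming):

```python
import pandas as pd

# Load the wine dataset; the file name "wine.csv" is an assumption.
wine = pd.read_csv("wine.csv")

# Dimensions of the data matrix (rows x columns).
print(wine.shape)

# Verify that there are no missing values and no duplicated rows.
print(wine.isna().sum().sum())   # total number of missing cells
print(wine.duplicated().sum())   # number of fully duplicated rows

# Check how balanced the three classes of the target variable are.
print(wine["type"].value_counts())
```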
The dataset consists of twelve feature variables, of which only one (quality) is categorical, while all the others are numeric. The target variable (type) is a categorical variable with three levels: type=1, type=2, and type=3. In Figure 1, the target variable type is represented by the orange barplot. The number of rows in each class is balanced; therefore, no preprocessing to upsample (downsample) the underrepresented (overrepresented) classes is needed.
For the variable quality, the majority of cases lie between 5 and 7, which would make the very poor and very good wines difficult to predict. Finally, the features are measured on vastly different scales. For example, pH ranges from 2.7 to 3.6, while free.sulfur.dioxide ranges from 0 to 300. This means that centring and scaling the data matrix is crucial for many algorithms to work properly.
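A hedged sketch of that centring and scaling step with scikit-learn (assuming the target column is named type, as in the rest of the article):

```python
from sklearn.preprocessing import StandardScaler

# Separate the features from the target variable.
X = wine.drop(columns=["type"])
y = wine["type"]

# Centre every feature to mean 0 and scale it to standard deviation 1,
# so that pH (roughly 2.7-3.6) and sulfur dioxide (roughly 0-300)
# contribute on comparable scales.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```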
Extending the univariate analysis of the previous subsection, the bivariate analysis reveals how the features correlate with each other. Figure 2 (a) and Figure 2 (b) convey the same information presented differently: Figure 2 (a) shows the exact values of the linear relationships, while in Figure 2 (b) clear clusters of high and low linear correlation emerge. Some of the clusters present in Figure 2 (b) are expected, others are not. To measure the correlation, Pearson’s correlation coefficient is used, which takes the covariance of two random variables X and y and normalises it by the product of their standard deviations: ρ(X, y) = cov(X, y) / (σ_X · σ_y).
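A minimal sketch of how such a correlation analysis can be computed and visualised; seaborn’s heatmap and clustermap are my assumptions here, not necessarily the tools that produced Figure 2:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise Pearson correlations between all numeric columns.
corr = wine.corr(method="pearson", numeric_only=True)

# Analogous to Figure 2 (a): annotate the exact correlation values.
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", center=0)
plt.show()

# Analogous to Figure 2 (b): reorder rows and columns by similarity
# so that clusters of high and low correlation become visible.
sns.clustermap(corr, cmap="coolwarm", center=0)
plt.show()
```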
For example, Figure 2 (a) shows a strong positive relationship between free.sulfur.dioxide and total.sulfur.dioxide, which is expected, as total.sulfur.dioxide is free.sulfur.dioxide plus the sulfur dioxide bound to other ingredients. Interestingly, the same is not true for volatile.acidity, fixed.acidity, and citric.acid. The feature with the strongest correlations, however, is density, as it correlates strongly, both positively and negatively, with alcohol, type, residual.sugar, and total.sulfur.dioxide. In general, we identify two clusters: one with strong correlations containing type, residual.sugar, density, alcohol, total.sulfur.dioxide, and free.sulfur.dioxide, and a second one with low correlations containing fixed.acidity, pH, citric.acid, sulphates, and volatile.acidity. The model is going to leverage these relationships in order to predict the outcome.
At this point, I would really like to go further than the standard bivariate analysis and plot three variables at a time. This quickly becomes unmanageable: with 12 different features, there are 220 unique combinations of 3 features. I cannot check them one by one even in this simple case, let alone in bigger ones. Hence, we rely on dimensionality reduction techniques. These techniques can become quite advanced, but here we will stick with Principal Component Analysis (PCA). The take-home message is that PCA will rank the most influential combinations of features for us, and we will see that visually!
Using the singular value decomposition to factorise the data matrix X = USVᵀ, we produce three matrices, U, S, and V, each carrying information about some aspect of the dataset. First of all, we can create a graph of the percentage of variance explained by each principal component. In Figure 3, we can see the cumulative variance explained by each additional principal component. This is a critical graph, as we can reduce the dimension of the data matrix by selecting a threshold (for example, 99% of the variance). In this example, it is clear that the first four principal components alone surpass that threshold. The principal components are linear combinations of all the dataset’s columns, and each one is perpendicular to the previous ones.
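A minimal sketch of this step with scikit-learn, which performs the SVD internally (X_scaled is the standardised feature matrix from the earlier snippet):

```python
import numpy as np
from sklearn.decomposition import PCA

# Fit PCA on the centred and scaled feature matrix.
pca = PCA()
scores = pca.fit_transform(X_scaled)

# Cumulative proportion of variance explained, the quantity in Figure 3.
cum_var = np.cumsum(pca.explained_variance_ratio_)
print(cum_var)

# Smallest number of components needed to pass a chosen threshold.
threshold = 0.99
n_components = int(np.searchsorted(cum_var, threshold)) + 1
print(n_components)
```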
Therefore, we can continue the exploration by visualising these four principal components’ weights (loadings) and trying to understand whether these components make sense according to our knowledge of the topic. In Figure 4, the first four principal components are visualised along with the loadings of each column. By focusing on the two most extreme positive and negative values of the loadings, we can interpret the components as new kinds of variables.
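The loadings are directly available from the fitted PCA object; a minimal sketch for pulling out the most extreme positive and negative loading of each of the first four components:

```python
import pandas as pd

# Each row of pca.components_ is a principal component; the columns
# correspond to the original features.
loadings = pd.DataFrame(pca.components_[:4],
                        columns=X.columns,
                        index=[f"PC{i + 1}" for i in range(4)])

# The features with the most extreme positive and negative loadings
# drive the interpretation of each component.
for pc, row in loadings.iterrows():
    print(f"{pc}: most positive = {row.idxmax()}, most negative = {row.idxmin()}")
```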
It is clear that this dimensionality reduction technique produces results that reflect the real world. Sulfur dioxide is a vital component in winemaking, as it regulates bacterial growth among other essential tasks, yet it also gives unpleasant odours and tastes to the wine. PCA immediately captures that reality by assigning the two most significant negative loadings of the first principal component to the sulfur dioxide concentrations (both free and total). In addition, the second most important component for classifying the wines is the origin of that sulfur dioxide: the total sulfur dioxide is the free sulfur dioxide plus the sulfur dioxide bound to other ingredients such as sugars. After investigating how much SO2 a wine has and where it comes from (free vs. bound), the next most important factors have to do with the specific taste of the wine, mainly how sweet and how acidic it is.
With the information revealed by PCA, we can now plot the highest and lowest loadings of the four principal components for each level of the target variable. The reason for plotting this graph is to investigate whether some level of the type variable behaves differently from the others when plotted on the same axes. In all four subplots of Figure 7, we identify the exact same behaviour.
The coefficient of the regression is negative on the total.sulfur.dioxide against alcohol axis for all levels, positive on the total.sulfur.dioxide against free.sulfur.dioxide axis, and close to zero for the other two PC scores. This finding, while not very exciting, is important because it confirms once again that the relationships between our target variable type and the features identified at the dimensionality reduction stage (total.sulfur.dioxide, free.sulfur.dioxide, alcohol, residual.sugar, and fixed.acidity) are important for establishing a good model.
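One way to quantify those per-class slopes is to fit a simple least-squares line for each level of type; numpy’s polyfit is my choice here, an assumption rather than the method behind Figure 7:

```python
import numpy as np

# Slope of alcohol regressed on total sulfur dioxide, per wine type.
for level in sorted(wine["type"].unique()):
    subset = wine[wine["type"] == level]
    slope, intercept = np.polyfit(subset["total.sulfur.dioxide"],
                                  subset["alcohol"], deg=1)
    print(f"type={level}: slope = {slope:.4f}")
```

The same loop, swapping in free.sulfur.dioxide or the PC score columns, reproduces the comparison for the other subplots.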
When exploring a new dataset for the first time, a basic roadmap is the following.
1. Understand the business problem and define the target y, the features X, and the relationship y = f(X).
2. Study each variable on its own (univariate analysis): types, ranges, distributions, class balance, and missing values.
3. Study how the features relate to each other (bivariate analysis), for example through a correlation matrix.
4. Go beyond pairs of variables with a dimensionality reduction technique such as PCA and interpret the leading components.
5. Plot the most informative features or components against the target variable to see whether the classes behave differently.
The reason we go through all this trouble is that now we have collected evidence! This evidence shows us what we expect to be important in our final algorithm. It should really be a surprise if our final statistical learning machine suggests that the variable chlorides is very important, because we never found anything special about it in our EDA. Of course, these are not rules but suggestions, and this is why I call them evidence of what should, and should not, play an important role in our end product.
This is one of the most important goals of exploring the dataset: to help us build an understanding of what should make sense in our upcoming modelling.