Exploratory Data Analysis (EDA) is a critical step in the data analysis pipeline that involves the initial investigation of data to uncover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations. It is a method used to understand and summarize the main characteristics of a dataset, often with visual methods. EDA is used to see what the data can tell us before the modeling or hypothesis testing task.
It is good practice to understand the data first and to gather as many insights from it as possible. EDA is all about making sense of the data in hand before getting your hands dirty with it. It is a comprehensive process whose objective is to understand the nuances of the dataset. EDA involves generating summary statistics for the numerical data in the dataset and creating various graphical representations to understand the data better.
Importance of EDA
EDA is a fundamental step in the data analysis process. It helps to identify the patterns, relationships, or anomalies to inform future studies. EDA also helps to determine the best way to manipulate data sources to get the answers you need, making it easier for you to discover patterns, spot anomalies, frame your hypothesis, and test it. This process can also help you detect mistakes or errors in your data which could potentially save you a lot of time and trouble.
Moreover, EDA provides a critical bridge between the initial raw data analysis and the formal modeling, driving the development of complex models, algorithms, and data systems. By using EDA, data analysts are better equipped to dissect data in a methodical way, which can then be leveraged to create predictive models or inform algorithm creation.
Understanding Data Structures
EDA is essential in understanding the data structures in the dataset. It helps in identifying the variables and observations. Variables are generally the columns in the dataset and observations are the rows. Understanding the data structures helps in manipulating the data in the right way to get the desired results. It also helps in identifying the type of variables in the dataset.
There are two broad types of variables: categorical and numerical. Categorical variables take their values from a limited set of categories. Nominal categories (such as colors) have no inherent order or priority, whereas ordinal categories (such as low, medium, and high) do. Numerical variables have numerical values and can be divided into two categories: discrete and continuous. Discrete variables can only take certain numerical values, such as counts, while continuous variables can take any numerical value within a certain range.
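As a minimal sketch of the distinction above, the following Python snippet guesses the type of each variable in a small, made-up dataset. The dataset, the variable names, and the helper `infer_variable_type` are all hypothetical, and the rule it applies (integers are discrete, other numbers are continuous, everything else is categorical) is only a rough heuristic: an integer code can just as well stand for a category.

```python
from numbers import Real

# Hypothetical toy dataset: each dict is one observation (row),
# each key is one variable (column).
rows = [
    {"city": "Oslo", "rooms": 3, "price": 410.5},
    {"city": "Lima", "rooms": 2, "price": 180.0},
    {"city": "Oslo", "rooms": 4, "price": 525.25},
]

def infer_variable_type(values):
    """Rough heuristic (an assumption, not a rule): guess the type from Python types."""
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "numerical (discrete)"
    if all(isinstance(v, Real) and not isinstance(v, bool) for v in values):
        return "numerical (continuous)"
    return "categorical"

# Collect each column and classify it.
variable_types = {
    name: infer_variable_type([row[name] for row in rows])
    for name in rows[0]
}
print(variable_types)
```

In practice the guess should always be checked against what the variable actually means in the dataset.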
Summary Statistics
Summary statistics include measures such as the mean, median, mode, standard deviation, and range. These measures help to understand the central tendency and dispersion of the data. The mean gives the average value of the data, the median gives the middle value, and the mode gives the most frequently occurring value in the dataset. The standard deviation measures how much the values vary around the mean. The range gives the difference between the highest and the lowest value in the dataset.
Summary statistics are very useful in understanding the overall distribution of the data. They give a quick snapshot of the data and help to describe the basic features of the data. In EDA, summary statistics are the first step to understand the data and drive the further process of data analysis.
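The measures described in this section can be computed with Python's standard `statistics` module. The sketch below uses a small made-up sample, and it assumes the population form of the standard deviation (`pstdev`); `statistics.stdev` gives the sample form instead.

```python
import statistics

# Hypothetical sample of one numerical variable.
values = [2, 4, 4, 4, 5, 5, 7, 9]

summary = {
    "mean": statistics.mean(values),        # average value
    "median": statistics.median(values),    # middle value
    "mode": statistics.mode(values),        # most frequent value
    "std_dev": statistics.pstdev(values),   # population standard deviation (assumption)
    "range": max(values) - min(values),     # highest minus lowest value
}

for name, value in summary.items():
    print(f"{name}: {value}")
```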
Data Visualization
Data visualization is a key part of EDA. It involves creating and studying graphical or pictorial representations of data. Visualizations help to reveal trends, patterns, and correlations, and they make it easier to find outliers in the dataset.
Data visualization is a quick and easy way to convey concepts in a universal manner. Complex data and relationships between multiple variables are often much easier to grasp in visual form. You can use different types of graphs to visualize your data, such as bar graphs, histograms, and scatter plots. The type of graph you choose will depend on the kind of data you have and the question you are trying to answer.
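To illustrate what a histogram actually shows, here is a minimal sketch that bins a made-up numerical variable and prints the counts as a text chart. The data, the bin width, and the helper `histogram_counts` are assumptions chosen for the example; a plotting library would draw the same counts as bars.

```python
BIN_WIDTH = 10  # assumed bin width for this example

def histogram_counts(values, bin_width):
    """Count how many values fall in each bin [left, left + bin_width)."""
    counts = {}
    for v in values:
        left = (v // bin_width) * bin_width  # left edge of the bin containing v
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical ages of ten respondents.
ages = [21, 22, 25, 31, 33, 34, 35, 38, 47, 52]
counts = histogram_counts(ages, BIN_WIDTH)

# One row per bin, one '#' per observation.
for left, n in counts.items():
    print(f"[{left}, {left + BIN_WIDTH}) {'#' * n}")
```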
Univariate Analysis
Univariate analysis is the simplest form of data analysis where the data being analyzed contains only one variable. Since it’s a single variable it doesn’t deal with causes or relationships. The main purpose of univariate analysis is to describe the data and find patterns that exist within it. The patterns can be in the form of variability or central tendency.
Common techniques used in univariate analysis include data visualization tools like bar charts and pie charts, and summary statistics like mean, median, mode, range, variance, maximum, minimum, quartiles, and standard deviation. Univariate analysis is used to outline patterns and find outliers in the preliminary investigation stage.
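A minimal univariate sketch, using made-up data: a frequency table is the information a bar chart or pie chart of a categorical variable displays, and the quartiles, minimum, and maximum describe a numerical variable. Note that several conventions exist for computing quartiles; the `method="inclusive"` setting below is one choice among them, not the only valid one.

```python
import statistics
from collections import Counter

# Hypothetical single-variable samples.
colors = ["red", "blue", "red", "green", "red", "blue"]
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Categorical variable: count how often each category occurs.
frequency = Counter(colors)

# Numerical variable: quartiles, extremes, and variance.
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
five_number_summary = (min(scores), q1, q2, q3, max(scores))
variance = statistics.pvariance(scores)  # population variance (assumption)

print(frequency.most_common())
print(five_number_summary, variance)
```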
Bivariate Analysis
Bivariate analysis is used to find out whether there is a relationship between two sets of values, usually denoted X and Y. It compares the two variables to measure their correlation, to explore a possible cause-and-effect relationship, or to make predictions. A correlation on its own does not establish causation. Bivariate analysis usually involves scatterplots, correlation coefficients, and regression lines.
It can reveal trends, correlations, patterns, or relationships between the two variables. This can be useful for testing simple hypotheses of association. Bivariate analysis can be a springboard for more complex statistical analyses when patterns in data suggest that more complex models are required.
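As a sketch of the two numerical tools mentioned above, the snippet below computes the Pearson correlation coefficient and the least-squares regression line by hand for two made-up variables. The data and the function names are hypothetical; the formulas are the standard ones based on sums of deviations from the means.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def regression_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
slope, intercept = regression_line(x, y)
print(f"r = {r:.3f}, y = {intercept:.2f} + {slope:.2f} * x")
```

A value of r near 1 or -1 suggests a strong linear association, while a value near 0 suggests little linear association; a scatterplot is still worth checking, because r only captures linear patterns.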
Handling Missing Data
Handling missing data is an important step in the EDA process. Missing data in the dataset can lead to biased or incorrect predictions, so it is necessary to handle missing values appropriately. There are several strategies for handling missing data, including deletion, imputation, and prediction.
Deletion is the process of deleting the rows with missing data. This is the simplest way to handle missing data but it is not generally advised as it can lead to loss of information. Imputation is the process of substituting the missing data with some statistical measures like mean, median, mode, etc. Prediction is a method where missing values are predicted based on other data.
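Before choosing among these strategies, it usually helps to see how much data is missing and where. The sketch below counts missing values per variable in a made-up dataset, under the assumption that a missing value is represented as `None`; real datasets may use other markers such as NaN or empty strings.

```python
# Hypothetical dataset; None marks a missing value (assumption).
rows = [
    {"age": 34, "income": 52000, "city": "Oslo"},
    {"age": None, "income": 48000, "city": "Lima"},
    {"age": 29, "income": None, "city": None},
    {"age": 41, "income": 61000, "city": "Oslo"},
]

def missing_counts(rows):
    """Number of missing values for each variable."""
    counts = {name: 0 for name in rows[0]}
    for row in rows:
        for name, value in row.items():
            if value is None:
                counts[name] += 1
    return counts

counts = missing_counts(rows)
# Share of observations that lack each variable.
missing_share = {name: n / len(rows) for name, n in counts.items()}
print(counts)
print(missing_share)
```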
Imputation Techniques
Imputation is a method to fill in the missing values with estimated ones. The objective is to employ known relationships that can be identified in the valid values of the data set to assist in estimating the missing values. Mean or median imputation is one of the most common methods of imputation. In this method, the mean or median of the non-missing values is calculated and used to replace the missing values.
Another method of imputation is regression imputation. In this method, a regression model is used to predict the missing values. The variable with missing data is taken as the dependent variable and the remaining variables are taken as independent variables. The regression model is built based on the complete cases and then used to predict the missing values.
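The two approaches can be compared on a tiny made-up example. In this sketch, `x` is fully observed and `y` has one missing value marked as `None`; both the data and the helper names are hypothetical, and the regression uses a single independent variable for brevity.

```python
import statistics

# Hypothetical data: x is complete, y has one missing value (None).
x = [1.0, 2.0, 3.0, 5.0, 6.0]
y = [2.0, 4.0, 6.0, 10.0, None]

observed = [v for v in y if v is not None]

def impute_constant(values, fill):
    """Replace every missing value with the same fill value."""
    return [fill if v is None else v for v in values]

# Mean and median imputation use only the non-missing values of y.
y_mean = impute_constant(y, statistics.mean(observed))
y_median = impute_constant(y, statistics.median(observed))

# Regression imputation: fit y = intercept + slope * x on the complete cases.
complete = [(a, b) for a, b in zip(x, y) if b is not None]
mx = statistics.mean(a for a, _ in complete)
my = statistics.mean(b for _, b in complete)
slope = (sum((a - mx) * (b - my) for a, b in complete)
         / sum((a - mx) ** 2 for a, _ in complete))
intercept = my - slope * mx

# Predict only where y is missing; keep observed values unchanged.
y_regression = [intercept + slope * a if b is None else b for a, b in zip(x, y)]

print(y_mean[-1], y_median[-1], y_regression[-1])
```

Here the mean and median fill in a typical value of `y`, while the regression uses the relationship with `x` and therefore fills in a much larger value for the observation with the largest `x`.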
Deletion Techniques
Deletion techniques are often used when the data are “Missing Completely at Random” (MCAR), which means that the missingness is independent of every variable in the dataset. In this case, deleting observations with missing data will not create bias, although it does reduce the sample size. Listwise deletion (or complete case analysis) and pairwise deletion are the two main types of deletion techniques.
Listwise deletion simply excludes every case in which one or more values are missing. Pairwise deletion excludes a case only from the analyses that need the variable it is missing, so the case still contributes wherever its values are present. Pairwise deletion is sometimes suggested for data that are “Missing at Random”, but, like listwise deletion, it is safest when the data are MCAR.
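The difference between the two techniques shows up in how many cases survive. The following sketch uses a made-up dataset with `None` as the missing-value marker (an assumption) and two hypothetical helper functions.

```python
# Hypothetical dataset with three variables; None marks a missing value.
rows = [
    {"a": 1.0, "b": 2.0, "c": 3.0},
    {"a": 2.0, "b": None, "c": 5.0},
    {"a": 3.0, "b": 6.0, "c": None},
    {"a": 4.0, "b": 8.0, "c": 9.0},
]

def listwise_delete(rows):
    """Keep only the cases with no missing value in any variable."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise_cases(rows, first, second):
    """Keep the cases where both variables needed by one analysis are present."""
    return [
        (r[first], r[second])
        for r in rows
        if r[first] is not None and r[second] is not None
    ]

complete_cases = listwise_delete(rows)
pairs_ab = pairwise_cases(rows, "a", "b")
pairs_ac = pairwise_cases(rows, "a", "c")

print(len(complete_cases), len(pairs_ab), len(pairs_ac))
```

Listwise deletion leaves two cases for every analysis, while pairwise deletion leaves three cases for each of the two pairs shown, at the cost of each analysis being based on a different subset of the data.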
Outlier Detection
Outliers are extreme values that deviate from the other observations in the data. They may indicate variability in a measurement, experimental errors, or a novelty. In other words, an outlier is an observation that diverges from the overall pattern of a sample. Outliers can be of two types: univariate and multivariate.
Univariate outliers can be found by looking at the distribution of a single variable. Multivariate outliers are outliers in an n-dimensional space; to find them, you have to look at distributions in multiple dimensions. Outlier detection can be performed using scatter plots, box plots, and the Z-score method.
Z-Score Method
The Z-score is a mathematical measurement that describes a value’s relationship to the mean of a group of values. It is measured in terms of standard deviations from the mean. If a Z-score is 0, it indicates that the data point’s score is identical to the mean score. A Z-score of 1.0 would indicate a value that is one standard deviation from the mean.
Z-scores may be positive or negative, with a positive value indicating that the score is above the mean and a negative value indicating that it is below the mean. In most cases, a threshold of 3 in absolute value is used to identify outliers: a data point whose Z-score falls below -3 or rises above 3 can be considered an outlier.
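A minimal sketch of the method, assuming a made-up sample of twenty measurements and the population standard deviation: each value is converted to a Z-score, and values whose Z-score exceeds the threshold of 3 in absolute value are flagged.

```python
import statistics

THRESHOLD = 3.0  # the cut-off described in the text

# Hypothetical measurements; 45 is far from the rest.
values = [10, 12, 11, 9, 10, 11, 12, 10, 9, 11,
          10, 12, 9, 10, 11, 10, 9, 11, 10, 45]

mean = statistics.mean(values)
std_dev = statistics.pstdev(values)  # population standard deviation (assumption)

# Z-score: distance from the mean in units of standard deviation.
z_scores = [(v - mean) / std_dev for v in values]

outliers = [v for v, z in zip(values, z_scores) if abs(z) > THRESHOLD]
print(outliers)
```

Because the mean and the standard deviation are themselves pulled toward extreme values, the threshold is best treated as a rule of thumb rather than a strict test.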
Box Plot Method
A box plot is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending from the boxes indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. Outliers may be plotted as individual points.
Box plots are non-parametric: they display variation in samples of a statistical population without making any assumptions of the underlying statistical distribution. The spacings between the different parts of the box indicate the degree of dispersion (spread) and skewness in the data, and show outliers.
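The points a box plot draws individually are commonly chosen with the interquartile range (IQR) rule. The text above does not specify a rule, so the multiplier of 1.5 used in this sketch is an assumption based on the widespread convention, and the data are made up. Quartile conventions also differ between tools, so the exact fences may vary slightly.

```python
import statistics

IQR_MULTIPLIER = 1.5  # conventional whisker length (assumption, not from the text)

# Hypothetical measurements; the order does not matter.
values = [5, 1, 30, 3, 7, 2, 8, 4, 6]

q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # height of the box

# Values beyond the fences are treated as outliers.
lower_fence = q1 - IQR_MULTIPLIER * iqr
upper_fence = q3 + IQR_MULTIPLIER * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(q1, median, q3, lower_fence, upper_fence, outliers)
```

Unlike the Z-score method, this rule relies on quartiles rather than the mean and standard deviation, so a single extreme value has little influence on the fences.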
Conclusion
Exploratory Data Analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. EDA is used for seeing what the data can tell us before the modeling task. It is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial stages of EDA, you should feel free to investigate every idea that occurs to you. Some of these will pan out, and some will be dead ends. As you get deeper into the analysis, you can start to narrow your focus.
EDA is a critical first step in analyzing the data from an experiment. Here the focus is on making sense of the data in hand, not on carrying out a robust hypothesis test. The purpose of EDA is to use summary statistics and visualizations to better understand the data, to find clues about its tendencies and quality, and to formulate the assumptions and hypotheses of the analysis.