Principal Component Analysis (PCA): Data Analysis Explained

Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. This technique is widely used in data analysis and is a fundamental tool in the field of machine learning and data science.

The main goal of PCA is to identify patterns in data and to express the data in a way that highlights their similarities and differences. Because patterns can be hard to find in high-dimensional data, where the luxury of graphical representation is not available, PCA is a powerful tool for analysis.

Understanding Principal Component Analysis

PCA is a dimensionality reduction or data compression method. The goal in dimensionality reduction is to preserve as much of the important structure of the data as possible while packing it into fewer dimensions and discarding the “noise”.

It is not an algorithm that you can use to classify or cluster data; it is a way to change your perspective on the data, to look at it from a different angle, where the features are uncorrelated. This is done to understand the data better and to improve the performance of other algorithms (such as regression, classification, or clustering) that you might want to apply to the data.
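
As a minimal sketch of that workflow, assuming scikit-learn (the digits dataset and the choice of classifier are illustrative, not prescribed by this article), the following pipeline standardizes the data, reduces it with PCA, and feeds the result to a classifier:

```python
# Sketch: PCA as a preprocessing step before a downstream classifier.
# Dataset and model choices are illustrative assumptions.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardize, project onto the top 30 principal components, then classify.
model = make_pipeline(StandardScaler(), PCA(n_components=30),
                      LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```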

How PCA Works

PCA works by identifying the hyperplane that lies closest to the data and then projecting the data onto it. The axis along which the projected data retains the most variance is the first principal component. The orthogonal axis in the hyperplane that accounts for the next-largest share of variance is the second principal component. For higher-dimensional datasets, PCA finds a third component orthogonal to the first two, and so forth.

Each principal component is a straight line (an axis) through the data, and the first principal component captures the most variance in the data. Each succeeding component accounts for the highest possible remaining variance while being orthogonal to the preceding components. This reduces the dimensionality of the data while retaining most of the variation in it.
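
As a small check of these properties, the following sketch (assuming NumPy and scikit-learn, with purely hypothetical random data) fits a PCA and verifies that the resulting axes are orthonormal and ordered by decreasing variance:

```python
# Sketch: fitted components are orthonormal axes, ordered by the
# variance they capture. The data here is random and illustrative.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.5, 0.2, 0.1]])

pca = PCA().fit(X)
W = pca.components_              # rows are the principal axes

print(np.round(W @ W.T, 6))      # ~identity matrix: axes are orthonormal
print(pca.explained_variance_)   # strictly decreasing variances
```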

Applications of PCA

PCA is predominantly used in exploratory data analysis and in building predictive models. It is used for finding patterns in high-dimensional data, visualizing genetic distance and relatedness among populations, computer graphics, stock market analysis, climate analysis, and more.

PCA is also used in finance to construct portfolios of stocks. In neuroscience it is a common method for spike sorting, and it is a powerful tool in image compression and recognition.

Steps in Principal Component Analysis

PCA involves a mathematical procedure that transforms a number of correlated variables into a smaller number of uncorrelated variables called principal components. The transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component has the highest possible variance under the constraint that it is orthogonal to the preceding components.

The steps involved in PCA are the standardization of the data, the calculation of the data covariance matrix, the calculation of the eigenvalues and eigenvectors of the covariance matrix, the sorting of the eigenvalues and their corresponding eigenvectors, the computation of the explained and cumulative variance, and finally the projection of the data onto the selected components.
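
The following sketch walks through those steps in plain NumPy; the input data is random and purely illustrative:

```python
# Sketch of the PCA steps listed above in plain NumPy.
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))    # illustrative (n_samples, n_features) data

# 1. Standardize each variable to mean 0, standard deviation 1.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized data (features in columns).
C = np.cov(Z, rowvar=False)

# 3. Eigenvalues and eigenvectors (eigh suits symmetric matrices).
eigenvalues, eigenvectors = np.linalg.eigh(C)

# 4. Sort by eigenvalue, largest first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 5. Explained and cumulative variance.
explained = eigenvalues / eigenvalues.sum()
print("explained:", explained)
print("cumulative:", np.cumsum(explained))

# 6. Project the data onto the first k components.
k = 2
scores = Z @ eigenvectors[:, :k]
```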

Data Standardization

The first step in PCA is to standardize the range of the continuous initial variables so that each one of them contributes equally to the analysis. Standardization is critical because the initial variables are often measured on different scales; without it, variables with larger ranges would dominate the components.

Standardization involves rescaling the variables to have a mean of zero and a standard deviation of one: determine the mean and standard deviation of each variable, subtract the mean from every value, and divide the result by the standard deviation.
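
A minimal sketch of this calculation, assuming NumPy and scikit-learn (the toy measurements are invented for illustration):

```python
# Sketch: z-score standardization by hand and via scikit-learn's
# StandardScaler; any equivalent rescaling works.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[170.0, 65.0],
              [180.0, 80.0],
              [160.0, 55.0]])    # toy data, two variables on different scales

Z_manual = (X - X.mean(axis=0)) / X.std(axis=0)
Z_sklearn = StandardScaler().fit_transform(X)

assert np.allclose(Z_manual, Z_sklearn)
print(Z_manual.mean(axis=0), Z_manual.std(axis=0))  # ~0 and ~1 per column
```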

Computing the Covariance Matrix

The next step is to compute the covariance matrix of the features in the dataset. Covariance is a measure of the extent to which corresponding elements from two sets of ordered data move in the same direction. We calculate the covariance matrix because it is the data structure used in the eigendecomposition in the following step.

The covariance matrix is a p × p symmetric matrix (where p is the number of dimensions) that has as entries the covariances associated with all possible pairs of the initial variables. It is symmetric because the covariance between two variables x_i and x_j is equal to the covariance between x_j and x_i.
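
A short sketch, assuming NumPy, that computes a covariance matrix and confirms its shape and symmetry on illustrative random data:

```python
# Sketch: the p x p covariance matrix and its symmetry.
import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(size=(50, 3))     # 50 observations, p = 3 variables

C = np.cov(Z, rowvar=False)      # each column treated as a variable
print(C.shape)                   # (3, 3)
print(np.allclose(C, C.T))       # True: cov(x_i, x_j) == cov(x_j, x_i)
```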

Interpreting Principal Component Analysis

Interpreting PCA requires an understanding of variance, covariance, eigenvalues, and eigenvectors. The principal components themselves are less interpretable than the original variables and have no direct real-world meaning, since they are constructed as linear combinations of the initial variables.

Graphical representations are often used to better understand and interpret PCA. These can be a scatter plot of the first two principal components, or a scree plot of the explained variance.
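
For example, the following sketch (matplotlib and scikit-learn's Iris dataset are assumed here as illustrative choices) draws a scatter plot of the first two principal components:

```python
# Sketch: scatter plot of the first two principal components.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
scores = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

plt.scatter(scores[:, 0], scores[:, 1], c=y)  # color by class for context
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```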

Scree Plot

A scree plot is a simple line plot that shows the fraction of the total variance in the data explained by each principal component. The components are arranged in decreasing order of their eigenvalues, and each eigenvalue represents the amount of variance in the total sample accounted for by its component.

The scree plot helps you to determine the optimal number of principal components that should be retained in order to describe the data. This can be determined by finding the “elbow” in the graph, the point where the explained variance of additional components levels off.
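
A sketch of such a plot, assuming matplotlib and scikit-learn (the Iris dataset is again an illustrative choice), showing both the per-component and cumulative explained variance:

```python
# Sketch: scree plot with a cumulative explained-variance curve.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
pca = PCA().fit(StandardScaler().fit_transform(X))

ratios = pca.explained_variance_ratio_
ks = np.arange(1, len(ratios) + 1)
plt.plot(ks, ratios, "o-", label="per component")
plt.plot(ks, np.cumsum(ratios), "s--", label="cumulative")
plt.xlabel("principal component")
plt.ylabel("fraction of variance explained")
plt.legend()
plt.show()
```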

Loadings

Loadings are the weights by which each standardized original variable should be multiplied to get the component score. Loadings can be interpreted as the correlations between the original variables and the component.

Loadings are a measure of how much the original variables contribute to the component. The square of the loading is the percent of the variance in that variable explained by the component. Therefore, high loadings (either positive or negative) indicate that the component strongly depends on that variable.
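
A minimal sketch, assuming scikit-learn and standardized inputs, of computing loadings from a fitted PCA; scaling each eigenvector by the square root of its eigenvalue yields the correlation interpretation described above:

```python
# Sketch: loadings from a fitted PCA on standardized data.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
Z = StandardScaler().fit_transform(X)
pca = PCA().fit(Z)

# Columns of `loadings` correspond to components, rows to variables.
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
print(loadings[:, 0])       # each variable's loading on PC1
print(loadings[:, 0] ** 2)  # approximate share of each variable's
                            # variance explained by PC1
```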

Advantages and Disadvantages of PCA

PCA has many advantages, such as the ability to handle data of high dimensionality and the ability to reveal relationships between variables. It also reduces the dimensionality of the data set, which can simplify other analyses.

However, PCA also has some disadvantages, the main one being that the principal components are less interpretable than the original variables. Both sides of this trade-off are discussed below.

Advantages of PCA

PCA can reveal relationships between variables that were not originally apparent. It can also confirm suspected relationships. PCA reduces the dimensionality of the data set, simplifying other analyses. It also handles data of high dimensionality, which can be difficult to manage otherwise.

PCA can also be used to filter noise from the data and to find patterns in it. By reducing the dimensionality, it can simplify the task of building a model to describe the data, and the resulting model will tend to be more robust and less likely to overfit.
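
As a sketch of that noise-filtering use, assuming scikit-learn and NumPy (the synthetic signal and noise level are made up for illustration), one can keep only the dominant component and reconstruct the data from it:

```python
# Sketch: noise filtering by projecting onto the dominant component
# and reconstructing. The toy data has one underlying direction.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
signal = np.outer(np.sin(np.linspace(0, 4, 200)), rng.normal(size=10))
noisy = signal + rng.normal(scale=0.3, size=signal.shape)

pca = PCA(n_components=1)    # the toy data is rank one, so keep one axis
denoised = pca.inverse_transform(pca.fit_transform(noisy))

# Reconstruction should sit closer to the clean signal than the noisy data.
print(np.mean((denoised - signal) ** 2) < np.mean((noisy - signal) ** 2))
```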

Disadvantages of PCA

The main disadvantage of PCA is that the principal components are not as interpretable as the original variables. They have no direct real-world meaning, since they are constructed as linear combinations of the initial variables, which can make the results of a PCA difficult to explain.

Another disadvantage is that PCA assumes the important structure in the data is linear: the components are, by construction, orthogonal linear combinations of the original variables. If the underlying relationships in the data are nonlinear, PCA may not give the best dimensionality reduction.

Conclusion

Principal Component Analysis is a powerful statistical tool that is widely used in data analysis and visualization. It is a method that uses simple matrix operations from linear algebra and statistics to calculate a projection of the original data into the same number or fewer dimensions.

PCA is a versatile tool in the field of data analysis. Whether it is used for data visualization, noise filtering, feature extraction and engineering, or even for building predictive models, PCA has its place in every data scientist’s toolbox.