Principal Component Analysis: A Brief Understanding


Principal Component Analysis (PCA) is an unsupervised technique for reducing the dimensionality of data. The idea behind PCA is to find the most accurate representation of the data in a lower-dimensional space: the new axes are chosen so that moving along each axis means moving along a direction in which the data actually varies. Ideally, all of the variation present in the high-dimensional data would be captured in the lower dimension. In practice this is rarely possible, since a lower-dimensional representation cannot preserve every direction of variation, so PCA settles for preserving the largest variances in the data.

Consider the linear combination of the variables:

C = w1 * y1 + w2 * y2 + w3 * y3

where

C: the component score, a consolidated representation of the features

w1, w2, w3: the PCA component loadings

y1, y2, y3: the scaled features
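To make the formula concrete, here is a minimal Python sketch. The feature values and the loading vector w are made up purely for illustration (in practice the loadings come from the eigenvectors computed below); the point is only that a component score is this weighted sum.

```python
import numpy as np

# Three standardized features for a handful of observations (made-up values)
y = np.array([
    [ 0.5, -1.2,  0.3],
    [-0.7,  0.4,  1.1],
    [ 1.0,  0.8, -0.9],
])

# Hypothetical loading vector (roughly unit length, as real loadings are)
w = np.array([0.6, -0.3, 0.74])

# Component score: the linear combination C = w1*y1 + w2*y2 + w3*y3
C = y @ w
print(C)  # one consolidated value per observation
```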

Most of the variability in the data is captured by PC1 (Principal Component 1), and the residual variability is captured by PC2 (Principal Component 2), which is orthogonal to PC1. PC2 captures the largest share of the variation that PC1 leaves behind. PC1 and PC2 have zero correlation.

Steps for Performing PCA:

1. Standardize the data, i.e., center it on the origin (and scale each feature to unit variance).

2. Generate the covariance/correlation matrix for all the dimensions.

The covariance/correlation matrix captures how the variables vary together in the original dimensions.

3. Decompose the covariance/correlation matrix into a new set of coordinate axes that rotate the dataset so that the rotated version captures most of the variability. These rotated axes are called eigenvectors, and the corresponding eigenvalues give the magnitude of the variance captured along each axis.

4. Sort the eigenpairs in descending order of the eigenvalues and select the one with the largest value. This is PC1, which captures the maximum information from the original data.

5. Finally, use a scree plot to decide how many principal components are useful. The more components retained, the more variance is explained; the fewer retained, the greater the dimensionality reduction and compression of the data. (A minimal sketch of these five steps follows below.)
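Here is a minimal numpy sketch of the five steps above on a toy dataset; the data generation and seed are fabricated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))  # toy correlated data

# Step 1: standardize (center on the origin, unit variance)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized data
cov = np.cov(X_std, rowvar=False)

# Step 3: eigendecomposition -- eigenvectors are the rotated axes,
# eigenvalues are the variance captured along each of them
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 4: sort eigenpairs by descending eigenvalue; the first is PC1
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 5: proportion of variance explained (the values a scree plot shows)
explained = eigvals / eigvals.sum()
print(explained)

# Project the data onto the principal components
scores = X_std @ eigvecs
```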

When do we apply PCA?

1. To reduce the dimensions or features of the data

2. Pattern recognition based on the features of the data

3. To resolve the multicollinearity issue

When the independent variables are highly correlated with each other, the regression coefficients lose their stability and interpretability; this is the multicollinearity issue.

Consider the following equation:

Y = β0 + β1 * PC1 + β2 * PC2

Where PC1 & PC2 are independent. Therefore, by definition, PCA solves the multicollinearity problem by creating two features that are independent of each other.

Signal to Noise Ratio:

The variation along the line that captures the variability is called the signal, and everything around that line is noise. The noise represents the aspects of the data that the signal fails to pick up. From PC1's point of view this leftover variation is random noise, but the same variation is the signal that PC2 goes after.

In other words, PCA is a sequential way of extracting signals from the data: as we keep separating signal from noise, we extract one principal component after another. The more signal extracted, the better the PCA performs.

The quality of signal extraction is measured by the Signal to Noise Ratio (SNR).

A greater SNR means PCA can extract the signal from the data using fewer dimensions.
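PCA does not report an SNR figure directly; a common proxy, sketched here on synthetic data, is the explained-variance ratio. A dominant first value means most of the signal was extracted in very few dimensions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
signal = np.outer(rng.normal(size=300), [1.0, 0.8, 0.5])  # strong one-dimensional signal
noise = rng.normal(scale=0.2, size=(300, 3))              # everything around it
X = signal + noise

pca = PCA().fit(X)
# Share of total variance captured by each component; a dominant first
# value means the signal was extracted in few dimensions (high "SNR")
print(pca.explained_variance_ratio_)
```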

Improving SNR through PCA:

It is important to center the data, i.e., subtract the mean from every point on both dimensions: (xi - x̄) and (yi - ȳ). This moves the origin of the space from (x̄, ȳ) to (0, 0), so the center of the data coincides with (0, 0) of the coordinate system. Even as we rotate the axes around this new coordinate system, the center stays fixed at (0, 0). Centering is therefore crucial: the rotation does not distort the values themselves, and it lets us capture the variation while reducing the total error of representation. This is what maximizes the SNR.
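A small numpy sketch of this point on made-up two-dimensional data: after subtracting the means, the data is centered at (0, 0), and a rotation about the origin leaves that center unchanged.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(loc=[10.0, -5.0], scale=[2.0, 1.0], size=(500, 2))

# Subtract the mean on both dimensions: (xi - x_bar), (yi - y_bar)
X_centered = X - X.mean(axis=0)
print(X_centered.mean(axis=0))  # ~ (0, 0): the data is now centered on the origin

# A rotation about the origin now leaves the center fixed at (0, 0)
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X_rotated = X_centered @ R.T
print(X_rotated.mean(axis=0))   # still ~ (0, 0)
```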

Performance Issues of PCA:

1. PCA’s effectiveness depends on the scales of the attributes. If the attributes have very different scales, PCA will simply pick the variables with the highest variance rather than picking up the correlation structure (see the sketch after this list).

2. Changing the scales of the variables changes the PCA.

3. Interpreting PCA can become challenging in the presence of discrete data, since scaling discrete data is difficult.

4. Skewness in the data, with a long thick tail, can impact the effectiveness of PCA. Variance is the squared standard deviation and is a symmetric notion, whereas skewness means the distribution is not symmetric. Skewness therefore distorts the notion of variance, and with it the basis of PCA.

5. PCA, in general, assumes a linear relationship between attributes and is ineffective when the relationships are non-linear. There are versions of PCA (e.g., kernel PCA) that can capture non-linear relationships, but standard PCA cannot.
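The sketch referenced in issue 1 above, on fabricated data: without standardization, PC1 simply follows the attribute with the largest variance; after standardization, it reflects the correlation structure instead.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
# Two correlated attributes on wildly different scales
a = rng.normal(scale=1.0, size=200)            # small variance
b = 1000 * a + rng.normal(scale=50, size=200)  # huge variance
X = np.column_stack([a, b])

# Without scaling, PC1 is dominated by the high-variance attribute
print(PCA(n_components=1).fit(X).components_)

# After standardizing, PCA reflects the correlation structure instead
X_std = StandardScaler().fit_transform(X)
print(PCA(n_components=1).fit(X_std).components_)
```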
