Dimension Reduction with PCA

-- by Charlie Chengrui Zheng 01/20/21

When you have an extremely large dataset, training and prediction with a machine learning model become slow. How can you speed up this process? One way is to reduce the dimensionality of the feature set.

Principal component analysis (PCA) is the most common method for dimension reduction. In this article, I will use PCA to showcase dimension reduction on the 'banknote authentication' dataset from the UCI Machine Learning Repository.

Dataset

This dataset contains 1372 rows × 5 columns. The last column, 'class', is the target, and the first 4 columns are the 4 dimensions of the feature set. For a description of the columns, please refer to the repository link above. This is a great dataset for binary classification because the class is either 0 or 1. However, I will not focus on classifying this dataset but on reducing the dimensionality of its feature set.
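A minimal loading sketch, assuming the raw file data_banknote_authentication.txt has been downloaded from the repository; the column names follow the attribute descriptions on the repository page:

```python
import pandas as pd

# The raw UCI file ships without a header row; these names follow
# the attribute descriptions on the repository page.
cols = ['variance', 'skewness', 'curtosis', 'entropy', 'class']
df = pd.read_csv('data_banknote_authentication.txt', header=None, names=cols)
print(df.shape)  # (1372, 5)
```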

To preprocess this dataset, we first divide it according to the 2 classes, as in the sketch below.
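One way to do this split with boolean indexing (variable names here are my own, not from the original code):

```python
# Separate the features from the target, then split the rows by class label.
X = df.drop(columns='class').to_numpy()
y = df['class'].to_numpy()
X_class0 = X[y == 0]
X_class1 = X[y == 1]
```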

Since the values of the feature set are continuous, we can assume these data are Gaussian-distributed. We then standardize the feature set so that each feature has zero mean and unit variance.
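A standardization sketch using scikit-learn (assuming X is the full 1372 × 4 feature set from above):

```python
from sklearn.preprocessing import StandardScaler

# Rescale each feature to zero mean and unit variance so that
# features measured on larger scales do not dominate the PCA.
X_std = StandardScaler().fit_transform(X)
```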

PCA

To conduct a principal component analysis, we project the data onto principal components: linearly uncorrelated, orthogonal axes. We want our principal components to capture as much variance as possible, in order to get the best dimensionality-reduced representation of the dataset. Therefore, when we do a PCA, we need to check the proportion of variance explained by each component and decide which components to keep. Here is how this looks for the 4 principal components of our dataset.
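A sketch of this step with scikit-learn: fit PCA with all 4 components on the standardized features and print the proportion of variance each component explains.

```python
from sklearn.decomposition import PCA

# Fit PCA with all 4 components and inspect the variance each one explains.
pca = PCA(n_components=4)
pca.fit(X_std)
for i, ratio in enumerate(pca.explained_variance_ratio_, start=1):
    print(f'PC{i}: {ratio:.4f}')
```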

We can see that PC1 explains the largest proportion of variance, and PC2 also explains a major share. However, PC3 and PC4 explain little of the variance. Therefore, we can discard PC3 and PC4, reducing the dimensionality of the dataset from 4 to 2. The proportions of variance explained by PC1 and PC2 sum to 0.8683.
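Keeping only the first two components then looks like this (the 0.8683 figure is from the run described above; your exact numbers may differ slightly):

```python
# Refit, keeping only the first two components: 4-D -> 2-D.
pca2 = PCA(n_components=2)
X_2d = pca2.fit_transform(X_std)
print(X_2d.shape)                            # (1372, 2)
print(pca2.explained_variance_ratio_.sum())  # ~0.8683 in the run above
```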

In this way, we cut the size of the original feature set in half while still explaining 86.83% of its variance. Because the new feature set is 2-dimensional, we can even visualize it. Here is the new feature set:
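A plotting sketch with matplotlib, scattering the two retained components and coloring the points by class label:

```python
import matplotlib.pyplot as plt

# Scatter the two retained components, colored by class label.
for label in (0, 1):
    plt.scatter(X_2d[y == label, 0], X_2d[y == label, 1],
                label=f'class {label}', alpha=0.5)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()
```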

From the graph above, we can see that our classes are roughly separable along the 2 PCs. We now have a new dataset that is half the size and still good for classification.