
Explaining Principal Component Analysis using Python in Machine Learning

In this post, we will look at the maths behind Principal Component Analysis using Python and then see how to implement it using the Sklearn library.

Whatever you do in your day-to-day life, you generate a tremendous amount of data that businesses can use to improve their products and offer you better, more relevant services. For example, when you use social networking websites like Facebook and Twitter, they collect data on what you like or love, share/retweet, post/tweet, and even your location. In fact, it is often said that 90% of the data in the world has been generated in the last three years. And it is not only the volume of data that is growing; the number of features in the data is growing as well.

Datasets with a large number of features are called high-dimensional datasets. A large number of features affects a machine learning model in two ways: the time taken to train it and its accuracy. There are various ways to reduce the number of feature variables in our data, and Principal Component Analysis is one of the most widely used.

So, in a nutshell, Principal Component Analysis, or PCA, is all about finding the directions of maximum variance in high-dimensional data and projecting it onto a lower-dimensional subspace while retaining most of the information.

Steps we will follow in this post for explaining Principal Component Analysis using Python:

  • Understanding the basics
  • Perform Exploratory Data Analysis on Boston Housing Dataset
  • Standardize the data
  • Creating Covariance Matrix
  • Finding Eigenvectors and Eigenvalues
  • Plotting Variance of Principal Component
  • Principal Component Analysis using sklearn
Understanding the Basics:

Variance

Variance is a measure of the variation of a single random variable. The sample variance of a variable x is given by

σ²(x) = (1 / (n − 1)) Σᵢ (xᵢ − x̄)²

where n is the number of samples or data points and x̄ (x bar) is the mean of the variable x.

Covariance

Covariance is a measure of how much two random variables vary together. The covariance σ(x, y) of two random variables x and y is given by

σ(x, y) = (1 / (n − 1)) Σᵢ (xᵢ − x̄)(yᵢ − ȳ)

where n is the number of samples or data points, and x̄ and ȳ are the means of x and y respectively. Also, the variance of a random variable can be expressed as its covariance with itself:

σ(x, x) = σ²(x)
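As a quick sanity check, here is a minimal sketch (with made-up sample values of our own) that computes the variance and covariance by hand and compares them with NumPy's built-ins:

import numpy as np

# Two made-up variables with n = 5 samples each (illustration only)
x = np.array([2.1, 2.5, 3.6, 4.0, 4.8])
y = np.array([8.0, 10.0, 12.5, 13.0, 16.0])

n = len(x)
x_bar, y_bar = x.mean(), y.mean()

# Sample variance: squared deviations from the mean, divided by (n - 1)
var_x = np.sum((x - x_bar) ** 2) / (n - 1)

# Sample covariance: products of paired deviations, divided by (n - 1)
cov_xy = np.sum((x - x_bar) * (y - y_bar)) / (n - 1)

print(var_x, np.var(x, ddof=1))    # both values match
print(cov_xy, np.cov(x, y)[0, 1])  # np.cov returns the full 2x2 covariance matrix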

Covariance Matrix

The covariance matrix, sometimes called the variance-covariance matrix, represents the covariance between each pair of features in the original dataset: entry (i, j) holds σ(xᵢ, xⱼ), so the diagonal holds the variances. For a data matrix X whose columns are mean-centred features, the covariance matrix is given by

C = XᵀX / (n − 1)

Principal Component Analysis

It is the most fundamental dimensionality-reduction technique, used to transform a high-dimensional dataset into one with fewer features, where the remaining features explain the maximum variance within the dataset. It is called Principal Component Analysis because it expresses the data in terms of its ‘principal components’.

Principal Component Analysis, aka PCA, performs dimension reduction in two steps. First, it performs decorrelation: it shifts the data samples so that they have zero mean, and rotates them so that they are aligned with the coordinate axes, leaving the resulting features linearly uncorrelated. In the second step, it reduces the dimension by keeping only the highest-variance directions, as sketched below.
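Here is a minimal sketch of those two steps with NumPy (the function name and the n_keep parameter are our own, for illustration): center the data, rotate it onto the eigenvectors of its covariance matrix so the new features are decorrelated, then keep only the leading directions.

import numpy as np

def pca_two_steps(X, n_keep):
    # Step 1: decorrelation -- shift to zero mean, then rotate the samples
    # onto the eigenvectors of the covariance matrix
    X_centered = X - X.mean(axis=0)
    eig_vals, eig_vecs = np.linalg.eigh(np.cov(X_centered.T))  # eigh: the matrix is symmetric
    order = np.argsort(eig_vals)[::-1]  # highest-variance directions first
    # Step 2: dimension reduction -- project onto the top n_keep directions
    return X_centered.dot(eig_vecs[:, order[:n_keep]])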

Eigenvectors and Eigenvalues:

The eigenvectors (principal components) determine the directions of the new feature space, and the eigenvalues determine their magnitude. In other words, the eigenvalues give the variance of the data along the new feature axes. An eigenvector of a matrix A is a non-zero vector X such that when X is multiplied by A, the direction of the resulting vector remains the same as that of X.

Mathematically, the above statement can be represented as:

AX = λX

where A is a square matrix, λ is an eigenvalue, and X is the eigenvector corresponding to that eigenvalue.

Here, we can see that AX is parallel to X. So, X is an eigenvector.
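We can verify this numerically. In the small sketch below (the 2×2 matrix is our own example), each eigenvector X returned by NumPy satisfies AX = λX up to floating-point error:

import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
eig_vals, eig_vecs = np.linalg.eig(A)

for i in range(len(eig_vals)):
    X = eig_vecs[:, i]  # the i-th eigenvector is the i-th column
    # AX and lambda*X point in the same direction with the same length
    print(np.allclose(A.dot(X), eig_vals[i] * X))  # True for every eigenpair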

Perform Exploratory Data Analysis (EDA) on Boston Housing Dataset

We have already performed EDA on the Boston Housing Dataset and selected important features using heatmap and .corr() method, which you can check here.

Standardize the data by transforming it onto a unit scale (mean = 0 and variance = 1)
from sklearn.preprocessing import StandardScaler

# X holds the feature columns selected during EDA on the Boston Housing Dataset
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
print(X_std)
Output:
[[-0.41978194  0.28482986 -1.2879095  ... -1.45900038  0.44105193
  -1.0755623 ]
 [-0.41733926 -0.48772236 -0.59338101 ... -0.30309415  0.44105193
  -0.49243937]
 [-0.41734159 -0.48772236 -0.59338101 ... -0.30309415  0.39642699
  -1.2087274 ]
 ...
 [-0.41344658 -0.48772236  0.11573841 ...  1.17646583  0.44105193
  -0.98304761]
 [-0.40776407 -0.48772236  0.11573841 ...  1.17646583  0.4032249
  -0.86530163]
 [-0.41500016 -0.48772236  0.11573841 ...  1.17646583  0.44105193
  -0.66905833]]
Creating Covariance Matrix

We can create the covariance matrix in two different ways:

  • Manually
import numpy as np

mean_X = np.mean(X_std, axis=0)  # effectively zero, since the data is standardized
cov_mat = (X_std - mean_X).T.dot(X_std - mean_X) / (X_std.shape[0] - 1)
print('Covariance matrix is {}'.format(cov_mat))
  • Using NumPy’s np.cov() function
print('Covariance matrix is {}'.format(np.cov(X_std.T)))

Both snippets, shown above, give the same output:

Covariance matrix is [[ 1.0019802 -0.20086619 0.40738853 -0.05600226 0.42180532 -0.21968085
0.29051973 -0.38582644 0.4565237 ]
[-0.20086619 1.0019802 -0.53488527 -0.04278127 -0.51762669 0.31260839
-0.39245415 0.17586788 -0.41381239]
[ 0.40738853 -0.53488527 1.0019802 0.06306266 0.76516363 -0.39245145
0.38400646 -0.35768342 0.60499536]
[-0.05600226 -0.04278127 0.06306266 1.0019802 0.09138341 0.09143192
-0.1217558 0.0488851 -0.05403609]
[ 0.42180532 -0.51762669 0.76516363 0.09138341 1.0019802 -0.30278658
0.1893068 -0.38080321 0.59204898]
[-0.21968085 0.31260839 -0.39245145 0.09143192 -0.30278658 1.0019802
-0.35620546 0.12832224 -0.61502373]
[ 0.29051973 -0.39245415 0.38400646 -0.1217558 0.1893068 -0.35620546
1.0019802 -0.17773456 0.374785 ]
[-0.38582644 0.17586788 -0.35768342 0.0488851 -0.38080321 0.12832224
-0.17773456 1.0019802 -0.36681183]
[ 0.4565237 -0.41381239 0.60499536 -0.05403609 0.59204898 -0.61502373
0.374785 -0.36681183 1.0019802 ]]
Finding Eigenvectors and Eigenvalues
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
print('Eigenvectors: {}' .format(eig_vecs))
print('\nEigenvalues: {}' .format(eig_vals))
Output:
Eigenvectors: [[ 3.12224012e-01 -1.95517544e-02 -4.68369150e-01  4.57555286e-02
  -1.54886863e-01  3.28439583e-01 -6.32566548e-01 -3.85107103e-01
   1.89801963e-02]
 [-3.38150316e-01 -9.70077933e-02 -3.86151614e-01 -9.93079011e-02
  -2.16046531e-02 -6.59970498e-01 -5.96957508e-02 -2.44495854e-01
   4.66796636e-01]
 [ 4.34246222e-01  1.91286171e-01  6.68143835e-02  6.21895561e-01
  -3.04663129e-01 -4.91372191e-01 -9.17394909e-02  1.85108131e-01
  -8.16955461e-02]
 [-1.34291163e-02  7.48288801e-01  2.88012523e-01 -5.71548134e-03
   1.99175832e-02  3.87572397e-02  1.18570873e-01 -5.61696212e-01
   1.59496486e-01]
 [ 4.10440590e-01  3.29070944e-01 -4.70250342e-02 -7.11540271e-01
  -2.06246196e-02 -2.37661160e-01 -1.62915636e-01  3.62278427e-01
   7.38642360e-03]
 [-3.12440932e-01  3.39105183e-01 -3.27301069e-01  1.21931526e-01
   4.87061251e-01 -1.69041127e-01 -1.90882353e-01  8.61089252e-02
  -5.96439996e-01]
 [ 2.83701037e-01 -3.87948948e-01  2.24318644e-01 -2.23638377e-01
   7.40202010e-02 -3.40593243e-01  9.28789297e-02 -5.46622854e-01
  -4.89663732e-01]
 [-2.66141125e-01 -9.56589197e-02  6.19690773e-01 -6.13061973e-03
   9.54222179e-02 -1.12052221e-01 -7.08190854e-01  6.84294266e-02
   8.99225352e-02]
 [ 4.26640617e-01 -1.03661286e-01  2.07133681e-02  1.73346630e-01
   7.93781166e-01 -6.64708942e-04  2.61234588e-02  2.66566436e-03
   3.82067405e-01]]

Eigenvalues: [3.77954317 1.17109261 1.05268106 0.19845201 0.28869715 0.42678896
 0.6014857  0.72703603 0.7720451 ]
Plotting Variance of Principal Component

Note that np.linalg.eig does not return the eigenvalues in any particular order, so to plot we first make a list of (eigenvalue, eigenvector) tuples as shown below:

eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]

Then sort the tuples in descending order, i.e. from high variance to low variance, using the Python code shown below:

eig_pairs.sort(key=lambda x: x[0], reverse=True)

Now we can print the sorted list of eigenvalues as shown below:

print('Eigenvalues in descending order:')
for i in eig_pairs:
    print(i[0])
Output:
Eigenvalues in descending order:
3.77954317196245
1.1710926095589433
1.0526810592103752
0.7720451020290618
0.727036025137825
0.6014857024034839
0.42678895636646985
0.2886971497872246
0.19845200572239327

Now, to find the percentage of variance explained by each principal component, we can write the code below:

total = sum(eig_vals)
var_exp = [(i / total)*100 for i in sorted(eig_vals, reverse=True)]

Finally, plot the principal components in descending order of explained variance as shown below:

import matplotlib.pyplot as plt

with plt.style.context('fivethirtyeight'):
    plt.figure(figsize=(6, 4))
    plt.bar(range(9), var_exp, alpha=0.5, align='center', label='explained variance')
    plt.ylabel('Explained variance (%)')
    plt.xlabel('Principal components')
    plt.legend(loc='best')
    plt.tight_layout()
plt.show()

The resulting bar chart shows how much variance each principal component explains, in descending order. That is how you can perform Principal Component Analysis from scratch using Python.

Principal Component Analysis using sklearn

The sklearn library provides a quick and easy way to perform all of the above steps under the hood.

from sklearn.decomposition import PCA

pca = PCA(n_components=5)          # keep the top 5 principal components
X_pca = pca.fit_transform(X_std)   # fit on the standardized data from above
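As a quick check, the fitted model's explained_variance_ratio_ attribute should agree with the percentages we computed by hand from the sorted eigenvalues:

# Percentage of variance captured by each of the 5 retained components;
# this should match the leading entries of var_exp computed above
print(pca.explained_variance_ratio_ * 100)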

For complete information on how to use Principal Component Analysis using sklearn, check out our post here. Also, you can check the official Sklearn documentation here. Read more articles on Machine Learning from here. You can also download the complete Jupyter Notebook from Github.

Hope you liked our post and learned the basics behind Principal Component Analysis using Python. Do share it. 🙂
