18 - Principal Component Analysis
Principal Component Analysis (PCA) is a dimensionality reduction technique that can be used to reduce a large set of variables to a small set that still contains most of the information in the large set.
Using PCA it's possible to transform a set of features into principal components. These principal components are linear combinations of the original features. The first principal component is the linear combination that accounts for the largest possible variance in the data set. The second principal component must be orthogonal to the first principal component and must account for the largest possible variance given this restriction. The third principal component must be orthogonal to the first two and so on.
Principal components don't have a one-to-one correspondence with the original features: each component is a combination of all the features, so it doesn't have a direct interpretation.
PCA is an unsupervised learning algorithm: it doesn't use the labels.
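As a rough illustration of the idea (a minimal sketch on synthetic data, not how we'll compute it in practice), the principal components can be obtained by hand as the eigenvectors of the covariance matrix of the centered data, sorted by decreasing eigenvalue:
import numpy as np
# small synthetic dataset: 100 samples, 3 correlated features (illustrative only)
rng = np.random.RandomState(0)
X = rng.randn(100, 3) @ rng.randn(3, 3)
# center the data and compute its covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
# eigenvectors of the covariance matrix are the principal components;
# eigenvalues give the variance explained by each component
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, order].T   # one component per row, sorted by variance
explained_variance = eigenvalues[order]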
Python Implementation
Scikit-learn has a PCA class that can be used to perform PCA in Python. The PCA class takes a parameter n_components that specifies the number of principal components to keep; if n_components is not set, all components are kept.
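For example, a few ways of constructing it (a rough sketch; the float form keeps however many components are needed to explain that fraction of the variance):
from sklearn.decomposition import PCA
pca_all = PCA()                  # keep all components
pca_two = PCA(n_components=2)    # keep exactly two components
pca_95 = PCA(n_components=0.95)  # keep enough components to explain 95% of the variance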
Dataset
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
%matplotlib inline
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
# put the 30 features into a DataFrame, one column per feature
df = pd.DataFrame(cancer['data'], columns=cancer['feature_names'])
Scaling
PCA is affected by scale, so the features need to be standardized before applying it. StandardScaler standardizes each feature to unit scale (mean = 0 and variance = 1), which is a requirement for the optimal performance of many machine learning algorithms.
Once the data is scaled, we can use PCA to find the first two principal components, i.e. the two directions that explain the most variance in the data.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df)
scaled_df = scaler.transform(df)
PCA
from sklearn.decomposition import PCA
pca = PCA(
# number of components to keep
n_components=2
)
# find the principal components of the scaled data
pca.fit(scaled_df)
# project the data onto the first two principal components
x_pca = pca.transform(scaled_df)
Dataset before transformation has 30 features:
df.shape
#(569, 30)
Dataset after transformation has 2 features:
x_pca.shape
#(569, 2)
Plotting
By plotting the first two principal components, we can see a clear separation between the two classes, malignant and benign.
plt.scatter(
# first principal component on the x axis
x_pca[:,0],
# second principal component on the y axis
x_pca[:,1],
# color each point by its target class, malignant or benign
c=cancer['target'],
cmap='plasma'
)
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
The power of PCA is that, even after compressing the data down to two dimensions, we still get a very good separation between the two classes while retaining a large share of the variance.
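We can put a number on that with explained_variance_ratio_, which reports the fraction of the total variance captured by each retained component (the exact values depend on the data; the ones shown here are approximate):
pca.explained_variance_ratio_
# approximately [0.44, 0.19]: the first two components together
# explain roughly 63% of the total variance
pca.explained_variance_ratio_.sum()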
We can also look at the components themselves, each row represents a principal component, and each column relates back to the original features:
pca.components_
# array([[ 0.21890244, 0.10372458, 0.22753729, 0.22099499, 0.14258969,
# 0.23928535, 0.25840048, 0.26085376, 0.13816696, 0.06436335,
# 0.20597878, 0.01742803, 0.21132592, 0.20286964, 0.01453145,
# 0.17039345, 0.15358979, 0.1834174 , 0.04249842, 0.10256832,
# 0.22799663, 0.10446933, 0.23663968, 0.22487053, 0.12795256,
# 0.21009588, 0.22876753, 0.25088597, 0.12290456, 0.13178394],
# [-0.23385713, -0.05970609, -0.21518136, -0.23107671, 0.18611302,
# 0.15189161, 0.06016536, -0.0347675 , 0.19034877, 0.36657547,
# -0.10555215, 0.08997968, -0.08945723, -0.15229263, 0.20443045,
# 0.2327159 , 0.19720728, 0.13032156, 0.183848 , 0.28009203,
# -0.21986638, -0.0454673 , -0.19987843, -0.21935186, 0.17230435,
# 0.14359317, 0.09796411, -0.00825724, 0.14188335, 0.27533947]])
To get back to the original 30-dimensional feature space, we can use inverse_transform (with only two components kept, this is an approximate reconstruction, not the exact original data):
pca.inverse_transform(x_pca)
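The reconstruction has the same shape as the scaled input; a minimal sketch of checking it, assuming the variables defined above:
reconstructed = pca.inverse_transform(x_pca)
# back in the original 30-dimensional feature space
reconstructed.shape
# (569, 30)
# mean squared error between the scaled data and its 2-component reconstruction
np.mean((scaled_df - reconstructed) ** 2)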
To relate the components back to the original features instead, we can put components_ into a DataFrame with the feature names as columns:
df_comp = pd.DataFrame(pca.components_, columns=cancer['feature_names'])
# mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst radius worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension
# 0 0.218902 0.103725 0.227537 0.220995 0.142590 0.239285 0.258400 0.260854 0.138167 0.064363 ... 0.227997 0.104469 0.236640 0.224871 0.127953 0.210096 0.228768 0.250886 0.122905 0.131784
# 1 -0.233857 -0.059706 -0.215181 -0.231077 0.186113 0.151892 0.060165 -0.034768 0.190349 0.366575 ... -0.219866 -0.045467 -0.199878 -0.219352 0.172304 0.143593 0.097964 -0.008257 0.141883 0.275339
We can visualize this better using a heatmap that shows how strongly each original feature contributes to each principal component.
plt.figure(figsize=(12,6))
sns.heatmap(df_comp, cmap='plasma')
Finally, it's possible to use a classification algorithm to predict the target from the PCA components instead of the original data, as sketched below.
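A rough sketch of that idea, using logistic regression as one possible classifier (the split and model choice here are assumptions, not part of the original analysis):
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# train on the two principal components only
X_train, X_test, y_train, y_test = train_test_split(x_pca, cancer['target'], test_size=0.3, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
# accuracy on the held-out test set
model.score(X_test, y_test)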