Data Science: Using Python to Perform Factor and Principal Component Analysis

Python Essentials For Dummies

Data scientists can use Python to perform factor and principal component analysis. Singular value decomposition (SVD) operates directly on the numeric values in data, but you can also express data as a relationship between variables. Each feature has a certain variation, which you can calculate as the variance around the mean. The greater the variance, the more information the variable contains.

In addition, if you place the variable into a set, you can compare how two variables vary together to determine whether they correlate. Correlation measures how strongly the values of the two variables move together.
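As a rough illustration of these two measures, the following sketch computes the variance of each Iris feature and the correlation between two of them. The choice of the Iris dataset and of the petal columns is an assumption made here for the example only.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data

# Sample variance of each feature around its mean (one value per column).
variances = X.var(axis=0, ddof=1)
for name, value in zip(iris.feature_names, variances):
    print(f"{name}: variance = {value:.3f}")

# Pearson correlation between petal length (column 2) and petal width (column 3).
corr = np.corrcoef(X[:, 2], X[:, 3])[0, 1]
print(f"petal length vs. petal width: correlation = {corr:.3f}")
```

A correlation close to +1 or -1 suggests that the two features carry much of the same information.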

By checking all the possible correlations of a variable with the others in the set, you may discover two types of variance:

Unique variance: Some variance is unique to the variable under examination. It cannot be associated with what happens to any other variable.
Shared variance: Some variance is shared with one or more other variables, creating redundancy in the data. Redundancy implies that you can find the same information, with slightly different values, in various features and across many observations.
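One simple way to approximate this split, offered here only as a sketch and not as the method factor analysis itself uses, is to regress each feature on all the others. The R-squared of that regression estimates the shared share of the variance, and the remainder estimates the unique share. The use of LinearRegression and the Iris data are assumptions of this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression

iris = load_iris()
X = iris.data

shared = {}
for i, name in enumerate(iris.feature_names):
    # Predict feature i from the remaining features.
    others = np.delete(X, i, axis=1)
    model = LinearRegression().fit(others, X[:, i])
    # In-sample R-squared: proportion of variance explained by the other features.
    r2 = model.score(others, X[:, i])
    shared[name] = r2
    print(f"{name}: shared = {r2:.2f}, unique = {1 - r2:.2f}")
```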

Of course, the next step is to determine the reason for shared variance. Trying to answer such a question, as well as determining how to deal with unique and shared variances, led to the creation of factor and principal component analysis.

Considering the psychometric model

Long before many machine-learning algorithms were thought up, psychometrics, the discipline in psychology that is concerned with psychological measurement, tried to find a statistical solution to effectively measure dimensions in personality. The human personality, as with other aspects of human beings, is not directly measurable. For example, it isn’t possible to measure precisely how introverted or intelligent a person is. Questionnaires and psychological tests only hint at these values.

Psychologists knew of SVD and tried to apply it to the problem. Shared variance attracted their attention: If some variables are almost the same, they should have the same root cause, they thought. Psychologists created factor analysis to perform this task! Instead of applying SVD directly to data, they applied it to a newly created matrix tracking the common variance, in the hope of condensing all the information and recovering new useful features called factors.

Looking for hidden factors

A good way to show how to use factor analysis is to start with the Iris dataset.

from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis
iris = load_iris()
X, y = iris.data, iris.target
factor = FactorAnalysis(n_components=4, random_state=101).fit(X)

After the code loads the data and stores all the predictive features in X, it initializes the FactorAnalysis class with a request to look for four factors and then fits the data. You can explore the results by observing the components_ attribute, which returns an array containing measures of the relationship between the newly created factors, placed in rows, and the original features, placed in columns.

At the intersection of each factor and feature, a positive number indicates that the two move in the same direction. A negative number indicates that they move in opposite directions: as one increases, the other decreases.

You’ll have to test different values of n_components because it isn’t possible to know in advance how many factors exist in the data. If you ask the algorithm for more factors than exist, it generates factors with low values in the components_ array.

import pandas as pd
print(pd.DataFrame(factor.components_, columns=iris.feature_names))
 sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0   0.707227   -0.153147   1.653151   0.701569
1   0.114676   0.159763   -0.045604   -0.014052
2   0.000000   -0.000000   -0.000000   -0.000000
3   -0.000000   0.000000   0.000000   -0.000000

In the test on the Iris dataset, for example, you should keep at most two factors, not four, because only two factors have significant connections with the original features. You can use these two factors as new variables in your project because they reflect an unseen but important feature that the previously available data only hinted at.
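The following sketch shows one way to act on this observation: count the factors whose loadings are not all close to zero, and then refit with two factors to obtain the new variables. The 1e-3 cutoff and the variable names are assumptions chosen for illustration, not values from the original example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis

X = load_iris().data

# Fit with four factors and count those with at least one loading
# above an arbitrary cutoff (1e-3 is an assumption for this sketch).
fa4 = FactorAnalysis(n_components=4, random_state=101).fit(X)
n_significant = int((np.abs(fa4.components_).max(axis=1) > 1e-3).sum())
print("Factors with non-negligible loadings:", n_significant)

# Refit with two factors and compute the factor scores for each flower.
fa2 = FactorAnalysis(n_components=2, random_state=101).fit(X)
scores = fa2.transform(X)
print(scores.shape)  # (150, 2): one row per observation, one column per factor
```

The scores array can then stand in for the original four features in later steps of a project.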

Using components, not factors

If an SVD could be successfully applied to the common variance, you might wonder why you can’t apply it to all the variances. Using a slightly modified starting matrix, all the relationships in the data could be reduced and compressed in a similar way to how SVD does it.

This process, whose results are quite similar to those of SVD, is called principal component analysis (PCA). The newly created features are named components. In contrast to factors, components aren’t described as the root cause of the data structure but are just restructured data, so you can view them as a big, smart summation of selected variables.
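To make the link to SVD concrete, the sketch below compares scikit-learn's PCA with a plain NumPy SVD of the mean-centered Iris data. Treating centering as the "slightly modified starting matrix" is an interpretation made here. Signs of the directions can differ between the two routines, so the comparison uses absolute values.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# Center each feature on its mean, then decompose with SVD.
X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

pca = PCA().fit(X)

# Rows of Vt are the component directions (up to sign).
same_directions = np.allclose(np.abs(Vt), np.abs(pca.components_), atol=1e-6)

# Squared singular values divided by (n - 1) give the variance per component.
svd_variance = s ** 2 / (X.shape[0] - 1)
same_variance = np.allclose(svd_variance, pca.explained_variance_)

print(same_directions, same_variance)
```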

For data science applications, PCA and SVD are quite similar. However, when PCA works on correlation measures (which are all bound between -1 and +1), it isn’t affected by the scale of the original features, and it focuses on rebuilding the relationship between the variables, thus offering different results from SVD. Note that the PCA class in scikit-learn centers the data but doesn’t rescale it, so you need to standardize the features first to obtain this scale-independent behavior.
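The scale independence mentioned here applies when the components are derived from correlations, that is, from standardized features. The sketch below checks this on the Iris data. StandardScaler is an addition not found in the original example, used because scikit-learn's PCA only centers the data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data
X_std = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature

raw_ratio = PCA().fit(X).explained_variance_ratio_
std_ratio = PCA().fit(X_std).explained_variance_ratio_

# Eigenvalues of the correlation matrix, largest first, as proportions.
eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]
corr_ratio = eigvals / eigvals.sum()

print("Raw features:         ", raw_ratio)
print("Standardized features:", std_ratio)
print("Correlation matrix:   ", corr_ratio)
```

The last two lines should agree with each other, whereas the first reflects the original measurement scales.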

Achieving dimensionality reduction

The procedure to obtain a PCA is quite similar to that of factor analysis. The difference is that you don’t need to specify the number of components to extract. You decide later how many components to keep after checking the explained_variance_ratio_ attribute, which quantifies the informative value of each extracted component. The following example shows how to perform this task:

from sklearn.decomposition import PCA
import pandas as pd
pca = PCA().fit(X)
print('Explained variance by component: %s' % pca.explained_variance_ratio_)
print(pd.DataFrame(pca.components_, columns=iris.feature_names))
Explained variance by component: [ 0.92461621 0.05301557 0.01718514 0.00518309]
 sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0   0.361590   -0.082269   0.856572   0.358844
1   -0.656540   -0.729712   0.175767   0.074706
2   0.580997   -0.596418   -0.072524   -0.549061
3   0.317255   -0.324094   -0.479719   0.751121

In this decomposition of the Iris dataset, the vector array provided by explained_variance_ratio_ indicates that most of the information is concentrated into the first component (92.5 percent), and that the first two components together account for about 97.8 percent. It’s therefore possible to reduce the entire dataset to just two components, providing a reduction of noise and redundant information from the original dataset.
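A minimal sketch of that reduction follows. It first prints the cumulative explained variance to support the choice of two components, then projects the data. The variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# Running total of the explained variance ratios, component by component.
cumulative = np.cumsum(PCA().fit(X).explained_variance_ratio_)
print(cumulative)

# Keep only the first two components and project the observations onto them.
pca2 = PCA(n_components=2)
X_reduced = pca2.fit_transform(X)
print(X_reduced.shape)  # (150, 2)
```

You can pass X_reduced to a later model or plot in place of the four original features.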

About This Article

About the book author:

John Paul Mueller is a freelance author and technical editor. He has writing in his blood, having produced 100 books and more than 600 articles to date. The topics range from networking to home security and from database management to heads-down programming. John has provided technical services to both Data Based Advisor and Coast Compute magazines.

Luca Massaron is a data scientist specialized in organizing and interpreting big data and transforming it into smart data by means of the simplest and most effective data mining and machine learning techniques. Because of his job as a quantitative marketing consultant and marketing researcher, he has been involved in quantitative data since 2000 with different clients and in various industries, and is one of the top 10 Kaggle data scientists.