Applying Principal Component Analysis to Predictive Analytics - dummies

By Dr. Anasse Bari, Mohamed Chaouchi, Tommy Jung

Principal component analysis (PCA) is a valuable technique that is widely used in predictive analytics and data science. It studies a dataset to find the directions, expressed as combinations of the original variables, that account for the most variation in that dataset. PCA is mostly used as a data reduction technique.

While building predictive models, you may need to reduce the number of features describing your dataset. Reducing this high dimensionality through approximation is very useful, and it’s a task at which PCA excels. The approximated data summarizes most of the important variation in the original data.

For example, the feature set of data about stocks may include stock prices, daily highs and lows, trading volumes, 200-day moving averages, price-to-earnings ratios, relative strength to other markets, interest rates, and strength of currencies.
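As a rough illustration of that kind of reduction, here is a minimal sketch that uses NumPy's singular value decomposition. The data is synthetic: the six columns, the two hidden factors, and the loading values are all made up for the example and are not real stock features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a feature table: 200 rows, 6 correlated columns
# driven by 2 hidden factors plus a little noise (all values are made up).
n_samples = 200
loadings = np.array([[1.0, 0.8, 0.6, 0.0, 0.2, -0.5],
                     [0.0, 0.3, -0.4, 1.0, 0.9, 0.7]])
factors = rng.normal(size=(n_samples, 2))
X = factors @ loadings + 0.05 * rng.normal(size=(n_samples, 6))

# Standardize each column so no feature dominates because of its units.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via singular value decomposition of the standardized data.
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)
explained_ratio = S**2 / np.sum(S**2)

# Keep the first k components and project the data onto them.
k = 2
X_reduced = X_std @ Vt[:k].T

# Rebuild an approximation of the standardized data from the k components.
X_approx = X_reduced @ Vt[:k]
relative_error = np.linalg.norm(X_std - X_approx) / np.linalg.norm(X_std)

print("explained variance ratio:", np.round(explained_ratio, 3))
print("reduced shape:", X_reduced.shape)
print("relative reconstruction error:", round(float(relative_error), 4))
```

Because the six columns were generated from only two factors, two components reproduce the data almost entirely. Real datasets rarely compress this cleanly, so the explained variance ratio is worth inspecting before choosing how many components to keep.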

Finding the most important predictive variables is at the core of building a predictive model. Many practitioners do this with a brute-force approach: start with as many relevant variables as you can, and then use a funnel approach to eliminate features that have no impact or no predictive value.

Intelligence and insight are brought to this method by engaging business stakeholders, because they have hunches about which variables will have the biggest impact in the analysis. The experience of the data scientists engaged in the project is also important in knowing which variables to work with and which algorithms to use for a specific data type or domain-specific problem.

To help with the process, data scientists employ many predictive analytics tools that make it easier and faster to run multiple permutations and analyses on a dataset in order to measure the impact of each variable on that dataset.

When you have a large number of variables to work with, PCA can help.

Reducing the number of variables you look at is reason enough to employ PCA. In addition, working with fewer inputs can reduce the risk of overfitting the model, although PCA by itself doesn’t guarantee it.

Certainly, you could find a correlation between weather data in a given country and the performance of its stock market, or between the color of a person’s shoes, the route he or she takes to the office, and the performance of that person’s portfolio for that day. However, including those variables in a predictive model is more than just overfitting; it’s misleading and leads to false predictions.

PCA uses a mathematically valid approach to determine a small set of components, each a weighted combination of your original variables, that captures most of the variation in your dataset; a model built on that smaller dataset can retain predictive value for the overall, bigger dataset you’re working with. In short, PCA should help you make sense of your variables by identifying the combinations of variables responsible for the most variation in your original dataset. It helps you spot redundancy. It helps you find out that two (or more) variables are telling you the same thing.
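To make the redundancy point concrete, the sketch below builds three made-up variables in which one is nearly a copy of another. The names close, moving_avg, and volume are hypothetical labels for synthetic columns, not real market data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical features: `moving_avg` is nearly a copy of `close`,
# while `volume` is generated independently of both.
close = rng.normal(size=n)
moving_avg = close + 0.1 * rng.normal(size=n)
volume = rng.normal(size=n)

X = np.column_stack([close, moving_avg, volume])
corr = np.corrcoef(X, rowvar=False)

# Eigen-decomposition of the correlation matrix is PCA on standardized data.
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]          # eigh returns ascending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
explained_ratio = eigvals / eigvals.sum()

print("correlation(close, moving_avg):", round(float(corr[0, 1]), 3))
print("explained variance ratio:", np.round(explained_ratio, 4))
```

The last component carries almost no variance, which is the numerical sign that three columns hold only about two variables' worth of information.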

Moreover, principal component analysis takes your multidimensional dataset and produces a new dataset whose variables are linear combinations of the variables in the original dataset. In addition, the variables in the output dataset are uncorrelated with one another, and they are ordered by the variance they capture: the first principal component has the largest variance, the second has the next largest, and so on. In this regard, PCA can also be considered a technique for constructing features.
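These two properties of the output, uncorrelated columns and decreasing variance, can be checked directly. The following is a small sketch on made-up correlated data; the mixing matrix is an arbitrary assumption chosen only to create correlation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up correlated data: 300 observations of 4 variables.
mixing = np.array([[2.0, 0.0, 0.0, 0.0],
                   [1.5, 1.0, 0.0, 0.0],
                   [0.5, 0.5, 0.5, 0.0],
                   [1.0, 0.2, 0.1, 0.3]])
X = rng.normal(size=(300, 4)) @ mixing.T
Xc = X - X.mean(axis=0)                    # center each column

# Rows of Vt are the principal axes (weights of the linear combinations).
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                         # new dataset, one column per component

score_cov = np.cov(scores, rowvar=False)
component_variances = np.diag(score_cov)
off_diagonal = score_cov - np.diag(component_variances)

print("component variances:", np.round(component_variances, 4))
print("largest off-diagonal covariance:", float(np.abs(off_diagonal).max()))
```

The off-diagonal covariances are zero up to floating-point rounding, and the variances appear from largest to smallest.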

While employing PCA or similar techniques that reduce the dimensionality of your dataset, always take care not to hurt the performance of the model. Reducing the size of the data should not come at the expense of the accuracy of the predictive model. Tread carefully and manage your dataset with care.
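One way to check that a reduction isn't costing accuracy is to compare held-out error for different numbers of components. The sketch below does this with a plain least-squares model on synthetic data. The data-generating setup, the 300/100 split, and the helper holdout_rmse are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data: 10 correlated features built from 3 hidden factors,
# with a target that depends on those factors (all values are made up).
n, p = 400, 10
factors = rng.normal(size=(n, 3))
X = factors @ rng.normal(size=(3, p)) + 0.1 * rng.normal(size=(n, p))
y = factors @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

train, test = slice(0, 300), slice(300, n)

# Fit the PCA on training rows only, then reuse its mean and axes on test rows.
mean = X[train].mean(axis=0)
_, _, Vt = np.linalg.svd(X[train] - mean, full_matrices=False)

def holdout_rmse(k):
    """Least-squares fit on the first k components; RMSE on held-out rows."""
    Z_train = (X[train] - mean) @ Vt[:k].T
    Z_test = (X[test] - mean) @ Vt[:k].T
    A_train = np.column_stack([np.ones(len(Z_train)), Z_train])
    A_test = np.column_stack([np.ones(len(Z_test)), Z_test])
    coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
    return float(np.sqrt(np.mean((A_test @ coef - y[test]) ** 2)))

rmse_by_k = {k: holdout_rmse(k) for k in range(1, p + 1)}
for k, rmse in rmse_by_k.items():
    print(f"k={k:2d}  held-out RMSE={rmse:.3f}")
```

If the error stops improving after a few components, the remaining ones can usually be dropped without hurting the model. If it keeps improving, the reduction is discarding something the model needs.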

The increased complexity of a model doesn’t necessarily translate to higher quality in the outcome.

To preserve the performance of the model, you may need to carefully evaluate the effectiveness of each variable, measuring its usefulness in the shaping of the final model.

PCA is especially useful when the variables in a dataset are highly correlated. When the predictive variables are uncorrelated, PCA has little redundancy to remove, which complicates the task of reducing the dimensionality of multivariate data. Many other techniques can be used in addition to PCA, such as forward feature selection and backward feature elimination.

PCA is not a magic bullet that will solve all issues with multidimensional data. Its success is highly dependent on the data you’re working with. The directions with the most statistical variance may not align with the variables that have the most predictive value, even though such approximations are often reasonable to work with.
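The mismatch between variance and predictive value is easy to reproduce. In the sketch below, a made-up high-variance column has nothing to do with the target, while a low-variance column determines it. The column names and scales are assumptions, and the data is deliberately left unstandardized to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000

# Made-up data: one wide but irrelevant feature, one narrow but predictive feature.
noisy_wide = 10.0 * rng.normal(size=n)      # large variance, no predictive value
quiet_signal = 0.5 * rng.normal(size=n)     # small variance, drives the target
y = 4.0 * quiet_signal + 0.1 * rng.normal(size=n)

X = np.column_stack([noisy_wide, quiet_signal])
Xc = X - X.mean(axis=0)                     # centered, not rescaled

_, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T
explained_ratio = S**2 / np.sum(S**2)

# How strongly does each principal component relate to the target?
corr_pc1 = float(np.corrcoef(scores[:, 0], y)[0, 1])
corr_pc2 = float(np.corrcoef(scores[:, 1], y)[0, 1])

print("explained variance ratio:", np.round(explained_ratio, 4))
print("correlation of PC1 with target:", round(corr_pc1, 3))
print("correlation of PC2 with target:", round(corr_pc2, 3))
```

Keeping only the first component here would retain nearly all of the variance and almost none of the predictive signal. Standardizing the columns first would change the picture, which is one reason to validate any reduction against the model's actual performance.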