How to Prepare Data for a Predictive Analysis Model
Once you’ve defined the objectives of your predictive analysis model, the next step is to identify and prepare the data you’ll use to build it. The general sequence of steps looks like this:
Identify your data sources.
Data could be in different formats or reside in various locations.
Identify how you will access that data.
Sometimes you’ll need to acquire third-party data or data owned by a different division in your organization.
Consider which variables to include in your analysis.
One standard approach is to start off with a wide range of variables and eliminate the ones that offer no predictive value for the model.
Determine whether to use derived variables.
In many cases, a derived variable (such as the price-to-earnings ratio used to analyze stock prices) has a greater direct impact on the model than the raw variables it is computed from.
Explore the quality of your data, seeking to understand both its state and limitations.
The accuracy of the model’s predictions is directly related to the variables you select and the quality of your data. You’ll want to answer some data-specific questions at this point:
Is the data complete?
Does it have any outliers?
Does the data need cleansing?
Do you need to fill in missing values, keep them as they are, or eliminate them altogether?
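As a rough illustration of the questions above, the following Python sketch checks a small dataset for completeness and outliers, fills in missing values, and then adds a derived variable. Everything in it is an assumption made for illustration: the records, the field names (price, eps, pe_ratio), the helper functions, the choice of the median as a fill value, and the common 1.5 × IQR rule of thumb for flagging outliers. Your own data may call for different choices.

```python
from statistics import median, quantiles

# Hypothetical stock records; None marks a missing value.
stocks = [
    {"ticker": "AAA", "price": 50.0, "eps": 2.5},
    {"ticker": "BBB", "price": 30.0, "eps": 1.5},
    {"ticker": "CCC", "price": None, "eps": 2.0},
    {"ticker": "DDD", "price": 40.0, "eps": 2.0},
    {"ticker": "EEE", "price": 900.0, "eps": 3.0},
    {"ticker": "FFF", "price": 45.0, "eps": 1.8},
    {"ticker": "GGG", "price": 35.0, "eps": 1.4},
]


def count_missing(rows, field):
    """Is the data complete? Count rows where the field is missing."""
    return sum(1 for row in rows if row[field] is None)


def iqr_outliers(values, k=1.5):
    """Does it have outliers? Flag values beyond k * IQR from the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def fill_missing_with_median(rows, field):
    """One cleansing option: replace missing values with the median."""
    fill = median(row[field] for row in rows if row[field] is not None)
    return [
        {**row, field: fill if row[field] is None else row[field]}
        for row in rows
    ]


def add_pe_ratio(rows):
    """Derived variable: price-to-earnings ratio (None if not computable)."""
    return [
        {**row, "pe_ratio": row["price"] / row["eps"] if row["eps"] else None}
        for row in rows
    ]


known_prices = [row["price"] for row in stocks if row["price"] is not None]
print("missing prices:", count_missing(stocks, "price"))
print("outlier prices:", iqr_outliers(known_prices))

cleaned = add_pe_ratio(fill_missing_with_median(stocks, "price"))
print("first record:", cleaned[0])
```

Whether a flagged value such as the 900.0 price is an error to correct or a genuine observation to keep is a judgment call that depends on the business problem; the sketch only surfaces the candidates.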
Understanding your data and its properties can help you choose the algorithm that will be most useful in building your model. For example:
Regression algorithms can be used to analyze time-series data.
Classification algorithms can be used to analyze discrete data.
Association algorithms can be used for data with correlated attributes.
The dataset used to train and test the model must contain the business information needed to answer the problem you’re trying to solve. If your goal is, for example, to determine which customers are likely to churn, then the dataset you choose must contain information about customers who have churned in the past in addition to customers who have not.
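A quick sanity check along these lines can be sketched in Python. The customer records, the churned field, and both helper functions are hypothetical names chosen for this example; the point is only to confirm that both outcomes appear in the data and to see roughly how they are balanced.

```python
from collections import Counter

# Hypothetical customer records with a boolean "churned" label.
customers = [
    {"id": 1, "months_active": 3, "churned": True},
    {"id": 2, "months_active": 24, "churned": False},
    {"id": 3, "months_active": 5, "churned": True},
    {"id": 4, "months_active": 36, "churned": False},
    {"id": 5, "months_active": 18, "churned": False},
]


def has_both_outcomes(rows, label_field="churned"):
    """True only if the data contains churned and non-churned customers."""
    return {row[label_field] for row in rows} >= {True, False}


def class_balance(rows, label_field="churned"):
    """Share of rows carrying each label value."""
    counts = Counter(row[label_field] for row in rows)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


print("both outcomes present:", has_both_outcomes(customers))
print("label shares:", class_balance(customers))
```

If has_both_outcomes returns False, the dataset cannot teach a model to tell the two groups apart, no matter which algorithm you choose.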
Some models created to mine data and make sense of its underlying relationships — for example, those built with clustering algorithms — need not have a particular end result in mind.
Two problems arise when dealing with data as you’re building your model: underfitting and overfitting.
Underfitting is when your model can’t detect any relationships in your data. This is usually an indication that essential variables — those with predictive power — weren’t included in your analysis. For example, a stock analysis that includes only data from a bull market (where overall stock prices are going up) doesn’t account for crises or bubbles that can bring major corrections to the overall performance of stocks.
Failing to include data that spans both bull and bear markets (when overall stock prices are falling) keeps the model from producing the best possible portfolio selection.
Overfitting is when your model picks up patterns that have no predictive power and are specific only to the dataset you’re analyzing. Noise (random variations in the dataset) can find its way into the model, such that running the model on a different dataset produces a major drop in the model’s predictive performance and accuracy. The accompanying sidebar provides an example.
If your model performs well on the dataset it was built on but underperforms when you test it on a different dataset, suspect overfitting.
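That rule of thumb can be sketched in Python by scoring a model on the data it was built on and on separate held-out data, then comparing the two scores. The example below is a minimal sketch under stated assumptions: the tiny dataset is made up, the "memorizer" is a deliberately overfit stand-in for a real model, and the thresholds in diagnose_fit (0.7 and 0.15) are arbitrary illustrative values, not recommended settings.

```python
def accuracy(model, rows):
    """Fraction of (features, label) rows the model predicts correctly."""
    correct = sum(1 for features, label in rows if model(features) == label)
    return correct / len(rows)


def make_memorizer(train_rows, default_label):
    """A deliberately overfit model: it memorizes every training row,
    including the customer id, which carries no predictive power."""
    table = {features: label for features, label in train_rows}
    return lambda features: table.get(features, default_label)


def diagnose_fit(train_score, test_score, min_train_score=0.7, max_gap=0.15):
    """Rough diagnosis from two scores; thresholds are illustrative only."""
    if train_score < min_train_score:
        return "possible underfitting"
    if train_score - test_score > max_gap:
        return "possible overfitting"
    return "no obvious problem"


# Features are (customer_id, has_complaint); only has_complaint is signal.
train = [((1, 1), "churn"), ((2, 0), "stay"), ((3, 1), "churn"), ((4, 0), "stay")]
test = [((5, 1), "churn"), ((6, 0), "stay"), ((7, 1), "churn"), ((8, 0), "stay")]

memorizer = make_memorizer(train, default_label="stay")


def simple_rule(features):
    """A model that uses only the variable with predictive power."""
    return "churn" if features[1] == 1 else "stay"


for name, model in [("memorizer", memorizer), ("simple rule", simple_rule)]:
    train_acc = accuracy(model, train)
    test_acc = accuracy(model, test)
    print(name, train_acc, test_acc, diagnose_fit(train_acc, test_acc))
```

The memorizer scores perfectly on the rows it has seen and falls back to a default guess on everything else, which is the pattern to watch for; the simple rule scores the same on both sets because it relies only on a variable that generalizes.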