Preparing Your Data for Predictive Analytics

By Dr. Anasse Bari, Mohamed Chaouchi, Tommy Jung

After you’ve defined the objectives of the model, the next step in predictive analytics is to identify and prepare the data you’ll use to build it. This article covers the most important activities. The general sequence of steps looks like this:

  1. Identify your data sources.
    Data could be in different formats or reside in various locations.
  2. Identify how you will access that data.
    Sometimes you need to acquire third-party data or data owned by a different division in your organization.
  3. Consider which variables to include in your analysis.
    One standard approach is to start off with a wide range of variables and eliminate the ones that offer no predictive value for the model.
  4. Determine whether to use derived variables.
    In many cases, a derived variable (such as the price-to-earnings ratio used to analyze stock prices) has a greater direct impact on the model than the raw variables it is computed from.
  5. Explore the quality of your data, seeking to understand both its state and limitations.
    The accuracy of the model’s predictions is directly related to the variables you select and the quality of your data. You’ll want to answer some data-specific questions at this point:

    • Is the data complete?
    • Does it have any outliers?
    • Does the data need cleansing?
    • Do you need to fill in missing values, keep them as they are, or eliminate them altogether?
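
The data-quality questions in step 5 can be sketched in a few lines of code. The example below is a minimal illustration, not a prescribed method: it assumes a single numeric column in which None marks a missing value, it flags outliers with the common 1.5 × IQR rule (one of several possible rules), and it fills gaps with the median (one of several possible policies). The function names and the monthly_spend figures are hypothetical.

```python
from statistics import median, quantiles

def quality_report(values):
    """Summarize completeness and flag outliers for one numeric column.

    `values` is a list in which None marks a missing entry.
    """
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)

    # Tukey's rule: flag points more than 1.5 * IQR outside the quartiles.
    q1, _, q3 = quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in present if v < low or v > high]

    return {
        "missing": missing,
        "complete_ratio": len(present) / len(values),
        "outliers": outliers,
        "median": median(present),
    }

def fill_missing(values, fill):
    """Replace missing entries with `fill` (only one possible policy)."""
    return [fill if v is None else v for v in values]

# Hypothetical monthly spend figures, invented for illustration.
monthly_spend = [42.0, 39.5, None, 41.0, 40.5, 400.0, 38.0, None, 43.5, 40.0]
report = quality_report(monthly_spend)
cleaned = fill_missing(monthly_spend, report["median"])
```

Whether a flagged value such as 400.0 is an error to cleanse or a real event to keep is a business decision; the code only surfaces the question.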

Understanding your data and its properties can help you choose the algorithm that will be most useful in building your model. For example:

  • Regression algorithms can be used to analyze time-series data.
  • Classification algorithms can be used to analyze discrete data.
  • Association algorithms can be used for data with correlated attributes.

Individual algorithms and predictive techniques have different strengths and weaknesses. Most importantly, the accuracy of the model relies on having both enough data and data of high quality. Your data should have a sufficient number of records to provide statistically meaningful results.

Gathering relevant data (preferably many records over a long time period), preprocessing it, and extracting the features with the most predictive value is where you’ll spend the majority of your time. But you still have to choose the algorithm wisely; it should be suited to the business problem.
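
One rough way to start with a wide range of variables and narrow them down is to score each candidate against the target. The sketch below uses the absolute Pearson correlation as a stand-in for "predictive value"; that is an assumption made for illustration, since correlation captures only linear relationships and real projects often use other measures. The column names, values, and the 0.3 cutoff are all hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)

def rank_variables(columns, target):
    """Return (name, |r|) pairs sorted from strongest to weakest."""
    scores = {name: abs(pearson(col, target)) for name, col in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical candidate variables; the values are invented for illustration.
columns = {
    "support_calls": [0, 1, 1, 2, 3, 4, 5, 6],
    "account_age":   [5, 3, 8, 1, 7, 2, 6, 4],
    "region_code":   [1, 1, 1, 1, 1, 1, 1, 1],
}
churn_score = [0.1, 0.2, 0.2, 0.4, 0.5, 0.7, 0.8, 0.9]

ranking = rank_variables(columns, churn_score)
# The 0.3 threshold is arbitrary; tune it for your own project.
keep = [name for name, score in ranking if score >= 0.3]
```

Here only support_calls survives the cut; region_code never varies, so it cannot help the model distinguish one record from another.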

Data preparation is specific to the project you’re working on and the algorithm you choose to employ. You prepare your data according to the project’s requirements and feed it to the algorithm as you build your model to address the business needs.

The dataset used to train and test the model must contain the business information needed to solve the problem you’re working on. If your goal is (for example) to determine which customers are likely to churn, then the dataset you choose must contain information about customers who have churned in the past in addition to customers who have not.
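
The churn requirement can be checked before any modeling starts. The following sketch assumes each customer is a dictionary with a boolean "churned" field (a hypothetical layout). It refuses to proceed when only one outcome is present, and it splits the records so that both the training set and the test set keep a similar share of churned and retained customers.

```python
import random

def stratified_split(records, label_key, test_fraction=0.25, seed=0):
    """Split records into train and test sets, keeping each label's share similar."""
    rng = random.Random(seed)
    by_label = {}
    for rec in records:
        by_label.setdefault(rec[label_key], []).append(rec)
    if len(by_label) < 2:
        raise ValueError("dataset needs examples of both outcomes")

    train, test = [], []
    for group in by_label.values():
        group = group[:]  # copy so the caller's list is not reordered
        rng.shuffle(group)
        n_test = max(1, round(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Hypothetical customer records: 4 who churned and 8 who stayed.
customers = [{"id": i, "churned": i % 3 == 0} for i in range(12)]
train, test = stratified_split(customers, "churned")
```

With 12 records and a test fraction of 0.25, the test set holds one churned and two retained customers, and the remaining nine records go to training.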

Some models created to mine data and make sense of its underlying relationships — for example, those built with clustering algorithms — needn’t have a particular end result in mind.

Underfitting

Underfitting is when your model can’t detect any relationships in your data. This is usually an indication that essential variables — those with predictive power — weren’t included in your analysis.

If the variables used in your model don’t have high predictive power, then try adding new domain-specific variables and re-running your model. The end goal is to improve the performance of the model on the training data.
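
To see what that improvement looks like, compare the training error before and after adding a variable that actually drives the outcome. The sketch below fits a one-variable least-squares line twice. The data is hypothetical and built so that sales depend entirely on ad spend and not on the weekday; real data will never be this clean.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    return mean_y - b * mean_x, b

def training_mse(xs, ys):
    """Mean squared error of the fitted line on the data it was trained on."""
    a, b = fit_line(xs, ys)
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: sales are driven by ad spend, not by the weekday index.
weekday = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
ad_spend = [10, 40, 20, 50, 30, 45, 15, 35, 25, 5]
sales = [2 * s + 7 for s in ad_spend]

weak = training_mse(weekday, sales)     # variable with little predictive power
strong = training_mse(ad_spend, sales)  # domain-specific variable added
```

A large training error that barely moves no matter how you tune the model (the weak case) is the signature of underfitting; the error collapses once the informative variable is included.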

Another issue to watch for is seasonality: when your data has a seasonal pattern, failing to analyze multiple seasons can get you into trouble. For example, a stock analysis that includes only data from a bull market (where overall stock prices are going up) doesn’t account for crises or bubbles that can bring major corrections to the overall performance of stocks. Failing to include data that spans both bull and bear markets (when overall stock prices are falling) keeps the model from producing the best possible portfolio selection.

Overfitting

Overfitting is when your model includes data that has no predictive power and is specific only to the dataset you’re analyzing. Noise (random variations in the dataset) can find its way into the model, so that running the model on a different dataset produces a major drop in the model’s predictive performance and accuracy.
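
A quick way to spot overfitting is to score the model on data it has never seen and compare that with its training score. The sketch below makes the point with an extreme, hypothetical case: a "memorizer" that simply returns the outcome of the closest training record, noise included, set against a plain least-squares line. The data is synthetic (a linear trend plus random noise from a fixed seed), so treat it as an illustration under those assumptions rather than a benchmark.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return mean_y - (sxy / sxx) * mean_x, sxy / sxx

def mse(predict, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)

def noisy(x):
    """Synthetic outcome: a linear trend plus random noise."""
    return 2 * x + rng.gauss(0, 3)

train_x = [float(i) for i in range(200)]
train_y = [noisy(x) for x in train_x]
test_x = [x + 0.5 for x in train_x]  # new records the model has not seen
test_y = [noisy(x) for x in test_x]

def memorizer(x):
    """Overfit model: copy the outcome of the nearest training record."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

a, b = fit_line(train_x, train_y)

def line(x):
    """Simple model: the fitted straight line."""
    return a + b * x

memo_train, memo_test = mse(memorizer, train_x, train_y), mse(memorizer, test_x, test_y)
line_train, line_test = mse(line, train_x, train_y), mse(line, test_x, test_y)
```

The memorizer scores a perfect zero error on the training data because it has absorbed the noise, and then does noticeably worse on the new records. The line's training and test errors stay close to each other, which is what you want to see.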