How to Identify Data for Predictive Analytics - dummies

By Anasse Bari, Mohamed Chaouchi, Tommy Jung

For your predictive analytics project, you’ll need to identify appropriate sources of data, pool data from those sources, and put it in a structured, well-organized format. These tasks can be very challenging and will likely require careful coordination among different data stewards across your organization.
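As a rough illustration of pooling, the sketch below combines two small extracts into one structured record per customer. The source names, columns, and values are all invented for the example, and the helper `pool_sources` is hypothetical; real projects would typically read from files or databases and agree on the join key with the data stewards first.

```python
import csv
import io

# Hypothetical extracts from two data stewards (invented for illustration).
CRM_CSV = """customer_id,zip_code,segment
101,10001,retail
102,,retail
103,94105,business
"""

BILLING_CSV = """customer_id,monthly_spend
101,42.50
103,310.00
104,18.75
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, one per record."""
    return list(csv.DictReader(io.StringIO(text)))


def pool_sources(key, *sources):
    """Combine records from several sources into one row per key value.

    Fields that no source supplies for a key stay None so gaps remain visible.
    """
    # Collect every column name, keeping first-seen order.
    columns = [key]
    for rows in sources:
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)

    pooled = {}
    for rows in sources:
        for row in rows:
            record = pooled.setdefault(row[key], dict.fromkeys(columns))
            for name, value in row.items():
                # Treat empty strings as missing rather than as real values.
                if value != "":
                    record[name] = value
    return [pooled[k] for k in sorted(pooled)]


pooled_rows = pool_sources("customer_id", read_rows(CRM_CSV), read_rows(BILLING_CSV))
for row in pooled_rows:
    print(row)
```

Leaving unmatched fields as `None`, instead of dropping incomplete rows, keeps the gaps visible for the later data-quality work.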

You’ll also need to select the variables you’re going to analyze. This process must take data constraints, project constraints, and business objectives into consideration.

The variables you select must have predictive power. They must also be both valuable and feasible to obtain within your project’s budget and timeframe. For example, if you’re analyzing bank transactions in a criminal investigation, phone records for all parties involved may be relevant to the analysis but not accessible to the analysts.
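One simple first screen for predictive power is to rank candidate variables by how strongly they correlate with the outcome. The sketch below uses the Pearson correlation coefficient on made-up data; the variable names and the helper `rank_candidates` are assumptions for illustration. Correlation only captures linear relationships, so treat a low score as a prompt for a closer look, not as proof that a variable is useless.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def rank_candidates(candidates, target):
    """Rank variables by absolute correlation with the target, strongest first."""
    scores = {name: pearson(values, target) for name, values in candidates.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Made-up data: 1 means the transaction was flagged, 0 means it was not.
target = [0, 0, 1, 1, 1, 0]
candidates = {
    "amount": [20, 35, 900, 750, 820, 40],
    "hour_of_day": [9, 14, 3, 2, 4, 11],
    "branch_id": [1, 2, 1, 2, 1, 2],
}

ranking = rank_candidates(candidates, target)
for name, score in ranking:
    print(f"{name}: {score:+.2f}")
```

Ranking by absolute value matters here: a strongly negative correlation (as with `hour_of_day` in this toy data) is just as informative as a strongly positive one.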

Expect to spend considerable time on this phase of the project. Data collection, data analysis, and the process of addressing data content, quality, and structure can add up to a time-consuming to-do list.

During data identification, it helps to understand your data and its properties; this knowledge will help you choose which algorithm to use to build your model. For example, time series data can be analyzed with regression algorithms, whereas classification algorithms are used when the outcome you’re predicting is discrete (a category rather than a continuous number).
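To make the link between data properties and algorithm choice concrete, here is a minimal heuristic sketch that inspects the values you want to predict and suggests a model family. The function name `suggest_model_family` and the `max_classes` cutoff of 10 are assumptions chosen for the example, not an established rule; borderline cases (such as ordinal ratings) still call for judgment.

```python
from numbers import Real


def suggest_model_family(target_values, max_classes=10):
    """Suggest 'classification' or 'regression' from the target's values.

    Rough heuristic: non-numeric targets, or numeric targets with only a few
    distinct values, are treated as discrete; anything else as continuous.
    """
    observed = [v for v in target_values if v is not None]
    if not observed:
        raise ValueError("target has no observed values")

    distinct = set(observed)
    # bool is a subclass of int in Python, so exclude it explicitly.
    all_numeric = all(
        isinstance(v, Real) and not isinstance(v, bool) for v in observed
    )
    if not all_numeric or len(distinct) <= max_classes:
        return "classification"
    return "regression"


# Made-up examples.
churn_labels = ["churn", "stay", "stay", "churn"]
fraud_flags = [0, 1, 1, 0, 0]
monthly_sales = [100.0 + 1.5 * month for month in range(24)]

print(suggest_model_family(churn_labels))
print(suggest_model_family(fraud_flags))
print(suggest_model_family(monthly_sales))
```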

Variable selection is affected by how well you understand the data. Don’t be surprised if you have to look at and evaluate hundreds of variables, at least at first. Fortunately, as you work with those variables and start gaining key insights, you’ll narrow them down to a few dozen. Also, expect your variable selection to change as your understanding of the data changes throughout the project.

You may find it beneficial to build a data inventory that you can use to track what you know, what you don’t know, and what might be missing. The data inventory should include a listing of the various data elements and any attributes that are relevant in the subsequent steps of the process.

For example, you may want to document whether any segments are missing zip codes or missing records for a specific period of time.
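A data inventory like the one described above can start as a small per-segment summary. The sketch below, built on invented records, counts missing zip codes and lists the months for which a segment has no records at all. The field names and the helper `build_inventory` are hypothetical.

```python
from collections import defaultdict

# Made-up records; "month" uses the form YYYY-MM.
records = [
    {"segment": "retail", "zip_code": "10001", "month": "2023-01"},
    {"segment": "retail", "zip_code": "", "month": "2023-02"},
    {"segment": "retail", "zip_code": "10003", "month": "2023-04"},
    {"segment": "business", "zip_code": "94105", "month": "2023-01"},
    {"segment": "business", "zip_code": "94107", "month": "2023-02"},
]
expected_months = ["2023-01", "2023-02", "2023-03", "2023-04"]


def build_inventory(records, expected_months):
    """Summarize, per segment, record counts, missing zip codes, and empty months."""
    by_segment = defaultdict(list)
    for record in records:
        by_segment[record["segment"]].append(record)

    inventory = {}
    for segment, rows in by_segment.items():
        months_seen = {row["month"] for row in rows}
        inventory[segment] = {
            "records": len(rows),
            "missing_zip_codes": sum(1 for row in rows if not row["zip_code"]),
            "missing_months": [m for m in expected_months if m not in months_seen],
        }
    return inventory


inventory = build_inventory(records, expected_months)
for segment, summary in inventory.items():
    print(segment, summary)
```

Passing the expected periods in explicitly is a deliberate choice: a month with no records in any segment would otherwise go unnoticed.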

Your go-to people for business knowledge (also known as domain knowledge experts) will help you select the key variables that can positively influence the results of your project. They can explain why these variables matter, as well as where and how to get them, among other valuable input.