Structuring Your Data for Predictive Analytics
Raw data is a potential resource for predictive analytics, but it can’t be usefully analyzed until it’s been given a consistent structure. Data residing in multiple systems has to be collected and transformed to get it ready for analysis. The collected data should reside in a separate system so it won’t interfere with the live production system. While building your model, split your dataset into a training dataset to train the model, and a test dataset to validate the model.
Extracting, transforming and loading your data
After it’s initially collected, data is usually in a dispersed state; it resides in multiple systems or databases. Before you can use it for a predictive analytics model, you have to consolidate it into one place. Also, you don’t want to work on data that resides in operational systems — that’s asking for trouble. Instead, place a portion of it somewhere where you can work on it freely without affecting operations. ETL (extract, transform and load) is the process that achieves that desirable state.
Many organizations have multiple databases; your predictive model will likely utilize data from all of them. ETL is the process that collects all the information needed and places it in a separate environment where you can run your analysis. ETL is not, however, a once-and-for-all operation; usually it’s an ongoing process that refreshes the data and keeps it up to date. Be sure you run your ETL processes at night or at other times when the load on the operational system is low.
- The extraction step collects the desired data in its raw form from operational systems.
- The transformation step makes the collected data ready to be used in your predictive model — merging it, generating the desired derived attributes, and putting the transformed data in the appropriate format to fit business requirements.
- The loading step places the data in its designated location, where you can run your analysis on it — for example, in a data mart, data warehouse, or another database.
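The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the two in-memory SQLite databases stand in for operational systems, and the table names, column names, and the derived `total_spent` attribute are all assumptions made up for the example.

```python
import sqlite3

def extract(conn, query):
    """Extraction: pull raw rows from an operational source, unchanged."""
    return conn.execute(query).fetchall()

def transform(customers, orders):
    """Transformation: merge the two sources and derive a new attribute."""
    totals = {}
    for customer_id, amount in orders:
        totals[customer_id] = totals.get(customer_id, 0.0) + amount
    # One row per customer: (id, name, total_spent); total_spent is derived.
    return [(cid, name, totals.get(cid, 0.0)) for cid, name in customers]

def load(conn, rows):
    """Loading: write the prepared rows into the separate analysis database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_features "
        "(customer_id INTEGER PRIMARY KEY, name TEXT, total_spent REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO customer_features VALUES (?, ?, ?)", rows)
    conn.commit()

# Two stand-in "operational" systems (hypothetical, in-memory for the example).
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ana"), (2, "Ben")])

sales = sqlite3.connect(":memory:")
sales.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
sales.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 20.0), (1, 5.5), (2, 12.0)])

# The analysis store is a separate database, so analysis queries never hit the sources.
analysis = sqlite3.connect(":memory:")

load(
    analysis,
    transform(
        extract(crm, "SELECT id, name FROM customers ORDER BY id"),
        extract(sales, "SELECT customer_id, amount FROM orders"),
    ),
)
features = analysis.execute(
    "SELECT customer_id, name, total_spent FROM customer_features ORDER BY customer_id"
).fetchall()
print(features)  # [(1, 'Ana', 25.5), (2, 'Ben', 12.0)]
```

Keeping each step in its own function means the extraction queries, the transformation rules, and the load target can each change without touching the other two.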
You should follow a systematic approach to build your ETL processes to fulfill the business requirements. It’s a good practice to keep a copy of the original data in a separate area so you can always go back to it in case an error disrupts the transformation or the loading steps of the processes. The copy of the original data serves as a backup that you can use to rebuild the entire dataset employed by your analysis if necessary. The goal is to head off Murphy’s Law and get back on your feet quickly if you have to rerun the entire ETL process from scratch.
Your ETL process should incorporate modularity — separating the tasks and accomplishing the work in stages. This approach has advantages in case you want to reprocess or reload the data, or if you want to use some of that data for a different analysis or to build different predictive models. The design of your ETL should be able to accommodate even major business requirement changes — with only minimal changes to your ETL process.
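One way to picture the backup copy and the staged, modular design together is the sketch below. It is only an illustration under assumptions: the stage names (`clean`, `add_amount_band`), the `raw_archive` list standing in for a separate staging area, and the sample rows are all hypothetical.

```python
import copy

def keep_raw_copy(extracted, archive):
    """Store an untouched copy of the extracted rows before any transformation."""
    archive.append(copy.deepcopy(extracted))
    return extracted

def clean(rows):
    """Stage 1: drop rows with a missing amount."""
    return [r for r in rows if r["amount"] is not None]

def add_amount_band(rows):
    """Stage 2: derive a coarse 'band' attribute from the amount."""
    return [{**r, "band": "high" if r["amount"] >= 100 else "low"} for r in rows]

def run_stages(rows, stages):
    """Run each stage in order; stages can be added, removed, or rerun independently."""
    for stage in stages:
        rows = stage(rows)
    return rows

raw_archive = []  # stand-in for a separate staging area
extracted = [
    {"id": 1, "amount": 150.0},
    {"id": 2, "amount": None},
    {"id": 3, "amount": 40.0},
]

prepared = run_stages(keep_raw_copy(extracted, raw_archive), [clean, add_amount_band])

# If a stage fails or requirements change, rebuild from the archived copy
# instead of going back to the operational systems.
rebuilt = run_stages(copy.deepcopy(raw_archive[-1]), [clean, add_amount_band])
print(rebuilt == prepared)  # True
```

Because the stages are passed in as a list, a different analysis could reuse `clean` with its own derived attributes without changing the rest of the pipeline.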
Keeping the data up to date
After the loading step of ETL, once your data is in that separate database, data mart, or warehouse, you’ll need to keep the data fresh so the modelers can rerun previously built models on new data.
Implementing a data mart for the data you want to analyze and keeping it up to date will enable you to refresh the models. You should, for that matter, refresh the operational models regularly after they’re deployed; new data can increase the predictive power of your models and allow them to capture new insights, trends, and relationships.
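Keeping a data mart fresh does not have to mean reloading everything each time. The sketch below shows one common approach, an incremental refresh that copies only rows newer than the latest one already loaded. The `events` table, its columns, and the timestamp-based "watermark" are assumptions for illustration; a real refresh would also need to handle updated or late-arriving rows, which this sketch ignores.

```python
import sqlite3

SCHEMA = (
    "CREATE TABLE IF NOT EXISTS events "
    "(id INTEGER PRIMARY KEY, recorded_at TEXT, value REAL)"
)

def refresh(source, mart):
    """Copy only the rows that are newer than the latest row already in the mart."""
    mart.execute(SCHEMA)
    # Watermark: the most recent timestamp already loaded ('' if the mart is empty).
    (watermark,) = mart.execute(
        "SELECT COALESCE(MAX(recorded_at), '') FROM events"
    ).fetchone()
    new_rows = source.execute(
        "SELECT id, recorded_at, value FROM events "
        "WHERE recorded_at > ? ORDER BY recorded_at",
        (watermark,),
    ).fetchall()
    mart.executemany("INSERT INTO events VALUES (?, ?, ?)", new_rows)
    mart.commit()
    return len(new_rows)

# Hypothetical operational source with two existing rows (ISO timestamps sort as text).
source = sqlite3.connect(":memory:")
source.execute(SCHEMA)
source.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, "2024-01-01T09:00:00", 3.0), (2, "2024-01-02T09:00:00", 4.5)],
)
mart = sqlite3.connect(":memory:")

first_run = refresh(source, mart)    # loads both existing rows
source.execute("INSERT INTO events VALUES (3, '2024-01-03T09:00:00', 1.2)")
second_run = refresh(source, mart)   # loads only the newly arrived row
third_run = refresh(source, mart)    # nothing new, so nothing is loaded
print(first_run, second_run, third_run)  # 2 1 0
```

A function like `refresh` could be scheduled to run during low-load hours, after which previously built models can be rerun against the updated mart.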
Having a separate environment for the data also allows you to achieve better performance for the systems used to run the models. That’s because you aren’t overloading operational systems with the intensive queries or analysis required for the models to run.
Data keeps on coming — more of it, faster, and in greater variety all the time. Implementing automation and the separation of tasks and environments can help you manage that flood of data and support the real-time response of your predictive models.
To ensure that you’re capturing the data streams and refreshing your models while supporting automated ETL processes, your analytical architecture should be highly modular and adaptive. If you keep this design goal in mind for every part you build for your overall predictive analytics project, the continuous improvement and tweaking that go along with predictive analytics will be easier to maintain and more likely to succeed.
Outlining testing and test data
When your data is ready and you’re about to start building your predictive model, it’s useful to outline your testing methodology and draft a test plan. Testing should be driven by the business goals you’ve gathered and documented, and for which you’ve collected the necessary data.
Right off the bat, you should devise a method to test whether a business goal has been attained successfully. Predictive analytics measures the likelihood of a future outcome, and the only way to prepare a model for that is to train it on past data; even so, you still have to see what it can do when it’s up against future data. Of course, you can’t risk running an untried model on real future data, so you’ll need to use existing data to simulate future data realistically. To do so, you have to split the data you’re working on into training and test datasets.
Be sure that you select these two datasets at random, and that both datasets contain and cover all the data parameters you’re measuring.
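A random split with a simple coverage check might look like the sketch below. The function names, the fixed seed, and the `segment` field are assumptions for illustration; the check only confirms that every value of one chosen field shows up on both sides of the split, which is a rough stand-in for "covering all the data parameters."

```python
import random

def random_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and cut it into training and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def covers_same_values(train, test, field):
    """True if every value of `field` appears in both the training and test sets."""
    return {r[field] for r in train} == {r[field] for r in test}

# Hypothetical dataset: 100 rows spread over three customer segments.
rows = [{"id": i, "segment": "abc"[i % 3]} for i in range(100)]

train, test = random_split(rows)
print(len(train), len(test))
print("segments covered in both sets:", covers_same_values(train, test, "segment"))
```

If the coverage check fails, for example because a rare category landed entirely in one set, reshuffling with a different seed or splitting each category separately are two options worth considering.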
When you split your data into test and training datasets, you give yourself a way to detect overfitting. Overfitting arises when a model is overtrained on the entire dataset and picks up noise patterns or specific features that belong only to the sample dataset and aren’t applicable to other datasets.
Separating your data into training and test datasets, commonly about 70 percent and 30 percent respectively, gives you a realistic measurement of the performance of the predictive analytics model you’re building. You want to evaluate your model against the test data because it’s a straightforward way to measure whether the model’s predictions are accurate. Succeeding here is an indication, though not a guarantee, that the model will succeed when it’s deployed. A test dataset serves as an independent set of data that the model hasn’t yet seen; running your model against it provides a preview of how the model will perform when it goes live.
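To see why the held-out test set matters, consider the deliberately extreme sketch below. The data is hypothetical and the labels are pure noise, so there is nothing real to learn; the `Memorizer` class is an invented stand-in for an overfit model, not a real technique you would deploy.

```python
import random
from collections import Counter

rng = random.Random(0)
# Hypothetical data: each label is a coin flip, unrelated to the customer ID.
rows = [(customer_id, rng.choice([0, 1])) for customer_id in range(1000)]

rng.shuffle(rows)
cut = len(rows) * 7 // 10               # about 70 percent for training
train, test = rows[:cut], rows[cut:]    # the remaining 30 percent is held out

class Memorizer:
    """A deliberately overfit 'model': it memorizes every training row."""

    def fit(self, data):
        self.seen = dict(data)
        self.default = Counter(label for _, label in data).most_common(1)[0][0]
        return self

    def predict(self, key):
        # Unseen keys fall back to the most common training label.
        return self.seen.get(key, self.default)

def accuracy(model, data):
    """Fraction of rows whose predicted label matches the actual label."""
    return sum(model.predict(key) == label for key, label in data) / len(data)

model = Memorizer().fit(train)
train_accuracy = accuracy(model, train)
test_accuracy = accuracy(model, test)
print(f"training accuracy: {train_accuracy:.2f}")
print(f"test accuracy:     {test_accuracy:.2f}")
```

The memorizer scores perfectly on the rows it was trained on, while its score on the held-out rows should land near chance. Evaluated on the training data alone, the model would look flawless; only the test dataset reveals that it has learned nothing that generalizes.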