
How to Extract, Transform, and Load Data for Predictive Analysis

By Anasse Bari, Mohamed Chaouchi, Tommy Jung

After it’s initially collected, data is usually in a dispersed state: it resides in multiple systems or databases, and before you can use it for a predictive analytics model, you have to consolidate it into one place. Also, you don’t want to work directly on data that resides in operational systems — that’s asking for trouble.

Instead, place a portion of it somewhere you can work on it freely without affecting operations. ETL (extract, transform, and load) is the process that achieves that desirable state.

Many organizations have multiple databases; your predictive model will likely use data from all of them. ETL is the process that collects all the information needed and places it in a separate environment where you can run your analysis.

ETL is not, however, a once-and-for-all operation; usually it’s an ongoing process that refreshes the data and keeps it up to date. Be sure you run your ETL processes at night or at other times when the load on the operational system is low.

  • The extraction step collects the desired data in its raw form from operational systems.

  • The transformation step makes the collected data ready to be used in your predictive model — merging it, generating the desired derived attributes, and putting the transformed data in the appropriate format to fit business requirements.

  • The loading step places the data in its designated location, where you can run your analysis on it — for example, in a data mart, data warehouse, or another database.
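The three steps above can be sketched as a minimal Python pipeline. This is an illustrative sketch, not a production ETL tool: the table names, columns, and the derived total_spend attribute are assumptions, and in-memory SQLite databases stand in for the operational system and the data mart.

```python
import sqlite3

# Toy "operational" database standing in for a real source system.
ops = sqlite3.connect(":memory:")
ops.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bob');
    INSERT INTO orders VALUES (1, 120.0), (1, 80.0), (2, 30.0);
""")

def extract(conn):
    """Extract: pull the raw rows from the operational system."""
    customers = conn.execute("SELECT id, name FROM customers").fetchall()
    orders = conn.execute("SELECT customer_id, amount FROM orders").fetchall()
    return customers, orders

def transform(customers, orders):
    """Transform: merge the two sources and derive a total_spend attribute."""
    totals = {}
    for customer_id, amount in orders:
        totals[customer_id] = totals.get(customer_id, 0.0) + amount
    return [(cid, name, totals.get(cid, 0.0)) for cid, name in customers]

def load(rows, warehouse):
    """Load: place the transformed rows in the analysis environment."""
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS customer_features "
        "(id INTEGER, name TEXT, total_spend REAL)")
    warehouse.executemany("INSERT INTO customer_features VALUES (?, ?, ?)", rows)
    warehouse.commit()

warehouse = sqlite3.connect(":memory:")  # stand-in for a data mart
load(transform(*extract(ops)), warehouse)
print(warehouse.execute("SELECT * FROM customer_features ORDER BY id").fetchall())
# → [(1, 'Ada', 200.0), (2, 'Bob', 30.0)]
```

In a real pipeline the extract step would read from the production databases over a connection, and the load step would write to the data mart or warehouse where the analysis runs.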

You should follow a systematic approach to building your ETL processes so they fulfill the business requirements. It’s good practice to keep a copy of the original data in a separate area so you can always go back to it in case an error disrupts the transformation or loading steps.
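Keeping that copy of the original data can be as simple as writing each raw extract to a timestamped snapshot in a staging area before any transformation runs. A minimal sketch (the staging directory name and JSON format are illustrative choices, not requirements):

```python
import json
import pathlib
import time

def stage_raw(records, staging_dir="staging"):
    """Write an untouched snapshot of the raw extract to disk, so a failed
    transform or load can be rerun from this copy instead of re-extracting."""
    path = pathlib.Path(staging_dir)
    path.mkdir(exist_ok=True)
    snapshot = path / f"extract_{time.strftime('%Y%m%d_%H%M%S')}.json"
    snapshot.write_text(json.dumps(records))
    return snapshot

raw = [{"id": 1, "amount": 120.0}, {"id": 2, "amount": 30.0}]
snapshot_file = stage_raw(raw)
# The snapshot round-trips to exactly the raw data that was extracted.
assert json.loads(snapshot_file.read_text()) == raw
```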

Your ETL process should incorporate modularity — separating the tasks and accomplishing the work in stages. This approach pays off when you want to reprocess or reload the data, or when you want to reuse some of that data for a different analysis or a different predictive model. The design should also accommodate major changes in business requirements with only minimal modification to the ETL process itself.
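One way to get that modularity is to make each stage an independent function and checkpoint its output, so any stage can be rerun from the previous stage's result without repeating earlier work. A minimal sketch, assuming a simple linear pipeline (the stage bodies here are placeholders):

```python
def run_pipeline(stages, data, start_at=0, cache=None):
    """Run the stages in order, checkpointing each stage's output so a
    later rerun can resume partway through instead of starting over."""
    cache = cache if cache is not None else {}
    for i, stage in enumerate(stages[start_at:], start=start_at):
        data = stage(data)
        cache[i] = data  # checkpoint after each stage
    return data, cache

# Placeholder stages; real ones would query sources and write to a warehouse.
extract = lambda _: [3, 1, 2]
transform = lambda rows: sorted(rows)
load = lambda rows: {"loaded": rows}

result, cache = run_pipeline([extract, transform, load], None)

# Rerun only the load stage, resuming from the cached transform output —
# no need to hit the operational systems again.
result2, _ = run_pipeline([extract, transform, load], cache[1], start_at=2)
assert result == result2 == {"loaded": [1, 2, 3]}
```

The same checkpointed outputs also let you feed an intermediate stage's data into a different analysis or a different predictive model.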