In this exploration phase of predictive analysis, you’ll gain intimate knowledge of your data — which in turn will help you choose the relevant variables to analyze. This understanding will also help you evaluate the results of your model. But first you have to identify and clean the data for analysis.

How to generate derived data

Derived attributes are entirely new records constructed from one or more existing attributes. An example would be the creation of records identifying books that are bestsellers at book fairs. Raw data may not capture such records, but for modeling purposes those derived records can be important. The price-to-earnings ratio and the 200-day moving average are two examples of derived data that are heavily used in financial applications.
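To make the two financial examples concrete, here is a minimal Python sketch. The function names and the choice to return None for non-positive earnings or for an incomplete window are assumptions made for illustration, not a prescribed method.

```python
from collections import deque


def price_to_earnings(price, earnings_per_share):
    """Derive a P/E ratio from two raw attributes.

    Assumption: the ratio is treated as undefined (None) when
    earnings per share is zero or negative.
    """
    if earnings_per_share <= 0:
        return None
    return price / earnings_per_share


def moving_average(prices, window):
    """Derive a trailing moving average from a series of raw prices.

    Returns one value per input price; positions that do not yet have
    a full window of history are reported as None.
    """
    averages = []
    recent = deque()
    running_total = 0.0
    for price in prices:
        recent.append(price)
        running_total += price
        if len(recent) > window:
            # Drop the oldest price so only `window` prices remain.
            running_total -= recent.popleft()
        if len(recent) == window:
            averages.append(running_total / window)
        else:
            averages.append(None)
    return averages
```

Calling moving_average(daily_closes, 200) on a list of daily closing prices would give a 200-day moving average under these assumptions.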

Derived attributes can be obtained from a simple calculation, such as deducing age from birth date. Derived attributes can also be computed by summarizing information from multiple records.
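As a small illustration of the age example, the following sketch derives age in whole years from a birth date. The function name age_on and the explicit as_of parameter are assumptions; passing the reference date explicitly keeps the result reproducible.

```python
from datetime import date


def age_on(birth_date, as_of):
    """Return age in whole years on the date `as_of`."""
    years = as_of.year - birth_date.year
    # Subtract one year if the birthday has not yet occurred in the as_of year.
    if (as_of.month, as_of.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years
```

For a live dataset you might call age_on(customer_birth_date, date.today()).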

For example, converting a table of customers and their purchased books into a summary table can enable you to track the number of books sold via a recommender system, through targeted marketing, and at a book fair, and to identify the demographic of customers who bought those books.
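One way to picture this kind of summary is the sketch below. The record layout, the "channel" and "age_group" field names, and the sample values are all hypothetical, invented only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical raw records: one row per purchase.
PURCHASES = [
    {"customer": "c1", "book": "Book A", "channel": "recommender"},
    {"customer": "c2", "book": "Book A", "channel": "book fair"},
    {"customer": "c1", "book": "Book B", "channel": "book fair"},
    {"customer": "c3", "book": "Book C", "channel": "targeted marketing"},
]

# Hypothetical customer attributes keyed by customer ID.
CUSTOMERS = {
    "c1": {"age_group": "18-34"},
    "c2": {"age_group": "35-54"},
    "c3": {"age_group": "18-34"},
}


def summarize_by_channel(purchases, customers):
    """Summarize many purchase records into per-channel derived attributes."""
    books_sold = Counter()          # channel -> number of books sold
    buyers = defaultdict(Counter)   # channel -> age group -> purchase count
    for purchase in purchases:
        channel = purchase["channel"]
        age_group = customers[purchase["customer"]]["age_group"]
        books_sold[channel] += 1
        buyers[channel][age_group] += 1
    return books_sold, buyers
```

The two returned mappings are derived data: neither the per-channel totals nor the per-channel demographic breakdown exists in any single raw record.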

Generating such additional attributes can bring additional predictive power to the analysis. In fact, many such attributes are created specifically to probe their potential predictive power. Some predictive models may use more derived attributes than raw attributes. If some derived attributes prove especially predictive and relevant, then it makes sense to automate the process that generates them.

Derived records are new records that bring in new information and provide new ways of presenting raw data; they can be of tremendous value to predictive modeling.

How to reduce the dimensionality of your data

The data used in predictive models is usually pooled from multiple sources. Your analysis can draw from data scattered across multiple data formats, files, and databases, or multiple tables within the same database. Pooling the data together and combining it into an integrated format for the data modelers to use is essential.
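As a rough sketch of pooling, the example below combines a CSV export with rows shaped like a database query result into one integrated list of records. Both sources, their column names, and the decision to keep unmatched purchases with empty customer fields are assumptions made for illustration.

```python
import csv
import io

# Hypothetical source 1: a CSV export of customers.
CUSTOMERS_CSV = """customer_id,name,age
1,Ana,34
2,Ben,51
"""

# Hypothetical source 2: (customer_id, title) rows, as a database query might return.
PURCHASE_ROWS = [
    (1, "Book A"),
    (1, "Book B"),
    (3, "Book C"),  # no matching customer in the CSV
]


def integrate(customers_csv, purchase_rows):
    """Combine both sources into one list of uniformly shaped records."""
    reader = csv.DictReader(io.StringIO(customers_csv))
    customers = {int(row["customer_id"]): row for row in reader}
    combined = []
    for customer_id, title in purchase_rows:
        customer = customers.get(customer_id)
        combined.append({
            "customer_id": customer_id,
            # Unmatched purchases are kept, with customer fields left empty.
            "name": customer["name"] if customer else None,
            "age": int(customer["age"]) if customer else None,
            "title": title,
        })
    return combined
```

Keeping unmatched rows visible, rather than silently dropping them, makes gaps between sources easier to spot during cleaning.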

If your data contains any hierarchical content, it may need to be flattened. Some data has hierarchical characteristics, such as parent-child relationships or a record that is made up of other records. For example, a product such as a car may have multiple makers; flattening data, in this case, means including each maker as an additional feature of the record you’re analyzing.

Flattening data is essential when it is merged from multiple related records to form a better picture.

For example, analyzing adverse events for several drugs made by several companies may require that the data be flattened at the substance level. By doing so, you remove the one-to-many relationships (in this case, many makers and many substances for one product) that can cause excessive duplication, where every substance entry repeats the same product and maker information.
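The car example can be sketched as follows. The "makers" field name, the maker_1, maker_2 column naming, and the fixed max_makers limit are assumptions chosen for illustration.

```python
def flatten_makers(record, max_makers):
    """Turn a nested list of makers into separate flat features.

    Assumption: makers beyond `max_makers` are dropped, and missing
    makers are filled with None so every record has the same columns.
    """
    flat = {key: value for key, value in record.items() if key != "makers"}
    makers = record.get("makers", [])
    for index in range(max_makers):
        column = f"maker_{index + 1}"
        flat[column] = makers[index] if index < len(makers) else None
    return flat
```

A record such as {"product": "car-123", "makers": ["Maker X", "Maker Y"]} becomes a single flat row with one column per maker, which is the shape most modeling tools expect.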

Flattening reduces the dimensionality of the data, which is represented by the number of features a record or an observation has.

For example, a customer can have the following features: name, age, address, items purchased. When you start your analysis, you may find yourself evaluating records with many features, only some of which are important to the analysis. So you should eliminate all but the very few features that have the most predictive power for your specific project.
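One simple way to shortlist features, shown here only as an illustrative filter and not as the required method, is to rank numeric features by the absolute value of their Pearson correlation with the target and keep the top few. The function names and the row layout are assumptions; non-numeric features such as name or address would need to be encoded or handled separately.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return covariance / math.sqrt(var_x * var_y)


def top_features(rows, target, k):
    """Return the k numeric features most correlated (in absolute value) with the target."""
    features = [name for name in rows[0] if name != target]
    target_values = [row[target] for row in rows]
    scores = {
        name: abs(pearson([row[name] for row in rows], target_values))
        for name in features
    }
    return sorted(features, key=lambda name: scores[name], reverse=True)[:k]
```

Correlation captures only linear, one-feature-at-a-time relationships, so treat the ranking as a starting point for narrowing the list rather than a final verdict.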

Reducing the dimensionality of the data can be achieved by putting all the data in a single table that uses multiple columns to represent attributes of interest. At the beginning of the analysis, of course, the analyst has to evaluate a large number of columns, but that number can be narrowed down as the analysis progresses.

This process can be aided by reconstituting the fields — for example, by grouping the data in categories that have similar characteristics.
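Grouping values into categories can be sketched like this. The cut points and labels are hypothetical; in practice you would choose them to suit your data and project.

```python
from bisect import bisect_right

# Hypothetical cut points and labels for grouping ages into bands.
AGE_EDGES = [18, 35, 55]
AGE_LABELS = ["under 18", "18-34", "35-54", "55+"]


def age_band(age):
    """Map an exact age to a coarser category with similar characteristics."""
    # bisect_right counts how many edges are <= age, which is the label index.
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]
```

Replacing an exact age with a band such as "18-34" turns many distinct values into a handful of categories, which can make patterns easier to detect.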

The resultant dataset — the cleaned dataset — is usually put in a separate database for the analysts to use. During the modeling process, this data should be easily accessed, managed, and kept up to date.